net: ena: fix retrieval of nonadaptive interrupt moderation intervals
Nonadaptive interrupt moderation intervals are assigned the value set
by the user in ethtool -C divided by ena_dev->intr_delay_resolution.
Therefore when the user tries to get the nonadaptive interrupt moderation
intervals with ethtool -c the code needs to multiply the saved value
by ena_dev->intr_delay_resolution.
The current code erroneously divides instead of multiplying in ethtool -c.
This patch fixes this.
net: ena: fix update of interrupt moderation register
Current implementation always updates the interrupt register with
the smoothed_interval of the rx_ring. However this should be
done only in case of adaptive interrupt moderation. If non-adaptive
interrupt moderation is used, the non-adaptive interrupt moderation
interval should be used. This commit fixes that.
net: ena: remove old adaptive interrupt moderation code from ena_netdev
1. Out of the fields {per_napi_bytes, per_napi_packets} in struct ena_ring,
only rx_ring->per_napi_packets are used to determine if napi did work
for dim.
This commit removes all other uses of these fields.
2. Remove ena_ring->moder_tbl_idx, which is not used by dim.
3. Remove all calls to ena_com_destroy_interrupt_moderation(), since all it
did was to destroy the interrupt moderation table, which is removed as
part of removing old interrupt moderation code.
net: ena: enable the interrupt_moderation in driver_supported_features
Add driver_supported_features to host_host info which is a new API used to
communicate to the device which features are supported by the driver.
Add the interrupt_moderation bit to host_info->driver_supported_features
and enable it to signal the device that this driver supports interrupt
moderation properly.
Reserved bits are for features implemented in the future
1. Remove old adaptive interrupt moderation code from set/get_coalesce()
2. Add ena_update_rx_rings_intr_moderation() function for updating
nonadaptive interrupt moderation intervals similarly to
ena_update_tx_rings_intr_moderation().
3. Remove checks of multiple unsupported received interrupt coalescing
parameters. This makes code cleaner and cancels the need to update
it every time a new coalescing parameter is invented.
net: ena: add intr_moder_rx_interval to struct ena_com_dev and use it
Add intr_moder_rx_interval to struct ena_com_dev and use it as the
location where the interrupt moderation rx interval is saved, instead
of the interrupt moderation table.
This is done as a first step before removing the old interrupt moderation
code.
====================
ethtool: implement Energy Detect Powerdown support via phy-tunable
This changeset proposes a new control for PHY tunable to control Energy
Detect Power Down.
The `phy_tunable_id` has been named `ETHTOOL_PHY_EDPD` since it looks like
this feature is common across other PHYs (like EEE), and defining
`ETHTOOL_PHY_ENERGY_DETECT_POWER_DOWN` seems too long.
The way EDPD works, is that the RX block is put to a lower power mode,
except for link-pulse detection circuits. The TX block is also put to low
power mode, but the PHY wakes-up periodically to send link pulses, to avoid
lock-ups in case the other side is also in EDPD mode.
Currently, there are 2 PHY drivers that look like they could use this new
PHY tunable feature: the `adin` && `micrel` PHYs.
This series updates only the `adin` PHY driver to support this new feature,
as this chip has been tested. A change for `micrel` can be proposed after a
discussion of the PHY-tunable API is resolved.
====================
net: phy: adin: implement Energy Detect Powerdown mode via phy-tunable
This driver becomes the first user of the kernel's `ETHTOOL_PHY_EDPD`
phy-tunable feature.
EDPD is also enabled by default on PHY config_init, but can be disabled via
the phy-tunable control.
When enabling EDPD, it's also a good idea (for the ADIN PHYs) to enable TX
periodic pulses, so that in case the other PHY is also on EDPD mode, there
is no lock-up situation where both sides are waiting for the other to
transmit.
Via the phy-tunable control, TX pulses can be disabled if specifying 0
`tx-interval` via ethtool.
The ADIN PHY supports only fixed 1 second intervals; they cannot be
configured. That is why the acceptable values are 1,
ETHTOOL_PHY_EDPD_DFLT_TX_MSECS and ETHTOOL_PHY_EDPD_NO_TX (which disables
TX pulses).
ethtool: implement Energy Detect Powerdown support via phy-tunable
The `phy_tunable_id` has been named `ETHTOOL_PHY_EDPD` since it looks like
this feature is common across other PHYs (like EEE), and defining
`ETHTOOL_PHY_ENERGY_DETECT_POWER_DOWN` seems too long.
The way EDPD works, is that the RX block is put to a lower power mode,
except for link-pulse detection circuits. The TX block is also put to low
power mode, but the PHY wakes-up periodically to send link pulses, to avoid
lock-ups in case the other side is also in EDPD mode.
Currently, there are 2 PHY drivers that look like they could use this new
PHY tunable feature: the `adin` && `micrel` PHYs.
The ADIN's datasheet mentions that TX pulses are at intervals of 1 second
default each, and they can be disabled. For the Micrel KSZ9031 PHY, the
datasheet does not mention whether they can be disabled, but mentions that
they can modified.
The way this change is structured, is similar to the PHY tunable downshift
control:
* a `ETHTOOL_PHY_EDPD_DFLT_TX_MSECS` value is exposed to cover a default
TX interval; some PHYs could specify a certain value that makes sense
* `ETHTOOL_PHY_EDPD_NO_TX` would disable TX when EDPD is enabled
* `ETHTOOL_PHY_EDPD_DISABLE` will disable EDPD
As noted by the `ETHTOOL_PHY_EDPD_DFLT_TX_MSECS` the interval unit is 1
millisecond, which should cover a reasonable range of intervals:
- from 1 millisecond, which does not sound like much of a power-saver
- to ~65 seconds which is quite a lot to wait for a link to come up when
plugging a cable
xen-netfront: do not assume sk_buff_head list is empty in error handling
When skb_shinfo(skb) is not able to cache extra fragment (that is,
skb_shinfo(skb)->nr_frags >= MAX_SKB_FRAGS), xennet_fill_frags() assumes
the sk_buff_head list is already empty. As a result, cons is increased only
by 1 and returns to error handling path in xennet_poll().
However, if the sk_buff_head list is not empty, queue->rx.rsp_cons may be
set incorrectly. That is, queue->rx.rsp_cons would point to the rx ring
buffer entries whose queue->rx_skbs[i] and queue->grant_rx_ref[i] are
already cleared to NULL. This leads to NULL pointer access in the next
iteration to process rx ring buffer entries.
Below is how xennet_poll() does error handling. All remaining entries in
tmpq are accounted to queue->rx.rsp_cons without assuming how many
outstanding skbs are remained in the list.
985 static int xennet_poll(struct napi_struct *napi, int budget)
... ...
1032 if (unlikely(xennet_set_skb_gso(skb, gso))) {
1033 __skb_queue_head(&tmpq, skb);
1034 queue->rx.rsp_cons += skb_queue_len(&tmpq);
1035 goto err;
1036 }
It is better to always have the error handling in the same way.
Fixes: ad4f15dc2c70 ("xen/netfront: don't bug in case of too many frags") Signed-off-by: Dongli Zhang <[email protected]> Signed-off-by: David S. Miller <[email protected]>
There is a race condition that can occur when calling ena_down().
The ena_clean_tx_irq() - which is a part of the napi handler -
function might wake up the tx queue when the queue is supposed
to be down (during recovery or changing the size of the queues
for example) This causes the ena_start_xmit() function to trigger
and possibly try to access the destroyed queues.
When working in 'packet' mode, drop monitor generates a notification
with a potentially truncated payload of the dropped packet. The payload
is copied from the MAC header, but I forgot to check that the MAC header
was set, so do it now.
Patch #1 sets the offsets to the various protocol layers in netdevsim,
so that it will continue to work after the MAC header check is added to
drop monitor in patch #2.
====================
When working in 'packet' mode, drop monitor generates a notification
with a potentially truncated payload of the dropped packet. The payload
is copied from the MAC header, but I forgot to check that the MAC header
was set, so do it now.
Fixes: ca30707dee2b ("drop_monitor: Add packet alert mode") Fixes: 5e58109b1ea4 ("drop_monitor: Add support for packet alert mode for hardware drops") Acked-by: Jiri Pirko <[email protected]> Signed-off-by: Ido Schimmel <[email protected]> Signed-off-by: David S. Miller <[email protected]>
David S. Miller [Mon, 16 Sep 2019 19:32:58 +0000 (21:32 +0200)]
Merge branch 'tc-taprio-offload-for-SJA1105-DSA'
Vladimir Oltean says:
====================
tc-taprio offload for SJA1105 DSA
This is the third attempt to submit the tc-taprio offload model for
inclusion in the networking tree. The sja1105 switch driver will provide
the first implementation of the offload. Only the bare minimum is added:
- The offload model and a DSA pass-through
- The hardware implementation
- The interaction with the netdev queues in the tagger code
- Documentation
What has been removed from previous attempts is support for
PTP-as-clocksource in sja1105, as well as configuring the traffic class
for management traffic. These will be added as soon as the offload
model is settled.
====================
Vladimir Oltean [Sun, 15 Sep 2019 02:00:02 +0000 (05:00 +0300)]
net: dsa: sja1105: Configure the Time-Aware Scheduler via tc-taprio offload
This qdisc offload is the closest thing to what the SJA1105 supports in
hardware for time-based egress shaping. The switch core really is built
around SAE AS6802/TTEthernet (a TTTech standard) but can be made to
operate similarly to IEEE 802.1Qbv with some constraints:
- The gate control list is a global list for all ports. There are 8
execution threads that iterate through this global list in parallel.
I don't know why 8, there are only 4 front-panel ports.
- Care must be taken by the user to make sure that two execution threads
never get to execute a GCL entry simultaneously. I created a O(n^4)
checker for this hardware limitation, prior to accepting a taprio
offload configuration as valid.
- The spec says that if a GCL entry's interval is shorter than the frame
length, you shouldn't send it (and end up in head-of-line blocking).
Well, this switch does anyway.
- The switch has no concept of ADMIN and OPER configurations. Because
it's so simple, the TAS settings are loaded through the static config
tables interface, so there isn't even place for any discussion about
'graceful switchover between ADMIN and OPER'. You just reset the
switch and upload a new OPER config.
- The switch accepts multiple time sources for the gate events. Right
now I am using the standalone clock source as opposed to PTP. So the
base time parameter doesn't really do much. Support for the PTP clock
source will be added in a future series.
Vladimir Oltean [Sun, 15 Sep 2019 02:00:01 +0000 (05:00 +0300)]
net: dsa: sja1105: Advertise the 8 TX queues
This is a preparation patch for the tc-taprio offload (and potentially
for other future offloads such as tc-mqprio).
Instead of looking directly at skb->priority during xmit, let's get the
netdev queue and the queue-to-traffic-class mapping, and put the
resulting traffic class into the dsa_8021q PCP field. The switch is
configured with a 1-to-1 PCP-to-ingress-queue-to-egress-queue mapping
(see vlan_pmap in sja1105_main.c), so the effect is that we can inject
into a front-panel's egress traffic class through VLAN tagging from
Linux, completely transparently.
Unfortunately the switch doesn't look at the VLAN PCP in the case of
management traffic to/from the CPU (link-local frames at
01-80-C2-xx-xx-xx or 01-1B-19-xx-xx-xx) so we can't alter the
transmission queue of this type of traffic on a frame-by-frame basis. It
is only selected through the "hostprio" setting which ATM is harcoded in
the driver to 7.
Vladimir Oltean [Sun, 15 Sep 2019 01:59:59 +0000 (04:59 +0300)]
net: dsa: Pass ndo_setup_tc slave callback to drivers
DSA currently handles shared block filters (for the classifier-action
qdisc) in the core due to what I believe are simply pragmatic reasons -
hiding the complexity from drivers and offerring a simple API for port
mirroring.
Extend the dsa_slave_setup_tc function by passing all other qdisc
offloads to the driver layer, where the driver may choose what it
implements and how. DSA is simply a pass-through in this case.
This allows taprio to offload the schedule enforcement to capable
network cards, resulting in more precise windows and less CPU usage.
The gate mask acts on traffic classes (groups of queues of same
priority), as specified in IEEE 802.1Q-2018, and following the existing
taprio and mqprio semantics.
It is up to the driver to perform conversion between tc and individual
netdev queues if for some reason it needs to make that distinction.
Full offload is requested from the network interface by specifying
"flags 2" in the tc qdisc creation command, which in turn corresponds to
the TCA_TAPRIO_ATTR_FLAG_FULL_OFFLOAD bit.
The important detail here is the clockid which is implicitly /dev/ptpN
for full offload, and hence not configurable.
A reference counting API is added to support the use case where Ethernet
drivers need to keep the taprio offload structure locally (i.e. they are
a multi-port switch driver, and configuring a port depends on the
settings of other ports as well). The refcount_t variable is kept in a
private structure (__tc_taprio_qopt_offload) and not exposed to drivers.
In the future, the private structure might also be expanded with a
backpointer to taprio_sched *q, to implement the notification system
described in the patch (of when admin became oper, or an error occurred,
etc, so the offload can be monitored with 'tc qdisc show').
Steve French [Mon, 16 Sep 2019 03:38:52 +0000 (22:38 -0500)]
smb3: add missing worker function for SMB3 change notify
SMB3 change notify is important to allow applications to wait
on directory change events of different types (e.g. adding
and deleting files from others systems). Add worker functions
for this.
when mounting with modefromsid, we end up writing 4 ACE in a security
descriptor that only has room for 3, thus triggering an out-of-bounds
write. fix this by changing the min size of a security descriptor.
Steve French [Thu, 12 Sep 2019 22:52:54 +0000 (17:52 -0500)]
smb3: fix unmount hang in open_shroot
An earlier patch "CIFS: fix deadlock in cached root handling"
did not completely address the deadlock in open_shroot. This
patch addresses the deadlock.
In testing the recent patch:
smb3: improve handling of share deleted (and share recreated)
we were able to reproduce the open_shroot deadlock to one
of the target servers in unmount in a delete share scenario.
Fixes: 7e5a70ad88b1e ("CIFS: fix deadlock in cached root handling")
This is version 2 of this patch. An earlier version of this
patch "smb3: fix unmount hang in open_shroot" had a problem
found by Dan.
Steve French [Thu, 12 Sep 2019 02:46:20 +0000 (21:46 -0500)]
smb3: allow disabling requesting leases
In some cases to work around server bugs or performance
problems it can be helpful to be able to disable requesting
SMB2.1/SMB3 leases on a particular mount (not to all servers
and all shares we are mounted to). Add new mount parm
"nolease" which turns off requesting leases on directory
or file opens. Currently the only way to disable leases is
globally through a module load parameter. This is more
granular.
Steve French [Wed, 11 Sep 2019 05:07:36 +0000 (00:07 -0500)]
smb3: improve handling of share deleted (and share recreated)
When a share is deleted, returning EIO is confusing and no useful
information is logged. Improve the handling of this case by
at least logging a better error for this (and also mapping the error
differently to EREMCHG). See e.g. the new messages that would be logged:
In addition for the case where a share is deleted and then recreated
with the same name, have now fixed that so it works. This is sometimes
done for example, because the admin had to move a share to a different,
bigger local drive when a share is running low on space.
Steve French [Tue, 10 Sep 2019 03:57:11 +0000 (22:57 -0500)]
smb3: display max smb3 requests in flight at any one time
Displayed in /proc/fs/cifs/Stats once for each
socket we are connected to.
This allows us to find out what the maximum number of
requests that had been in flight (at any one time). Note that
/proc/fs/cifs/Stats can be reset if you want to look for
maximum over a small period of time.
Sample output (immediately after mount):
Resources in use
CIFS Session: 1
Share (unique mount targets): 2
SMB Request/Response Buffer: 1 Pool size: 5
SMB Small Req/Resp Buffer: 1 Pool size: 30
Operations (MIDs): 0
0 session 0 share reconnects
Total vfs operations: 5 maximum at one time: 2
Max requests in flight: 2
1) \\localhost\scratch
SMBs: 18
Bytes read: 0 Bytes written: 0
...
Ronnie Sahlberg [Mon, 9 Sep 2019 05:30:00 +0000 (15:30 +1000)]
cifs: add a helper to find an existing readable handle to a file
and convert smb2_query_path_info() to use it.
This will eliminate the need for a SMB2_Create when we already have an
open handle that can be used. This will also prevent a oplock break
in case the other handle holds a lease.
Steve French [Mon, 9 Sep 2019 04:22:02 +0000 (23:22 -0500)]
smb3: enable offload of decryption of large reads via mount option
Disable offload of the decryption of encrypted read responses
by default (equivalent to setting this new mount option "esize=0").
Allow setting the minimum encrypted read response size that we
will choose to offload to a worker thread - it is now configurable
via on a new mount option "esize="
Depending on which encryption mechanism (GCM vs. CCM) and
the number of reads that will be issued in parallel and the
performance of the network and CPU on the client, it may make
sense to enable this since it can provide substantial benefit when
multiple large reads are in flight at the same time.
Steve French [Sat, 7 Sep 2019 06:09:49 +0000 (01:09 -0500)]
smb3: allow parallelizing decryption of reads
decrypting large reads on encrypted shares can be slow (e.g. adding
multiple milliseconds per-read on non-GCM capable servers or
when mounting with dialects prior to SMB3.1.1) - allow parallelizing
of read decryption by launching worker threads.
Testing to Samba on localhost showed 25% improvement.
Testing to remote server showed very large improvement when
doing more than one 'cp' command was called at one time.
Steve French [Thu, 5 Sep 2019 04:07:52 +0000 (23:07 -0500)]
smb3: fix signing verification of large reads
Code cleanup in the 5.1 kernel changed the array
passed into signing verification on large reads leading
to warning messages being logged when copying files to local
systems from remote.
SMB signature verification returned error = -5
This changeset fixes verification of SMB3 signatures of large
reads.
Steve French [Wed, 4 Sep 2019 02:18:49 +0000 (21:18 -0500)]
smb3: allow skipping signature verification for perf sensitive configurations
Add new mount option "signloosely" which enables signing but skips the
sometimes expensive signing checks in the responses (signatures are
calculated and sent correctly in the SMB2/SMB3 requests even with this
mount option but skipped in the responses). Although weaker for security
(and also data integrity in case a packet were corrupted), this can provide
enough of a performance benefit (calculating the signature to verify a
packet can be expensive especially for large packets) to be useful in
some cases.
Steve French [Tue, 3 Sep 2019 23:35:42 +0000 (18:35 -0500)]
smb3: add dynamic tracepoints for flush and close
We only had dynamic tracepoints on errors in flush
and close, but may be helpful to trace enter
and non-error exits for those. Sample trace examples
(excerpts) from "cp" and "dd" show two of the new
tracepoints.
Steve French [Tue, 3 Sep 2019 22:49:46 +0000 (17:49 -0500)]
smb3: log warning if CSC policy conflicts with cache mount option
If the server config (e.g. Samba smb.conf "csc policy = disable)
for the share indicates that the share should not be cached, log
a warning message if forced client side caching ("cache=ro" or
"cache=singleclient") is requested on mount.
Steve French [Fri, 30 Aug 2019 07:12:41 +0000 (02:12 -0500)]
smb3: add mount option to allow RW caching of share accessed by only 1 client
If a share is known to be only to be accessed by one client, we
can aggressively cache writes not just reads to it.
Add "cache=" option (cache=singleclient) for mounting read write shares
(that will not be read or written to from other clients while we have
it mounted) in order to improve performance.
Steve French [Fri, 30 Aug 2019 03:33:38 +0000 (22:33 -0500)]
smb3: add some more descriptive messages about share when mounting cache=ro
Add some additional logging so the user can see if the share they
mounted with cache=ro is considered read only by the server
CIFS: Attempting to mount //localhost/test
CIFS VFS: mounting share with read only caching. Ensure that the share will not be modified while in use.
CIFS VFS: read only mount of RW share
CIFS: Attempting to mount //localhost/test-ro
CIFS VFS: mounting share with read only caching. Ensure that the share will not be modified while in use.
CIFS VFS: mounted to read only share
Steve French [Wed, 28 Aug 2019 04:58:54 +0000 (23:58 -0500)]
smb3: add mount option to allow forced caching of read only share
If a share is immutable (at least for the period that it will
be mounted) it would be helpful to not have to revalidate
dentries repeatedly that we know can not be changed remotely.
Add "cache=" option (cache=ro) for mounting read only shares
in order to improve performance in cases in which we know that
the share will not be changing while it is in use.
Colin Ian King [Mon, 2 Sep 2019 15:10:59 +0000 (16:10 +0100)]
cifs: fix dereference on ses before it is null checked
The assignment of pointer server dereferences pointer ses, however,
this dereference occurs before ses is null checked and hence we
have a potential null pointer dereference. Fix this by only
dereferencing ses after it has been null checked.
Addresses-Coverity: ("Dereference before null check") Fixes: 2808c6639104 ("cifs: add new debugging macro cifs_server_dbg") Signed-off-by: Colin Ian King <[email protected]> Signed-off-by: Steve French <[email protected]>
Ronnie Sahlberg [Wed, 28 Aug 2019 07:15:35 +0000 (17:15 +1000)]
cifs: add new debugging macro cifs_server_dbg
which can be used from contexts where we have a TCP_Server_Info *server.
This new macro will prepend the debugging string with "Server:<servername> "
which will help when debugging issues on hosts with many cifs connections
to several different servers.
Convert a bunch of cifs_dbg(VFS) calls to cifs_server_dbg(VFS)
Ronnie Sahlberg [Thu, 29 Aug 2019 22:25:46 +0000 (08:25 +1000)]
cifs: create a helper to find a writeable handle by path name
rename() takes a path for old_file and in SMB2 we used to just create
a compound for create(old_path)/rename/close().
If we already have a writable handle we can avoid the create() and close()
altogether and just use the existing handle.
For this situation, as we avoid doing the create()
we also avoid triggering an oplock break for the existing handle.
YueHaibing [Fri, 23 Aug 2019 12:15:35 +0000 (20:15 +0800)]
cifs: remove set but not used variables
Fixes gcc '-Wunused-but-set-variable' warning:
fs/cifs/file.c: In function cifs_lock:
fs/cifs/file.c:1696:24: warning: variable cinode set but not used [-Wunused-but-set-variable]
fs/cifs/file.c: In function cifs_write:
fs/cifs/file.c:1765:23: warning: variable cifs_sb set but not used [-Wunused-but-set-variable]
fs/cifs/file.c: In function collect_uncached_read_data:
fs/cifs/file.c:3578:20: warning: variable tcon set but not used [-Wunused-but-set-variable]
'cinode' is never used since introduced by
commit 03776f4516bc ("CIFS: Simplify byte range locking code")
'cifs_sb' is not used since commit cb7e9eabb2b5 ("CIFS: Use
multicredits for SMB 2.1/3 writes").
'tcon' is not used since commit d26e2903fc10 ("smb3: fix bytes_read statistics")
Colin Ian King [Wed, 31 Jul 2019 09:05:26 +0000 (10:05 +0100)]
cifs: remove redundant assignment to variable rc
Variable rc is being initialized with a value that is never read
and rc is being re-assigned a little later on. The assignment is
redundant and hence can be removed.
Steve French [Fri, 19 Jul 2019 08:15:55 +0000 (08:15 +0000)]
cifs: allow chmod to set mode bits using special sid
When mounting with "modefromsid" set mode bits (chmod) by
adding ACE with special SID (S-1-5-88-3-<mode>) to the ACL.
Subsequent patch will fix setting default mode on file
create and mkdir.
See See e.g.
https://docs.microsoft.com/en-us/previous-versions/windows/it-pro/windows-server-2008-R2-and-2008/hh509017(v=ws.10)
Steve French [Fri, 19 Jul 2019 06:30:07 +0000 (06:30 +0000)]
cifs: get mode bits from special sid on stat
When mounting with "modefromsid" retrieve mode bits from
special SID (S-1-5-88-3) on stat. Subsequent patch will fix
setattr (chmod) to save mode bits in S-1-5-88-3-<mode>
Note that when an ACE matching S-1-5-88-3 is not found, we
default the mode to an approximation based on the owner, group
and everyone permissions (as with the "cifsacl" mount option).
See See e.g.
https://docs.microsoft.com/en-us/previous-versions/windows/it-pro/windows-server-2008-R2-and-2008/hh509017(v=ws.10)
Colin Ian King [Tue, 23 Jul 2019 15:09:19 +0000 (16:09 +0100)]
fs: cifs: cifsssmb: remove redundant assignment to variable ret
The variable ret is being initialized however this is never read
and later it is being reassigned to a new value. The initialization
is redundant and hence can be removed.
IB/mlx5: Use the original address for the page during free_pages
The removal of 'buffer' in the patch below caused free_page() to use a
value that had been offset since the wqe pointer is adjusted while the
routine runs.
The current implementation of free_pages() rounds down to a pfn,
discarding the adjustment, but this is not the right way to use the
API. Preserve the initial value and use it for free_page().
Merge tag 'core-process-v5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux
Pull pidfd/waitid updates from Christian Brauner:
"This contains two features and various tests.
First, it adds support for waiting on process through pidfds by adding
the P_PIDFD type to the waitid() syscall. This completes the basic
functionality of the pidfd api (cf. [1]). In the meantime we also have
a new adition to the userspace projects that make use of the pidfd
api. The qt project was nice enough to send a mail pointing out that
they have a pr up to switch to the pidfd api (cf. [2]).
Second, this tag contains an extension to the waitid() syscall to make
it possible to wait on the current process group in a race free manner
(even though the actual problem is very unlikely) by specifing 0
together with the P_PGID type. This extension traces back to a
discussion on the glibc development mailing list.
There are also a range of tests for the features above. Additionally,
the test-suite which detected the pidfd-polling race we fixed in [3]
is included in this tag"
[1] https://lwn.net/Articles/794707/
[2] https://codereview.qt-project.org/c/qt/qtbase/+/108456
[3] commit b191d6491be6 ("pidfd: fix a poll race when setting exit_state")
* tag 'core-process-v5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
waitid: Add support for waiting for the current process group
tests: add pidfd poll tests
tests: move common definitions and functions into pidfd.h
pidfd: add pidfd_wait tests
pidfd: add P_PIDFD to waitid()
Chao Yu [Wed, 28 Aug 2019 09:33:37 +0000 (17:33 +0800)]
f2fs: fix to fallback to buffered IO in IO aligned mode
In LFS mode, we allow OPU for direct IO, however, we didn't consider
IO alignment feature, so direct IO can trigger unaligned IO, let's
just fallback to buffered IO to keep correct IO alignment semantics
in all places.
Fixes: f847c699cff3 ("f2fs: allow out-place-update for direct IO in LFS mode") Signed-off-by: Chao Yu <[email protected]> Signed-off-by: Jaegeuk Kim <[email protected]>
Since commit f847c699cff3 ("f2fs: allow out-place-update for direct
IO in LFS mode"), we start to allow OPU mode for direct IO, however,
we missed to update extent cache in __allocate_data_block(), finally,
it cause extent field being inconsistent with physical block address,
fix it.
Fixes: f847c699cff3 ("f2fs: allow out-place-update for direct IO in LFS mode") Signed-off-by: Chao Yu <[email protected]> Signed-off-by: Jaegeuk Kim <[email protected]>
Surbhi Palande [Fri, 23 Aug 2019 22:40:45 +0000 (15:40 -0700)]
f2fs: check all the data segments against all node ones
As a part of the sanity checking while mounting, distinct segment number
assignment to data and node segments is verified. Fixing a small bug in
this verification between node and data segments. We need to check all
the data segments with all the node segments.
Fixes: 042be0f849e5f ("f2fs: fix to do sanity check with current segment number") Signed-off-by: Surbhi Palande <[email protected]> Reviewed-by: Chao Yu <[email protected]> Signed-off-by: Jaegeuk Kim <[email protected]>
This is similar to 942491c9e6d6 ("xfs: fix AIM7 regression")
Apparently our current rwsem code doesn't like doing the trylock, then
lock for real scheme. So change our read/write methods to just do the
trylock for the RWF_NOWAIT case.
We don't need a check for IOCB_NOWAIT and !direct-IO because it
is checked in generic_write_checks().
f2fs: fix to avoid accessing uninitialized field of inode page in is_alive()
If inode is newly created, inode page may not synchronize with inode cache,
so fields like .i_inline or .i_extra_isize could be wrong, in below call
path, we may access such wrong fields, result in failing to migrate valid
target block.
So, refactor atomic_write flow like this:
1. start_atomic_write
- add inmem_list and set atomic_file
2. write()
- register it in inmem_pages
3. commit_atomic_write
- if no error, f2fs_drop_inmem_pages()
- f2fs_commit_inmme_pages() failed
: __revoked_inmem_pages() was done
- f2fs_do_sync_file failed
: abort_atomic_write later
Russell King [Sat, 14 Sep 2019 09:44:04 +0000 (10:44 +0100)]
net: phylink: clarify where phylink should be used
Update the phylink documentation to make it clear that phylink is
designed to be used on the MAC facing side of the link, rather than
between a SFP and PHY.
A follow-up patchset for the recently added health and error recovery
feature. The first fix is to prevent .ndo_set_rx_mode() from proceeding
when reset is in progress. The 2nd fix is for the firmware coredump
command. The 3rd and 4th patches update the error recovery process
slightly to add a state that polls and waits for the firmware to be down.
====================
bnxt_en: Add a new BNXT_FW_RESET_STATE_POLL_FW_DOWN state.
This new state is required when firmware indicates that the error
recovery process requires polling for firmware state to be completely
down before initiating reset. For example, firmware may take some
time to collect the crash dump before it is down and ready to be
reset.
bnxt_en: Increase timeout for HWRM_DBG_COREDUMP_XX commands
Firmware coredump messages take much longer than standard messages,
so increase the timeout accordingly.
Fixes: 6c5657d085ae ("bnxt_en: Add support for ethtool get dump.") Signed-off-by: Vasundhara Volam <[email protected]> Signed-off-by: Michael Chan <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Michael Chan [Sat, 14 Sep 2019 04:01:38 +0000 (00:01 -0400)]
bnxt_en: Don't proceed in .ndo_set_rx_mode() when device is not in open state.
Check the BNXT_STATE_OPEN flag instead of netif_running() in
bnxt_set_rx_mode(). If the driver is going through any reset, such
as firmware reset or even TX timeout, it may not be ready to set the RX
mode and may crash. The new rx mode settings will be picked up when
the device is opened again later.
Fixes: 230d1f0de754 ("bnxt_en: Handle firmware reset.") Signed-off-by: Michael Chan <[email protected]> Signed-off-by: David S. Miller <[email protected]>
André Almeida [Mon, 16 Sep 2019 14:07:57 +0000 (11:07 -0300)]
null_blk: do not fail the module load with zero devices
The module load should fail only if there is something wrong with the
configuration or if an error prevents it to work properly. The module
should be able to be loaded with (nr_device == 0), since it will not
trigger errors or be in malfunction state. Preventing loading with zero
devices also breaks applications that configures this module using
configfs API. Remove the nr_device check to fix this.
Fixes: f7c4ce890dd2 ("null_blk: validate the number of devices") Reviewed-by: Chaitanya Kulkarni <[email protected]> Signed-off-by: André Almeida <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
The D-Link DIR-685 had its clock polarity set as active
low using the special SPI "spi-cpol" property.
This is not correct: the datasheet clearly states:
"Fix SCL to GND level when not in use" which is
indicative that this line is active high.
After a recent fix making the GPIO-based SPI driver
force the clock line de-asserted at the beginning of
each SPI transaction this reared its ugly head: now
de-asserted was taken to mean the line should be
driven high, but it should be driven low.
Fix this up in the DTS file and the display works again.
Thomas Higdon [Fri, 13 Sep 2019 23:23:35 +0000 (23:23 +0000)]
tcp: Add snd_wnd to TCP_INFO
Neal Cardwell mentioned that snd_wnd would be useful for diagnosing TCP
performance problems --
> (1) Usually when we're diagnosing TCP performance problems, we do so
> from the sender, since the sender makes most of the
> performance-critical decisions (cwnd, pacing, TSO size, TSQ, etc).
> From the sender-side the thing that would be most useful is to see
> tp->snd_wnd, the receive window that the receiver has advertised to
> the sender.
This serves the purpose of adding an additional __u32 to avoid the
would-be hole caused by the addition of the tcpi_rcvi_ooopack field.
Thomas Higdon [Fri, 13 Sep 2019 23:23:34 +0000 (23:23 +0000)]
tcp: Add TCP_INFO counter for packets received out-of-order
For receive-heavy cases on the server-side, we want to track the
connection quality for individual client IPs. This counter, similar to
the existing system-wide TCPOFOQueue counter in /proc/net/netstat,
tracks out-of-order packet reception. By providing this counter in
TCP_INFO, it will allow understanding to what degree receive-heavy
sockets are experiencing out-of-order delivery and packet drops
indicating congestion.
Please note that this is similar to the counter in NetBSD TCP_INFO, and
has the same name.
Also note that we avoid increasing the size of the tcp_sock struct by
taking advantage of a hole.
The MDIO device reset line is optional and now that gpiod_get_optional()
returns proper value when GPIO support is compiled out, there is no
reason to use fwnode_get_named_gpiod() that I plan to hide away.
Let's switch to using more standard gpiod_get_optional() and
gpiod_set_consumer_name() to keep the nice "PHY reset" label.
Also there is no reason to only try to fetch the reset GPIO when we have
OF node, gpiolib can fetch GPIO data from firmwares as well.
This commit introduces a new ioctl DM_GET_TARGET_VERSION. It will load a
target that is specified in the "name" entry in the parameter structure
and return its version.
This functionality is intended to be used by cryptsetup, so that it can
query kernel capabilities before activating the device.
The following pull-request contains BPF updates for your *net-next* tree.
The main changes are:
1) Now that initial BPF backend for gcc has been merged upstream, enable
BPF kselftest suite for bpf-gcc. Also fix a BE issue with access to
bpf_sysctl.file_pos, from Ilya.
2) Follow-up fix for link-vmlinux.sh to remove bash-specific extensions
related to recent work on exposing BTF info through sysfs, from Andrii.
3) AF_XDP zero copy fixes for i40e and ixgbe driver which caused umem
headroom to be added twice, from Ciara.
4) Refactoring work to convert sock opt tests into test_progs framework
in BPF kselftests, from Stanislav.
5) Fix a general protection fault in dev_map_hash_update_elem(), from Toke.
6) Cleanup to use BPF_PROG_RUN() macro in KCM, from Sami.
====================
Yixian Liu [Thu, 29 Aug 2019 08:41:41 +0000 (16:41 +0800)]
RDMA/hns: Optimize cmd init and mode selection for hip08
There are two modes for mailbox command (cmd) queue, i.e., event mode and
poll mode. For each mode, we use corresponding semaphores to protect the
cmd queue resource competition, so called event_sem and poll_sem. During
cmd init, both semaphores are initialized and poll mode is selected.
Thus, there is no need to up poll_sema again in cmd_use_polling.
Furthermore, there is no need to down the sema of the other side while
switching mode. This patch aims to decouple the switch between event mode
and poll mode of cmd.
Jonathan Chocron [Thu, 12 Sep 2019 13:02:38 +0000 (16:02 +0300)]
PCI: dwc: Add validation that PCIe core is set to correct mode
Some PCIe controllers can be set to either Host or EP according to some
early boot FW. To make sure there is no discrepancy (e.g. FW configured
the port to EP mode while the DT specifies it as a host bridge or vice
versa), a check has been added for each mode.
This driver is DT based and utilizes the DesignWare APIs.
It allows using a smaller ECAM range for a larger bus range -
usually an entire bus uses 1MB of address space, but the driver
can use it for a larger number of buses. This is achieved by using a HW
mechanism which allows changing the BUS part of the "final" outgoing
config transaction. There are 2 HW regs, one which is basically a
bitmask determining which bits to take from the AXI transaction itself
and another which holds the complementary part programmed by the
driver.
All link initializations are handled by the boot FW.
Jonathan Chocron [Thu, 12 Sep 2019 13:00:42 +0000 (16:00 +0300)]
PCI: Add quirk to disable MSI-X support for Amazon's Annapurna Labs Root Port
The Root Port (identified by [1c36:0031]) doesn't support MSI-X. On some
platforms it is configured to not advertise the capability at all, while
on others it (mistakenly) does. This causes a panic during
initialization by the pcieport driver, since it tries to configure the
MSI-X capability. Specifically, when trying to access the MSI-X table
a "non-existing addr" exception occurs.
Notice that this quirk also disables MSI (which may work, but hasn't
been tested nor has a current use case), since currently there is no
standard way to disable only MSI-X.