Git Repo - linux.git/log
Arthur Kiyanovski [Mon, 16 Sep 2019 11:31:35 +0000 (14:31 +0300)]
net: ena: fix retrieval of nonadaptive interrupt moderation intervals

Nonadaptive interrupt moderation intervals are assigned the value set
by the user in ethtool -C divided by ena_dev->intr_delay_resolution.

Therefore when the user tries to get the nonadaptive interrupt moderation
intervals with ethtool -c the code needs to multiply the saved value
by ena_dev->intr_delay_resolution.

The current code erroneously divides instead of multiplying in ethtool -c.
This patch fixes this.

Signed-off-by: Arthur Kiyanovski <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
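The relationship between the two ethtool paths described above can be sketched as follows. This is a minimal illustration and not the driver's code: the helper names `store_interval` and `report_interval` are hypothetical, only `intr_delay_resolution` comes from the commit message, and a non-zero resolution is assumed.

```c
#include <stdint.h>

/* Hypothetical helper for the `ethtool -C` path: the user-supplied value
 * is stored in units of the device's delay resolution.
 * Assumes intr_delay_resolution is non-zero. */
static uint32_t store_interval(uint32_t user_usecs,
			       uint32_t intr_delay_resolution)
{
	return user_usecs / intr_delay_resolution;
}

/* Hypothetical helper for the `ethtool -c` path: convert the stored value
 * back by multiplying. Dividing a second time here is the bug that the
 * commit message describes. */
static uint32_t report_interval(uint32_t stored,
				uint32_t intr_delay_resolution)
{
	return stored * intr_delay_resolution;
}
```

With a resolution of 2, a user value of 64 is stored as 32 and should be reported back as 64.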
Arthur Kiyanovski [Mon, 16 Sep 2019 11:31:34 +0000 (14:31 +0300)]
net: ena: fix update of interrupt moderation register

Current implementation always updates the interrupt register with
the smoothed_interval of the rx_ring. However this should be
done only in case of adaptive interrupt moderation. If non-adaptive
interrupt moderation is used, the non-adaptive interrupt moderation
interval should be used. This commit fixes that.

Signed-off-by: Arthur Kiyanovski <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
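The selection described above might look roughly like the sketch below. The struct and function names are hypothetical stand-ins chosen for illustration; only the idea of picking the smoothed interval in adaptive mode and the user-set interval otherwise comes from the commit message.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified view of the state involved. */
struct rx_moderation {
	bool adaptive_enabled;         /* is adaptive rx moderation on? */
	uint32_t smoothed_interval;    /* produced by the adaptive algorithm */
	uint32_t nonadaptive_interval; /* set by the user */
};

/* Pick the interval that should be written to the interrupt register. */
static uint32_t rx_interval_for_register(const struct rx_moderation *m)
{
	return m->adaptive_enabled ? m->smoothed_interval
				   : m->nonadaptive_interval;
}
```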
Arthur Kiyanovski [Mon, 16 Sep 2019 11:31:33 +0000 (14:31 +0300)]
net: ena: remove all old adaptive rx interrupt moderation code from ena_com

Remove previous implementation of adaptive rx interrupt moderation
from ena_com files.

Signed-off-by: Arthur Kiyanovski <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
Arthur Kiyanovski [Mon, 16 Sep 2019 11:31:32 +0000 (14:31 +0300)]
net: ena: remove ena_restore_ethtool_params() and relevant fields

Delete 4 unused fields from struct ena_adapter and their only user,
ena_restore_ethtool_params().

Signed-off-by: Arthur Kiyanovski <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
Arthur Kiyanovski [Mon, 16 Sep 2019 11:31:31 +0000 (14:31 +0300)]
net: ena: remove old adaptive interrupt moderation code from ena_netdev

1. Out of the fields {per_napi_bytes, per_napi_packets} in struct ena_ring,
   only rx_ring->per_napi_packets are used to determine if napi did work
   for dim.
   This commit removes all other uses of these fields.
2. Remove ena_ring->moder_tbl_idx, which is not used by dim.
3. Remove all calls to ena_com_destroy_interrupt_moderation(), since all it
   did was to destroy the interrupt moderation table, which is removed as
   part of removing old interrupt moderation code.

Signed-off-by: Arthur Kiyanovski <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
Arthur Kiyanovski [Mon, 16 Sep 2019 11:31:30 +0000 (14:31 +0300)]
net: ena: remove code duplication in ena_com_update_nonadaptive_moderation_interval_*()

Remove code duplication in:
ena_com_update_nonadaptive_moderation_interval_tx()
ena_com_update_nonadaptive_moderation_interval_rx()
functions.

Signed-off-by: Arthur Kiyanovski <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
Arthur Kiyanovski [Mon, 16 Sep 2019 11:31:29 +0000 (14:31 +0300)]
net: ena: enable the interrupt_moderation in driver_supported_features

Add driver_supported_features to host_info. This is a new API used to
communicate to the device which features are supported by the driver.

Add the interrupt_moderation bit to host_info->driver_supported_features
and enable it to signal the device that this driver supports interrupt
moderation properly.

Reserved bits are for features to be implemented in the future.

Signed-off-by: Arthur Kiyanovski <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
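A feature bitmap of this kind can be sketched as below. The struct name, the macro name, and the bit position are all assumptions made for illustration; they are not taken from the ENA headers. Only the field name `driver_supported_features` and the existence of an interrupt moderation bit come from the commit message.

```c
#include <stdint.h>

/* Assumption: illustrative bit position, not the one used by the device. */
#define EXAMPLE_FEATURE_INTERRUPT_MODERATION (1u << 0)

/* Hypothetical stand-in for the host info structure. */
struct example_host_info {
	uint32_t driver_supported_features; /* unset bits stay reserved (0) */
};

/* Signal to the device that interrupt moderation is supported. */
static void advertise_interrupt_moderation(struct example_host_info *info)
{
	info->driver_supported_features |= EXAMPLE_FEATURE_INTERRUPT_MODERATION;
}
```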
Arthur Kiyanovski [Mon, 16 Sep 2019 11:31:28 +0000 (14:31 +0300)]
net: ena: reimplement set/get_coalesce()

1. Remove old adaptive interrupt moderation code from set/get_coalesce()
2. Add ena_update_rx_rings_intr_moderation() function for updating
   nonadaptive interrupt moderation intervals similarly to
   ena_update_tx_rings_intr_moderation().
3. Remove checks of multiple unsupported received interrupt coalescing
   parameters. This makes code cleaner and cancels the need to update
   it every time a new coalescing parameter is invented.

Signed-off-by: Arthur Kiyanovski <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
Arthur Kiyanovski [Mon, 16 Sep 2019 11:31:27 +0000 (14:31 +0300)]
net: ena: switch to dim algorithm for rx adaptive interrupt moderation

Use the dim library for the rx adaptive interrupt moderation implementation

Signed-off-by: Arthur Kiyanovski <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
Arthur Kiyanovski [Mon, 16 Sep 2019 11:31:26 +0000 (14:31 +0300)]
net: ena: add intr_moder_rx_interval to struct ena_com_dev and use it

Add intr_moder_rx_interval to struct ena_com_dev and use it as the
location where the interrupt moderation rx interval is saved, instead
of the interrupt moderation table.

This is done as a first step before removing the old interrupt moderation
code.

Signed-off-by: Arthur Kiyanovski <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
David S. Miller [Mon, 16 Sep 2019 20:02:45 +0000 (22:02 +0200)]
Merge branch 'ethtool-implement-Energy-Detect-Powerdown-support-via-phy-tunable'

Alexandru Ardelean says:

====================
ethtool: implement Energy Detect Powerdown support via phy-tunable

This changeset proposes a new control for PHY tunable to control Energy
Detect Power Down.

The `phy_tunable_id` has been named `ETHTOOL_PHY_EDPD` since it looks like
this feature is common across other PHYs (like EEE), and defining
`ETHTOOL_PHY_ENERGY_DETECT_POWER_DOWN` seems too long.

The way EDPD works, is that the RX block is put to a lower power mode,
except for link-pulse detection circuits. The TX block is also put to low
power mode, but the PHY wakes-up periodically to send link pulses, to avoid
lock-ups in case the other side is also in EDPD mode.

Currently, there are 2 PHY drivers that look like they could use this new
PHY tunable feature: the `adin` and `micrel` PHYs.

This series updates only the `adin` PHY driver to support this new feature,
as this chip has been tested. A change for `micrel` can be proposed after a
discussion of the PHY-tunable API is resolved.
====================

Signed-off-by: David S. Miller <[email protected]>
Alexandru Ardelean [Mon, 16 Sep 2019 07:35:26 +0000 (10:35 +0300)]
net: phy: adin: implement Energy Detect Powerdown mode via phy-tunable

This driver becomes the first user of the kernel's `ETHTOOL_PHY_EDPD`
phy-tunable feature.
EDPD is also enabled by default on PHY config_init, but can be disabled via
the phy-tunable control.

When enabling EDPD, it's also a good idea (for the ADIN PHYs) to enable TX
periodic pulses, so that in case the other PHY is also on EDPD mode, there
is no lock-up situation where both sides are waiting for the other to
transmit.

Via the phy-tunable control, TX pulses can be disabled if specifying 0
`tx-interval` via ethtool.

The ADIN PHY supports only fixed 1 second intervals; they cannot be
configured. That is why the acceptable values are 1,
ETHTOOL_PHY_EDPD_DFLT_TX_MSECS and ETHTOOL_PHY_EDPD_NO_TX (which disables
TX pulses).

Reviewed-by: Florian Fainelli <[email protected]>
Signed-off-by: Alexandru Ardelean <[email protected]>
Reviewed-by: Andrew Lunn <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
Alexandru Ardelean [Mon, 16 Sep 2019 07:35:25 +0000 (10:35 +0300)]
ethtool: implement Energy Detect Powerdown support via phy-tunable

The `phy_tunable_id` has been named `ETHTOOL_PHY_EDPD` since it looks like
this feature is common across other PHYs (like EEE), and defining
`ETHTOOL_PHY_ENERGY_DETECT_POWER_DOWN` seems too long.

The way EDPD works, is that the RX block is put to a lower power mode,
except for link-pulse detection circuits. The TX block is also put to low
power mode, but the PHY wakes-up periodically to send link pulses, to avoid
lock-ups in case the other side is also in EDPD mode.

Currently, there are 2 PHY drivers that look like they could use this new
PHY tunable feature: the `adin` and `micrel` PHYs.

The ADIN's datasheet mentions that TX pulses are sent at intervals of
1 second by default, and that they can be disabled. For the Micrel KSZ9031
PHY, the datasheet does not mention whether they can be disabled, but
mentions that they can be modified.

The way this change is structured, is similar to the PHY tunable downshift
control:
* a `ETHTOOL_PHY_EDPD_DFLT_TX_MSECS` value is exposed to cover a default
  TX interval; some PHYs could specify a certain value that makes sense
* `ETHTOOL_PHY_EDPD_NO_TX` would disable TX when EDPD is enabled
* `ETHTOOL_PHY_EDPD_DISABLE` will disable EDPD

As noted by the `ETHTOOL_PHY_EDPD_DFLT_TX_MSECS` the interval unit is 1
millisecond, which should cover a reasonable range of intervals:
 - from 1 millisecond, which does not sound like much of a power-saver
 - to ~65 seconds which is quite a lot to wait for a link to come up when
   plugging a cable

Reviewed-by: Andrew Lunn <[email protected]>
Reviewed-by: Florian Fainelli <[email protected]>
Signed-off-by: Alexandru Ardelean <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
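The three special values and the millisecond range described above could be interpreted along the lines of this sketch. The `EXAMPLE_` macros and their numeric values are assumptions for illustration (the real definitions live in the ethtool UAPI header and should be checked there); the 16-bit type is inferred from the "~65 seconds" upper bound in the text.

```c
#include <stdint.h>

/* Assumed values, for illustration only. */
#define EXAMPLE_EDPD_DFLT_TX_MSECS 0xffff /* use the PHY's default interval */
#define EXAMPLE_EDPD_NO_TX         0xfffe /* EDPD on, no TX link pulses */
#define EXAMPLE_EDPD_DISABLE       0      /* EDPD off */

enum edpd_request {
	EDPD_OFF,
	EDPD_ON_NO_TX,
	EDPD_ON_DEFAULT_TX,
	EDPD_ON_CUSTOM_TX, /* value is a TX interval in milliseconds */
};

/* Classify a 16-bit tunable value; anything that is not one of the
 * special values is treated as an interval in milliseconds. */
static enum edpd_request classify_edpd(uint16_t val)
{
	switch (val) {
	case EXAMPLE_EDPD_DISABLE:
		return EDPD_OFF;
	case EXAMPLE_EDPD_NO_TX:
		return EDPD_ON_NO_TX;
	case EXAMPLE_EDPD_DFLT_TX_MSECS:
		return EDPD_ON_DEFAULT_TX;
	default:
		return EDPD_ON_CUSTOM_TX;
	}
}
```

A driver that supports only a fixed interval would reject most values in the last category.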
Dongli Zhang [Mon, 16 Sep 2019 03:46:59 +0000 (11:46 +0800)]
xen-netfront: do not assume sk_buff_head list is empty in error handling

When skb_shinfo(skb) is not able to cache an extra fragment (that is,
skb_shinfo(skb)->nr_frags >= MAX_SKB_FRAGS), xennet_fill_frags() assumes
the sk_buff_head list is already empty. As a result, cons is increased only
by 1 and returns to error handling path in xennet_poll().

However, if the sk_buff_head list is not empty, queue->rx.rsp_cons may be
set incorrectly. That is, queue->rx.rsp_cons would point to the rx ring
buffer entries whose queue->rx_skbs[i] and queue->grant_rx_ref[i] are
already cleared to NULL. This leads to NULL pointer access in the next
iteration to process rx ring buffer entries.

Below is how xennet_poll() does error handling. All remaining entries in
tmpq are accounted to queue->rx.rsp_cons without assuming how many
outstanding skbs remain in the list.

static int xennet_poll(struct napi_struct *napi, int budget)
... ...
        if (unlikely(xennet_set_skb_gso(skb, gso))) {
                __skb_queue_head(&tmpq, skb);
                queue->rx.rsp_cons += skb_queue_len(&tmpq);
                goto err;
        }

It is better to always have the error handling in the same way.

Fixes: ad4f15dc2c70 ("xen/netfront: don't bug in case of too many frags")
Signed-off-by: Dongli Zhang <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
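The accounting used by the xennet_poll() error path can be modelled with plain counters, as in the sketch below. The struct and function are hypothetical stand-ins: `tmpq_len` models the length of the temporary skb list and `rsp_cons` models the consumer index. No real skb handling is performed.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the rx ring consumer index and the temporary
 * skb list; only the counting is modelled here. */
struct example_rx_state {
	unsigned int rsp_cons; /* next response to consume */
	size_t tmpq_len;       /* skbs still queued on the temporary list */
};

/* Error path: put the current skb back at the head of the list, then skip
 * past every entry the list still holds instead of assuming it is empty. */
static void account_on_error(struct example_rx_state *s)
{
	s->tmpq_len += 1;                         /* models __skb_queue_head() */
	s->rsp_cons += (unsigned int)s->tmpq_len; /* models skb_queue_len() */
}
```

With two skbs still on the list, the consumer index advances by three rather than by one.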
Markus Elfring [Sun, 15 Sep 2019 17:21:05 +0000 (19:21 +0200)]
s390/ctcm: Delete unnecessary checks before the macro call “dev_kfree_skb”

The dev_kfree_skb() function also performs input parameter validation.
Thus the test around the shown calls is not needed.

This issue was detected by using the Coccinelle software.

Signed-off-by: Markus Elfring <[email protected]>
Signed-off-by: Julian Wiedmann <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
Sameeh Jubran [Sun, 15 Sep 2019 14:29:44 +0000 (17:29 +0300)]
net: ena: don't wake up tx queue when down

There is a race condition that can occur when calling ena_down().
The ena_clean_tx_irq() function, which is part of the napi handler,
might wake up the tx queue when the queue is supposed to be down
(during recovery or while changing the size of the queues, for
example). This causes the ena_start_xmit() function to trigger
and possibly try to access the destroyed queues.

The race is illustrated below:

Flow A:                                       Flow B(napi handler)
ena_down()
   netif_carrier_off()
   netif_tx_disable()
                                                      ena_clean_tx_irq()
                                                         netif_tx_wake_queue()
   ena_napi_disable_all()
   ena_destroy_all_io_queues()

After these flows the tx queue is active and ena_start_xmit() accesses
the destroyed queue which leads to a kernel panic.

Fixes: 1738cd3ed342 ("net: ena: Add a driver for Amazon Elastic Network Adapters (ENA)")

Signed-off-by: Sameeh Jubran <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
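One plausible way to avoid the wake-up in Flow B is to check a "device is up" flag before waking the queue. The sketch below is a single-threaded model under that assumption; the struct, its fields, and the function name are hypothetical, and the commit message does not spell out the exact guard the driver uses. Real code would also need proper synchronization between the two flows.

```c
#include <stdbool.h>

/* Hypothetical model of the adapter state relevant to the race. */
struct example_adapter {
	bool dev_up;      /* cleared at the start of the "down" flow */
	bool queue_awake; /* models the tx queue's started state */
};

/* Models the napi-side cleanup: only wake the queue while the device is
 * up. Returns true if the queue was woken. */
static bool maybe_wake_tx_queue(struct example_adapter *a)
{
	if (!a->dev_up)
		return false;
	a->queue_awake = true;
	return true;
}
```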
David S. Miller [Mon, 16 Sep 2019 19:39:27 +0000 (21:39 +0200)]
Merge branch 'drop_monitor-Better-sanitize-notified-packets'

Ido Schimmel says:

====================
drop_monitor: Better sanitize notified packets

When working in 'packet' mode, drop monitor generates a notification
with a potentially truncated payload of the dropped packet. The payload
is copied from the MAC header, but I forgot to check that the MAC header
was set, so do it now.

Patch #1 sets the offsets to the various protocol layers in netdevsim,
so that it will continue to work after the MAC header check is added to
drop monitor in patch #2.
====================

Signed-off-by: David S. Miller <[email protected]>
Ido Schimmel [Sun, 15 Sep 2019 06:46:36 +0000 (09:46 +0300)]
drop_monitor: Better sanitize notified packets

When working in 'packet' mode, drop monitor generates a notification
with a potentially truncated payload of the dropped packet. The payload
is copied from the MAC header, but I forgot to check that the MAC header
was set, so do it now.

Fixes: ca30707dee2b ("drop_monitor: Add packet alert mode")
Fixes: 5e58109b1ea4 ("drop_monitor: Add support for packet alert mode for hardware drops")
Acked-by: Jiri Pirko <[email protected]>
Signed-off-by: Ido Schimmel <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
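The check described above amounts to refusing to copy a payload when the MAC header offset was never recorded. The following sketch models that with a sentinel value; the struct, the sentinel macro, and the function names are hypothetical and do not reproduce the kernel's sk_buff layout.

```c
#include <stdbool.h>
#include <stdint.h>

#define EXAMPLE_OFFSET_UNSET UINT16_MAX /* sentinel: offset never set */

/* Hypothetical packet descriptor. */
struct example_pkt {
	uint16_t mac_header; /* offset of the MAC header in the buffer */
	uint16_t len;        /* bytes available from the MAC header on */
};

static bool mac_header_was_set(const struct example_pkt *p)
{
	return p->mac_header != EXAMPLE_OFFSET_UNSET;
}

/* Returns the number of payload bytes to copy into the notification,
 * or 0 if the packet must be skipped because its MAC header is unset. */
static uint16_t payload_len_to_copy(const struct example_pkt *p,
				    uint16_t trunc_len)
{
	if (!mac_header_was_set(p))
		return 0;
	return p->len < trunc_len ? p->len : trunc_len;
}
```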
Ido Schimmel [Sun, 15 Sep 2019 06:46:35 +0000 (09:46 +0300)]
netdevsim: Set offsets to various protocol layers

The driver periodically generates "trapped" UDP packets that it then
passes on to devlink. Set the offsets to the various protocol layers.

This is a prerequisite to the next patch, where drop monitor is taught
to check that the offset to the MAC header was set.

Acked-by: Jiri Pirko <[email protected]>
Signed-off-by: Ido Schimmel <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
David S. Miller [Mon, 16 Sep 2019 19:32:58 +0000 (21:32 +0200)]
Merge branch 'tc-taprio-offload-for-SJA1105-DSA'

Vladimir Oltean says:

====================
tc-taprio offload for SJA1105 DSA

This is the third attempt to submit the tc-taprio offload model for
inclusion in the networking tree. The sja1105 switch driver will provide
the first implementation of the offload. Only the bare minimum is added:

- The offload model and a DSA pass-through
- The hardware implementation
- The interaction with the netdev queues in the tagger code
- Documentation

What has been removed from previous attempts is support for
PTP-as-clocksource in sja1105, as well as configuring the traffic class
for management traffic.  These will be added as soon as the offload
model is settled.
====================

Signed-off-by: David S. Miller <[email protected]>
Vladimir Oltean [Sun, 15 Sep 2019 02:00:03 +0000 (05:00 +0300)]
docs: net: dsa: sja1105: Add info about the Time-Aware Scheduler

While not an exhaustive usage tutorial, this describes the details
needed to build more complex scenarios.

Signed-off-by: Vladimir Oltean <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
Vladimir Oltean [Sun, 15 Sep 2019 02:00:02 +0000 (05:00 +0300)]
net: dsa: sja1105: Configure the Time-Aware Scheduler via tc-taprio offload

This qdisc offload is the closest thing to what the SJA1105 supports in
hardware for time-based egress shaping. The switch core really is built
around SAE AS6802/TTEthernet (a TTTech standard) but can be made to
operate similarly to IEEE 802.1Qbv with some constraints:

- The gate control list is a global list for all ports. There are 8
  execution threads that iterate through this global list in parallel.
  I don't know why 8, there are only 4 front-panel ports.

- Care must be taken by the user to make sure that two execution threads
  never get to execute a GCL entry simultaneously. I created an O(n^4)
  checker for this hardware limitation, prior to accepting a taprio
  offload configuration as valid.

- The spec says that if a GCL entry's interval is shorter than the frame
  length, you shouldn't send it (and end up in head-of-line blocking).
  Well, this switch does anyway.

- The switch has no concept of ADMIN and OPER configurations. Because
  it's so simple, the TAS settings are loaded through the static config
  tables interface, so there isn't even place for any discussion about
  'graceful switchover between ADMIN and OPER'. You just reset the
  switch and upload a new OPER config.

- The switch accepts multiple time sources for the gate events. Right
  now I am using the standalone clock source as opposed to PTP. So the
  base time parameter doesn't really do much. Support for the PTP clock
  source will be added in a future series.

Signed-off-by: Vladimir Oltean <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
Vladimir Oltean [Sun, 15 Sep 2019 02:00:01 +0000 (05:00 +0300)]
net: dsa: sja1105: Advertise the 8 TX queues

This is a preparation patch for the tc-taprio offload (and potentially
for other future offloads such as tc-mqprio).

Instead of looking directly at skb->priority during xmit, let's get the
netdev queue and the queue-to-traffic-class mapping, and put the
resulting traffic class into the dsa_8021q PCP field. The switch is
configured with a 1-to-1 PCP-to-ingress-queue-to-egress-queue mapping
(see vlan_pmap in sja1105_main.c), so the effect is that we can inject
into a front-panel's egress traffic class through VLAN tagging from
Linux, completely transparently.

Unfortunately the switch doesn't look at the VLAN PCP in the case of
management traffic to/from the CPU (link-local frames at
01-80-C2-xx-xx-xx or 01-1B-19-xx-xx-xx) so we can't alter the
transmission queue of this type of traffic on a frame-by-frame basis. It
is only selected through the "hostprio" setting which ATM is hardcoded in
the driver to 7.

Signed-off-by: Vladimir Oltean <[email protected]>
Reviewed-by: Florian Fainelli <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
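The queue to traffic class to PCP path described above can be sketched as follows. The identity table and all `EXAMPLE_`/`example_` names are assumptions for illustration; the only fixed fact relied on is the 802.1Q tag layout, where the PCP occupies the top 3 bits of the 16-bit TCI and the VID the low 12 bits.

```c
#include <stdint.h>

#define EXAMPLE_VLAN_PRIO_SHIFT 13     /* PCP is the top 3 bits of the TCI */
#define EXAMPLE_VLAN_VID_MASK   0x0fff /* VID is the low 12 bits */
#define EXAMPLE_NUM_TX_QUEUES   8

/* Hypothetical queue-to-traffic-class table; identity mapping here. */
static const uint8_t example_queue_to_tc[EXAMPLE_NUM_TX_QUEUES] = {
	0, 1, 2, 3, 4, 5, 6, 7
};

/* Build a VLAN TCI whose PCP field carries the traffic class that the
 * selected netdev queue maps to. */
static uint16_t build_tci(uint16_t vid, unsigned int queue)
{
	uint8_t tc = example_queue_to_tc[queue % EXAMPLE_NUM_TX_QUEUES];

	return (uint16_t)(((uint16_t)tc << EXAMPLE_VLAN_PRIO_SHIFT) |
			  (vid & EXAMPLE_VLAN_VID_MASK));
}
```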
Vladimir Oltean [Sun, 15 Sep 2019 02:00:00 +0000 (05:00 +0300)]
net: dsa: sja1105: Add static config tables for scheduling

In order to support tc-taprio offload, the TTEthernet egress scheduling
core registers must be made visible through the static interface.

Signed-off-by: Vladimir Oltean <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
Vladimir Oltean [Sun, 15 Sep 2019 01:59:59 +0000 (04:59 +0300)]
net: dsa: Pass ndo_setup_tc slave callback to drivers

DSA currently handles shared block filters (for the classifier-action
qdisc) in the core due to what I believe are simply pragmatic reasons -
hiding the complexity from drivers and offering a simple API for port
mirroring.

Extend the dsa_slave_setup_tc function by passing all other qdisc
offloads to the driver layer, where the driver may choose what it
implements and how. DSA is simply a pass-through in this case.

Signed-off-by: Vladimir Oltean <[email protected]>
Acked-by: Kurt Kanzenbach <[email protected]>
Reviewed-by: Florian Fainelli <[email protected]>
Acked-by: Ilias Apalodimas <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
Vinicius Costa Gomes [Sun, 15 Sep 2019 01:59:58 +0000 (04:59 +0300)]
taprio: Add support for hardware offloading

This allows taprio to offload the schedule enforcement to capable
network cards, resulting in more precise windows and less CPU usage.

The gate mask acts on traffic classes (groups of queues of same
priority), as specified in IEEE 802.1Q-2018, and following the existing
taprio and mqprio semantics.
It is up to the driver to perform conversion between tc and individual
netdev queues if for some reason it needs to make that distinction.

Full offload is requested from the network interface by specifying
"flags 2" in the tc qdisc creation command, which in turn corresponds to
the TCA_TAPRIO_ATTR_FLAG_FULL_OFFLOAD bit.

The important detail here is the clockid which is implicitly /dev/ptpN
for full offload, and hence not configurable.

A reference counting API is added to support the use case where Ethernet
drivers need to keep the taprio offload structure locally (i.e. they are
a multi-port switch driver, and configuring a port depends on the
settings of other ports as well). The refcount_t variable is kept in a
private structure (__tc_taprio_qopt_offload) and not exposed to drivers.

In the future, the private structure might also be expanded with a
backpointer to taprio_sched *q, to implement the notification system
described in the patch (of when admin became oper, or an error occurred,
etc, so the offload can be monitored with 'tc qdisc show').

Signed-off-by: Vinicius Costa Gomes <[email protected]>
Signed-off-by: Voon Weifeng <[email protected]>
Signed-off-by: Vladimir Oltean <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
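Two of the mechanisms described above, the "flags 2" full-offload request and the reference-counted offload structure, can be sketched as below. All names are hypothetical. The bit position follows from the text's statement that "flags 2" selects full offload; the counter is a plain integer rather than the kernel's refcount_t, and a `freed` flag stands in for actually releasing memory.

```c
#include <stdbool.h>
#include <stdint.h>

/* "flags 2" in the text implies that full offload is bit 1. */
#define EXAMPLE_TAPRIO_FLAG_FULL_OFFLOAD (1u << 1)

static bool full_offload_requested(uint32_t flags)
{
	return (flags & EXAMPLE_TAPRIO_FLAG_FULL_OFFLOAD) != 0;
}

/* Hypothetical offload object whose lifetime is shared with a driver. */
struct example_offload {
	unsigned int users; /* reference count */
	bool freed;         /* set instead of calling free(), for the sketch */
};

/* A driver that keeps the structure takes a reference. */
static struct example_offload *example_offload_get(struct example_offload *o)
{
	o->users++;
	return o;
}

/* Dropping the last reference releases the structure. */
static void example_offload_put(struct example_offload *o)
{
	if (--o->users == 0)
		o->freed = true;
}
```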
Jack Morgenstein [Mon, 16 Sep 2019 07:11:54 +0000 (10:11 +0300)]
RDMA: Fix double-free in srq creation error flow

The cited commit introduced a double-free of the srq buffer in the error
flow of procedure __uverbs_create_xsrq().

The problem is that ib_destroy_srq_user() called in the error flow also
frees the srq buffer.

Thus, if uverbs_response() fails in __uverbs_create_xsrq(), the srq buffer
will be freed twice.

Cc: <[email protected]>
Fixes: 68e326dea1db ("RDMA: Handle SRQ allocations by IB/core")
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jack Morgenstein <[email protected]>
Signed-off-by: Leon Romanovsky <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
Gal Pressman [Tue, 10 Sep 2019 13:42:58 +0000 (14:42 +0100)]
RDMA/efa: Fix incorrect error print

The error print should indicate that it failed to get the queue
attributes, not network attributes.

Link: https://lore.kernel.org/r/[email protected]
Reviewed-by: Daniel Kranzdorf <[email protected]>
Reviewed-by: Firas JahJah <[email protected]>
Signed-off-by: Gal Pressman <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
Danit Goldberg [Mon, 16 Sep 2019 06:48:18 +0000 (09:48 +0300)]
IB/mlx5: Free mpi in mp_slave mode

ib_add_slave_port() allocates a multiport struct but never frees it.
Don't leak memory, free the allocated mpi struct during driver unload.

Cc: <[email protected]>
Fixes: 32f69e4be269 ("{net, IB}/mlx5: Manage port association for multiport RoCE")
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Danit Goldberg <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
Dmitry Torokhov [Mon, 16 Sep 2019 16:56:27 +0000 (09:56 -0700)]
Merge branch 'next' into for-linus

Prepare input updates for 5.4 merge window.

Aurelien Aptel [Mon, 16 Sep 2019 03:45:42 +0000 (05:45 +0200)]
cifs: cifsroot: add more err checking

make cifs more verbose about buffer size errors
and add some comments

Signed-off-by: Aurelien Aptel <[email protected]>
Signed-off-by: Steve French <[email protected]>
Steve French [Mon, 16 Sep 2019 03:38:52 +0000 (22:38 -0500)]
smb3: add missing worker function for SMB3 change notify

SMB3 change notify is important to allow applications to wait
on directory change events of different types (e.g. adding
and deleting files from other systems). Add worker functions
for this.

Acked-by: Aurelien Aptel <[email protected]>
Signed-off-by: Steve French <[email protected]>
Paulo Alcantara (SUSE) [Tue, 16 Jul 2019 22:04:50 +0000 (19:04 -0300)]
cifs: Add support for root file systems

Introduce a new CONFIG_CIFS_ROOT option to handle root file systems
over a SMB share.

In order to mount the root file system during the init process, make
cifs.ko perform non-blocking socket operations while mounting and
accessing it.

Cc: Steve French <[email protected]>
Reviewed-by: Aurelien Aptel <[email protected]>
Signed-off-by: Paulo Alcantara (SUSE) <[email protected]>
Signed-off-by: Steve French <[email protected]>
Aurelien Aptel [Mon, 16 Sep 2019 02:28:36 +0000 (04:28 +0200)]
cifs: modefromsid: make room for 4 ACE

When mounting with modefromsid, we end up writing 4 ACEs in a security
descriptor that only has room for 3, thus triggering an out-of-bounds
write. Fix this by changing the min size of a security descriptor.

Signed-off-by: Aurelien Aptel <[email protected]>
Signed-off-by: Steve French <[email protected]>
Steve French [Fri, 13 Sep 2019 21:47:31 +0000 (16:47 -0500)]
smb3: fix potential null dereference in decrypt offload

commit a091c5f67c99 ("smb3: allow parallelizing decryption of reads")
had a potential null dereference

Reported-by: kbuild test robot <[email protected]>
Reported-by: Dan Carpenter <[email protected]>
Suggested-by: Pavel Shilovsky <[email protected]>
Signed-off-by: Steve French <[email protected]>
Steve French [Thu, 12 Sep 2019 22:52:54 +0000 (17:52 -0500)]
smb3: fix unmount hang in open_shroot

An earlier patch "CIFS: fix deadlock in cached root handling"
did not completely address the deadlock in open_shroot. This
patch addresses the deadlock.

In testing the recent patch:
  smb3: improve handling of share deleted (and share recreated)
we were able to reproduce the open_shroot deadlock to one
of the target servers in unmount in a delete share scenario.

Fixes: 7e5a70ad88b1e ("CIFS: fix deadlock in cached root handling")
This is version 2 of this patch. An earlier version of this
patch "smb3: fix unmount hang in open_shroot" had a problem
found by Dan.

Reported-by: kbuild test robot <[email protected]>
Reported-by: Dan Carpenter <[email protected]>
Suggested-by: Pavel Shilovsky <[email protected]>
Reviewed-by: Pavel Shilovsky <[email protected]>
Signed-off-by: Steve French <[email protected]>
CC: Aurelien Aptel <[email protected]>
CC: Stable <[email protected]>
Steve French [Thu, 12 Sep 2019 02:46:20 +0000 (21:46 -0500)]
smb3: allow disabling requesting leases

In some cases to work around server bugs or performance
problems it can be helpful to be able to disable requesting
SMB2.1/SMB3 leases on a particular mount (not to all servers
and all shares we are mounted to). Add new mount parm
"nolease" which turns off requesting leases on directory
or file opens.  Currently the only way to disable leases is
globally through a module load parameter. This is more
granular.

Suggested-by: Pavel Shilovsky <[email protected]>
Signed-off-by: Steve French <[email protected]>
Reviewed-by: Ronnie Sahlberg <[email protected]>
Reviewed-by: Pavel Shilovsky <[email protected]>
CC: Stable <[email protected]>
Steve French [Wed, 11 Sep 2019 05:07:36 +0000 (00:07 -0500)]
smb3: improve handling of share deleted (and share recreated)

When a share is deleted, returning EIO is confusing and no useful
information is logged.  Improve the handling of this case by
at least logging a better error for this (and also mapping the error
differently to EREMCHG).  See e.g. the new messages that would be logged:

[55243.639530] server share \\192.168.1.219\scratch deleted
[55243.642568] CIFS VFS: \\192.168.1.219\scratch BAD_NETWORK_NAME: \\192.168.1.219\scratch

In addition, fix the case where a share is deleted and then recreated
with the same name so that it now works. This is sometimes done, for
example, because the admin had to move a share to a different, bigger
local drive when it was running low on space.

Signed-off-by: Steve French <[email protected]>
Reviewed-by: Ronnie Sahlberg <[email protected]>
Steve French [Tue, 10 Sep 2019 03:57:11 +0000 (22:57 -0500)]
smb3: display max smb3 requests in flight at any one time

Displayed in /proc/fs/cifs/Stats once for each
socket we are connected to.

This allows us to find out the maximum number of requests that
have been in flight at any one time. Note that
/proc/fs/cifs/Stats can be reset if you want to look for the
maximum over a small period of time.

Sample output (immediately after mount):

Resources in use
CIFS Session: 1
Share (unique mount targets): 2
SMB Request/Response Buffer: 1 Pool size: 5
SMB Small Req/Resp Buffer: 1 Pool size: 30
Operations (MIDs): 0

0 session 0 share reconnects
Total vfs operations: 5 maximum at one time: 2

Max requests in flight: 2
1) \\localhost\scratch
SMBs: 18
Bytes read: 0  Bytes written: 0
...

Signed-off-by: Steve French <[email protected]>
Reviewed-by: Pavel Shilovsky <[email protected]>
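Tracking a "max requests in flight" statistic of this kind is typically a high-water mark updated on every send. The sketch below shows that idea only; the struct and function names are hypothetical, and locking (which a real per-socket counter would need) is omitted.

```c
/* Hypothetical per-connection counters. */
struct example_server {
	unsigned int in_flight;     /* requests currently on the wire */
	unsigned int max_in_flight; /* highest value in_flight has reached */
};

/* Called when a request is sent: bump the counter and the high-water mark. */
static void request_sent(struct example_server *s)
{
	s->in_flight++;
	if (s->in_flight > s->max_in_flight)
		s->max_in_flight = s->in_flight;
}

/* Called when the matching response arrives. */
static void response_received(struct example_server *s)
{
	if (s->in_flight > 0)
		s->in_flight--;
}
```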
Steve French [Mon, 9 Sep 2019 18:30:15 +0000 (13:30 -0500)]
smb3: only offload decryption of read responses if multiple requests

No point in offloading read decryption if no other requests on the
wire

Signed-off-by: Steve French <[email protected]>
Signed-off-by: Ronnie Sahlberg <[email protected]>
Ronnie Sahlberg [Mon, 9 Sep 2019 05:30:00 +0000 (15:30 +1000)]
cifs: add a helper to find an existing readable handle to a file

and convert smb2_query_path_info() to use it.
This will eliminate the need for a SMB2_Create when we already have an
open handle that can be used. This will also prevent an oplock break
in case the other handle holds a lease.

Signed-off-by: Ronnie Sahlberg <[email protected]>
Signed-off-by: Steve French <[email protected]>
5 years agosmb3: enable offload of decryption of large reads via mount option
Steve French [Mon, 9 Sep 2019 04:22:02 +0000 (23:22 -0500)]
smb3: enable offload of decryption of large reads via mount option

Disable offload of the decryption of encrypted read responses
by default (equivalent to setting this new mount option "esize=0").

Allow setting the minimum encrypted read response size that we
will choose to offload to a worker thread - it is now configurable
via a new mount option "esize="

Depending on which encryption mechanism (GCM vs. CCM) and
the number of reads that will be issued in parallel and the
performance of the network and CPU on the client, it may make
sense to enable this since it can provide substantial benefit when
multiple large reads are in flight at the same time.
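
As a rough sketch of the decision described above, the check could look
like the following userspace C. The struct, the function name and the
"more than one request in flight" condition are assumptions made for
illustration; only the meaning of esize (0 disables offload, otherwise a
minimum response size) comes from the text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical holder for the "esize=" mount option value. */
struct mount_opts {
    size_t esize; /* 0 means offload is disabled (the default) */
};

/* Decide whether an encrypted read response goes to a worker thread. */
bool should_offload_decrypt(const struct mount_opts *opts, size_t read_len,
                            unsigned int in_flight)
{
    assert(opts != NULL);
    if (opts->esize == 0)  /* feature not enabled at mount time */
        return false;
    if (in_flight <= 1)    /* nothing else to overlap with */
        return false;
    return read_len >= opts->esize;
}
```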

Signed-off-by: Steve French <[email protected]>
Signed-off-by: Ronnie Sahlberg <[email protected]>
5 years agosmb3: allow parallelizing decryption of reads
Steve French [Sat, 7 Sep 2019 06:09:49 +0000 (01:09 -0500)]
smb3: allow parallelizing decryption of reads

decrypting large reads on encrypted shares can be slow (e.g. adding
multiple milliseconds per-read on non-GCM capable servers or
when mounting with dialects prior to SMB3.1.1) - allow parallelizing
of read decryption by launching worker threads.

Testing to Samba on localhost showed 25% improvement.
Testing to remote server showed very large improvement when
more than one 'cp' command was run at one time.

Signed-off-by: Steve French <[email protected]>
Signed-off-by: Ronnie Sahlberg <[email protected]>
5 years agocifs: add a debug macro that prints \\server\share for errors
Ronnie Sahlberg [Wed, 4 Sep 2019 02:32:41 +0000 (12:32 +1000)]
cifs: add a debug macro that prints \\server\share for errors

Where we have a tcon available we can log \\server\share as part
of the message. Only do this for the VFS log level.

Signed-off-by: Ronnie Sahlberg <[email protected]>
Signed-off-by: Steve French <[email protected]>
5 years agosmb3: fix signing verification of large reads
Steve French [Thu, 5 Sep 2019 04:07:52 +0000 (23:07 -0500)]
smb3: fix signing verification of large reads

Code cleanup in the 5.1 kernel changed the array
passed into signing verification on large reads leading
to warning messages being logged when copying files to local
systems from remote.

   SMB signature verification returned error = -5

This changeset fixes verification of SMB3 signatures of large
reads.

Suggested-by: Pavel Shilovsky <[email protected]>
Signed-off-by: Steve French <[email protected]>
Reviewed-by: Ronnie Sahlberg <[email protected]>
5 years agosmb3: allow skipping signature verification for perf sensitive configurations
Steve French [Wed, 4 Sep 2019 02:18:49 +0000 (21:18 -0500)]
smb3: allow skipping signature verification for perf sensitive configurations

Add new mount option "signloosely" which enables signing but skips the
sometimes expensive signing checks in the responses (signatures are
calculated and sent correctly in the SMB2/SMB3 requests even with this
mount option but skipped in the responses).  Although weaker for security
(and also data integrity in case a packet were corrupted), this can provide
enough of a performance benefit (calculating the signature to verify a
packet can be expensive especially for large packets) to be useful in
some cases.

Signed-off-by: Steve French <[email protected]>
Reviewed-by: Ronnie Sahlberg <[email protected]>
5 years agosmb3: add dynamic tracepoints for flush and close
Steve French [Tue, 3 Sep 2019 23:35:42 +0000 (18:35 -0500)]
smb3: add dynamic tracepoints for flush and close

We only had dynamic tracepoints on errors in flush
and close, but it may be helpful to trace entry
and non-error exits for those.  Sample trace examples
(excerpts) from "cp" and "dd" show two of the new
tracepoints.

  cp-22823 [002] .... 123439.179701: smb3_enter: _cifsFileInfo_put: xid=10
  cp-22823 [002] .... 123439.179705: smb3_close_enter: xid=10 sid=0x98871327 tid=0xfcd585ff fid=0xc7f84682
  cp-22823 [002] .... 123439.179711: smb3_cmd_enter: sid=0x98871327 tid=0xfcd585ff cmd=6 mid=43
  cp-22823 [002] .... 123439.180175: smb3_cmd_done: sid=0x98871327 tid=0xfcd585ff cmd=6 mid=43
  cp-22823 [002] .... 123439.180179: smb3_close_done: xid=10 sid=0x98871327 tid=0xfcd585ff fid=0xc7f84682

  dd-22981 [003] .... 123696.946011: smb3_flush_enter: xid=24 sid=0x98871327 tid=0xfcd585ff fid=0x1917736f
  dd-22981 [003] .... 123696.946013: smb3_cmd_enter: sid=0x98871327 tid=0xfcd585ff cmd=7 mid=123
  dd-22981 [003] .... 123696.956639: smb3_cmd_done: sid=0x98871327 tid=0x0 cmd=7 mid=123
  dd-22981 [003] .... 123696.956644: smb3_flush_done: xid=24 sid=0x98871327 tid=0xfcd585ff fid=0x1917736f

Signed-off-by: Steve French <[email protected]>
Reviewed-by: Ronnie Sahlberg <[email protected]>
5 years agosmb3: log warning if CSC policy conflicts with cache mount option
Steve French [Tue, 3 Sep 2019 22:49:46 +0000 (17:49 -0500)]
smb3: log warning if CSC policy conflicts with cache mount option

If the server config (e.g. Samba smb.conf "csc policy = disable")
for the share indicates that the share should not be cached, log
a warning message if forced client side caching ("cache=ro" or
"cache=singleclient") is requested on mount.

Signed-off-by: Steve French <[email protected]>
Reviewed-by: Ronnie Sahlberg <[email protected]>
5 years agosmb3: add mount option to allow RW caching of share accessed by only 1 client
Steve French [Fri, 30 Aug 2019 07:12:41 +0000 (02:12 -0500)]
smb3: add mount option to allow RW caching of share accessed by only 1 client

If a share is known to be accessed by only one client, we
can aggressively cache writes, not just reads, to it.

Add "cache=" option (cache=singleclient) for mounting read write shares
(that will not be read or written to from other clients while we have
it mounted) in order to improve performance.

Signed-off-by: Steve French <[email protected]>
5 years agosmb3: add some more descriptive messages about share when mounting cache=ro
Steve French [Fri, 30 Aug 2019 03:33:38 +0000 (22:33 -0500)]
smb3: add some more descriptive messages about share when mounting cache=ro

Add some additional logging so the user can see if the share they
mounted with cache=ro is considered read only by the server

CIFS: Attempting to mount //localhost/test
CIFS VFS: mounting share with read only caching. Ensure that the share will not be modified while in use.
CIFS VFS: read only mount of RW share

CIFS: Attempting to mount //localhost/test-ro
CIFS VFS: mounting share with read only caching. Ensure that the share will not be modified while in use.
CIFS VFS: mounted to read only share

Signed-off-by: Steve French <[email protected]>
Reviewed-by: Ronnie Sahlberg <[email protected]>
5 years agosmb3: add mount option to allow forced caching of read only share
Steve French [Wed, 28 Aug 2019 04:58:54 +0000 (23:58 -0500)]
smb3: add mount option to allow forced caching of read only share

If a share is immutable (at least for the period that it will
be mounted) it would be helpful to not have to revalidate
dentries repeatedly that we know can not be changed remotely.

Add "cache=" option (cache=ro) for mounting read only shares
in order to improve performance in cases in which we know that
the share will not be changing while it is in use.

Signed-off-by: Steve French <[email protected]>
5 years agocifs: fix dereference on ses before it is null checked
Colin Ian King [Mon, 2 Sep 2019 15:10:59 +0000 (16:10 +0100)]
cifs: fix dereference on ses before it is null checked

The assignment of pointer server dereferences pointer ses, however,
this dereference occurs before ses is null checked and hence we
have a potential null pointer dereference.  Fix this by only
dereferencing ses after it has been null checked.
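
The pattern being fixed can be shown with a small standalone C sketch. The
types and the function below are simplified stand-ins, not the cifs
structures; the point is only the ordering of the check and the
dereference.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for illustration only. */
struct server { const char *hostname; };
struct session { struct server *server; };

/*
 * Problematic ordering, shown for contrast:
 *     const struct server *server = ses->server;  (dereferences ses)
 *     if (ses == NULL) return NULL;               (check comes too late)
 */
const char *session_hostname(const struct session *ses)
{
    const struct server *server;

    if (ses == NULL)        /* check first ... */
        return NULL;
    server = ses->server;   /* ... dereference afterwards */
    if (server == NULL)
        return NULL;
    return server->hostname;
}

/* Small usage example. */
void null_check_example(void)
{
    struct server srv = { "fileserver" };
    struct session ses = { &srv };

    assert(session_hostname(NULL) == NULL);
    assert(session_hostname(&ses) == srv.hostname);
}
```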

Addresses-Coverity: ("Dereference before null check")
Fixes: 2808c6639104 ("cifs: add new debugging macro cifs_server_dbg")
Signed-off-by: Colin Ian King <[email protected]>
Signed-off-by: Steve French <[email protected]>
5 years agocifs: add new debugging macro cifs_server_dbg
Ronnie Sahlberg [Wed, 28 Aug 2019 07:15:35 +0000 (17:15 +1000)]
cifs: add new debugging macro cifs_server_dbg

which can be used from contexts where we have a TCP_Server_Info *server.
This new macro will prepend the debugging string with "Server:<servername> "
which will help when debugging issues on hosts with many cifs connections
to several different servers.

Convert a bunch of cifs_dbg(VFS) calls to cifs_server_dbg(VFS)

Signed-off-by: Ronnie Sahlberg <[email protected]>
Signed-off-by: Steve French <[email protected]>
5 years agocifs: use existing handle for compound_op(OP_SET_INFO) when possible
Ronnie Sahlberg [Thu, 29 Aug 2019 23:53:56 +0000 (09:53 +1000)]
cifs: use existing handle for compound_op(OP_SET_INFO) when possible

If we already have a writable handle for a path we want to set the
attributes for then use that instead of a create/set-info/close compound.

Signed-off-by: Ronnie Sahlberg <[email protected]>
Signed-off-by: Steve French <[email protected]>
5 years agocifs: create a helper to find a writeable handle by path name
Ronnie Sahlberg [Thu, 29 Aug 2019 22:25:46 +0000 (08:25 +1000)]
cifs: create a helper to find a writeable handle by path name

rename() takes a path for old_file and in SMB2 we used to just create
a compound for create(old_path)/rename/close().
If we already have a writable handle we can avoid the create() and close()
altogether and just use the existing handle.

For this situation, as we avoid doing the create()
we also avoid triggering an oplock break for the existing handle.
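
A minimal sketch of such a lookup, assuming a flat table of open handles
keyed by path. The struct layout and the function name are hypothetical;
the real helper works on the cifs inode and file structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical record of one open handle. */
struct open_handle {
    const char *path;
    bool writable;
    unsigned long fid;
};

/* Return an already-open writable handle for path, or NULL if none. */
const struct open_handle *find_writable_handle(const struct open_handle *handles,
                                               size_t count, const char *path)
{
    assert(path != NULL);
    for (size_t i = 0; i < count; i++) {
        if (handles[i].writable && strcmp(handles[i].path, path) == 0)
            return &handles[i];
    }
    return NULL;
}
```

When the lookup returns NULL, the caller would fall back to the
create/rename/close compound.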

Signed-off-by: Ronnie Sahlberg <[email protected]>
Signed-off-by: Steve French <[email protected]>
5 years agocifs: remove set but not used variables
YueHaibing [Fri, 23 Aug 2019 12:15:35 +0000 (20:15 +0800)]
cifs: remove set but not used variables

Fixes gcc '-Wunused-but-set-variable' warning:

fs/cifs/file.c: In function cifs_lock:
fs/cifs/file.c:1696:24: warning: variable cinode set but not used [-Wunused-but-set-variable]
fs/cifs/file.c: In function cifs_write:
fs/cifs/file.c:1765:23: warning: variable cifs_sb set but not used [-Wunused-but-set-variable]
fs/cifs/file.c: In function collect_uncached_read_data:
fs/cifs/file.c:3578:20: warning: variable tcon set but not used [-Wunused-but-set-variable]

'cinode' is never used since introduced by
commit 03776f4516bc ("CIFS: Simplify byte range locking code")
'cifs_sb' is not used since commit cb7e9eabb2b5 ("CIFS: Use
multicredits for SMB 2.1/3 writes").
'tcon' is not used since commit d26e2903fc10 ("smb3: fix bytes_read statistics")

Reported-by: Hulk Robot <[email protected]>
Signed-off-by: YueHaibing <[email protected]>
Signed-off-by: Steve French <[email protected]>
5 years agosmb3: Incorrect size for netname negotiate context
Steve French [Mon, 5 Aug 2019 22:07:26 +0000 (17:07 -0500)]
smb3: Incorrect size for netname negotiate context

It is not null terminated (length was off by two).

Also see similar change to Samba:

https://gitlab.com/samba-team/samba/merge_requests/666

Reported-by: Stefan Metzmacher <[email protected]>
Signed-off-by: Steve French <[email protected]>
5 years agocifs: remove unused variable
zhengbin [Tue, 20 Aug 2019 14:00:47 +0000 (22:00 +0800)]
cifs: remove unused variable

In smb3_punch_hole, variable cifsi set but not used, remove it.
In cifs_lock, variable netfid set but not used, remove it.

Reported-by: Hulk Robot <[email protected]>
Signed-off-by: zhengbin <[email protected]>
Signed-off-by: Steve French <[email protected]>
5 years agocifs: remove redundant assignment to variable rc
Colin Ian King [Wed, 31 Jul 2019 09:05:26 +0000 (10:05 +0100)]
cifs: remove redundant assignment to variable rc

Variable rc is being initialized with a value that is never read
and rc is being re-assigned a little later on. The assignment is
redundant and hence can be removed.

Addresses-Coverity: ("Unused value")
Signed-off-by: Colin Ian King <[email protected]>
Signed-off-by: Steve French <[email protected]>
5 years agosmb3: add missing flag definitions
Steve French [Thu, 25 Jul 2019 23:19:42 +0000 (18:19 -0500)]
smb3: add missing flag definitions

SMB3 and 3.1.1 added two additional flags including
the priority mask.  Add them to our protocol definitions
in smb2pdu.h.  See MS-SMB2 2.2.1.2

Signed-off-by: Steve French <[email protected]>
Signed-off-by: Ronnie Sahlberg <[email protected]>
Reviewed-by: Pavel Shilovsky <[email protected]>
5 years agocifs: add passthrough for smb2 setinfo
Ronnie Sahlberg [Thu, 25 Jul 2019 03:08:43 +0000 (13:08 +1000)]
cifs: add passthrough for smb2 setinfo

Add support to send smb2 set-info commands from userspace.

Signed-off-by: Ronnie Sahlberg <[email protected]>
Signed-off-by: Steve French <[email protected]>
Reviewed-by: Paulo Alcantara <[email protected]>
5 years agocifs: prepare SMB2_Flush to be usable in compounds
Ronnie Sahlberg [Tue, 16 Jul 2019 05:07:08 +0000 (15:07 +1000)]
cifs: prepare SMB2_Flush to be usable in compounds

Create smb2_flush_init() and smb2_flush_free() so we can use the flush command
in compounds.

Signed-off-by: Ronnie Sahlberg <[email protected]>
Signed-off-by: Steve French <[email protected]>
5 years agocifs: allow chmod to set mode bits using special sid
Steve French [Fri, 19 Jul 2019 08:15:55 +0000 (08:15 +0000)]
cifs: allow chmod to set mode bits using special sid

    When mounting with "modefromsid" set mode bits (chmod) by
    adding ACE with special SID (S-1-5-88-3-<mode>) to the ACL.
    Subsequent patch will fix setting default mode on file
    create and mkdir.

    See e.g.
        https://docs.microsoft.com/en-us/previous-versions/windows/it-pro/windows-server-2008-R2-and-2008/hh509017(v=ws.10)

Signed-off-by: Steve French <[email protected]>
5 years agocifs: get mode bits from special sid on stat
Steve French [Fri, 19 Jul 2019 06:30:07 +0000 (06:30 +0000)]
cifs: get mode bits from special sid on stat

When mounting with "modefromsid" retrieve mode bits from
special SID (S-1-5-88-3) on stat.  Subsequent patch will fix
setattr (chmod) to save mode bits in S-1-5-88-3-<mode>

Note that when an ACE matching S-1-5-88-3 is not found, we
default the mode to an approximation based on the owner, group
and everyone permissions (as with the "cifsacl" mount option).
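
The lookup described above can be sketched as a scan over the SIDs in an
ACL. The struct below is a simplified stand-in for a SID, and masking the
last sub-authority with 07777 is an assumption for illustration, not
something stated in the text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_SUBAUTH 15

/* Simplified SID: S-1-<authority>-<subauth[0]>-<subauth[1]>-... */
struct sid {
    uint8_t authority;
    uint8_t num_subauth;
    uint32_t subauth[MAX_SUBAUTH];
};

/* Return 1 and store the mode if sid has the form S-1-5-88-3-<mode>. */
int mode_from_sid(const struct sid *sid, uint32_t *mode)
{
    assert(sid != NULL && mode != NULL);
    if (sid->authority != 5 || sid->num_subauth != 3)
        return 0;
    if (sid->subauth[0] != 88 || sid->subauth[1] != 3)
        return 0;
    *mode = sid->subauth[2] & 07777; /* assumed: keep permission bits only */
    return 1;
}

/* Scan the ACE SIDs; use the fallback approximation if none matches. */
uint32_t mode_from_acl(const struct sid *aces, size_t count, uint32_t fallback)
{
    uint32_t mode;

    for (size_t i = 0; i < count; i++) {
        if (mode_from_sid(&aces[i], &mode))
            return mode;
    }
    return fallback;
}
```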

See e.g.
    https://docs.microsoft.com/en-us/previous-versions/windows/it-pro/windows-server-2008-R2-and-2008/hh509017(v=ws.10)

Signed-off-by: Steve French <[email protected]>
5 years agofs: cifs: cifsssmb: remove redundant assignment to variable ret
Colin Ian King [Tue, 23 Jul 2019 15:09:19 +0000 (16:09 +0100)]
fs: cifs: cifsssmb: remove redundant assignment to variable ret

The variable ret is being initialized however this is never read
and later it is being reassigned to a new value. The initialization
is redundant and hence can be removed.

Addresses-Coverity: ("Unused Value")
Signed-off-by: Colin Ian King <[email protected]>
Signed-off-by: Steve French <[email protected]>
5 years agocifs: fix a comment for the timeouts when sending echos
Ronnie Sahlberg [Wed, 24 Jul 2019 01:43:49 +0000 (11:43 +1000)]
cifs: fix a comment for the timeouts when sending echos

Clarify a trivial comment

Signed-off-by: Ronnie Sahlberg <[email protected]>
Signed-off-by: Steve French <[email protected]>
5 years agoIB/mlx5: Use the original address for the page during free_pages
Danit Goldberg [Mon, 16 Sep 2019 06:48:17 +0000 (09:48 +0300)]
IB/mlx5: Use the original address for the page during free_pages

The removal of 'buffer' in the patch below caused free_page() to use a
value that had been offset since the wqe pointer is adjusted while the
routine runs.

The current implementation of free_pages() rounds down to a pfn,
discarding the adjustment, but this is not the right way to use the
API. Preserve the initial value and use it for free_page().
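
The same rule applies to any allocator: release memory through the pointer
that was originally returned, not through a cursor that was advanced while
working. A userspace analogue using malloc()/free() (not the kernel page
allocator, and with a made-up function name) might look like this:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Fill a scratch buffer through a moving cursor and return the byte sum. */
unsigned long fill_and_sum(size_t len)
{
    unsigned char *start;
    unsigned char *cur;
    unsigned long sum = 0;

    assert(len > 0);
    start = malloc(len);
    if (start == NULL)
        return 0;

    cur = start; /* cursor that is adjusted while the routine runs */
    for (size_t i = 0; i < len; i++) {
        *cur = (unsigned char)i;
        sum += *cur;
        cur++;
    }

    /* cur no longer points at the allocation start; free the saved value. */
    free(start);
    return sum;
}
```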

Fixes: 0f51427bd097 ("RDMA/mlx5: Cleanup WQE page fault handler")
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Danit Goldberg <[email protected]>
Reviewed-by: Yishai Hadas <[email protected]>
Signed-off-by: Leon Romanovsky <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
5 years agoMerge tag 'core-process-v5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/braune...
Linus Torvalds [Mon, 16 Sep 2019 16:28:19 +0000 (09:28 -0700)]
Merge tag 'core-process-v5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux

Pull pidfd/waitid updates from Christian Brauner:
 "This contains two features and various tests.

  First, it adds support for waiting on process through pidfds by adding
  the P_PIDFD type to the waitid() syscall. This completes the basic
  functionality of the pidfd api (cf. [1]). In the meantime we also have
  a new addition to the userspace projects that make use of the pidfd
  api. The qt project was nice enough to send a mail pointing out that
  they have a pr up to switch to the pidfd api (cf. [2]).

  Second, this tag contains an extension to the waitid() syscall to make
  it possible to wait on the current process group in a race free manner
  (even though the actual problem is very unlikely) by specifying 0
  together with the P_PGID type. This extension traces back to a
  discussion on the glibc development mailing list.

  There are also a range of tests for the features above. Additionally,
  the test-suite which detected the pidfd-polling race we fixed in [3]
  is included in this tag"

[1] https://lwn.net/Articles/794707/
[2] https://codereview.qt-project.org/c/qt/qtbase/+/108456
[3] commit b191d6491be6 ("pidfd: fix a poll race when setting exit_state")
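
A hedged userspace sketch of waiting on a child through a pidfd follows. It
assumes the pidfd_open() syscall is reachable via syscall(2) under number
434 and that P_PIDFD has the value 3 when the C library headers do not
define them; the function name is made up. On kernels without either
feature it reaps the child with waitpid() and returns -1.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <signal.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434 /* assumption: unified syscall number */
#endif
#ifndef P_PIDFD
#define P_PIDFD 3          /* assumption: value used by the kernel UAPI */
#endif

/* Fork a child that exits with exit_code, then wait for it via a pidfd. */
int wait_child_via_pidfd(int exit_code)
{
    siginfo_t info = {0};
    pid_t pid;
    int pidfd;

    assert(exit_code >= 0 && exit_code <= 255);
    pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(exit_code);

    pidfd = (int)syscall(SYS_pidfd_open, pid, 0);
    if (pidfd < 0) {                 /* no pidfd_open(): reap normally */
        waitpid(pid, NULL, 0);
        return -1;
    }
    if (waitid((idtype_t)P_PIDFD, (id_t)pidfd, &info, WEXITED) < 0) {
        close(pidfd);                /* no P_PIDFD support: reap normally */
        waitpid(pid, NULL, 0);
        return -1;
    }
    close(pidfd);
    return info.si_status;           /* exit status of the child */
}
```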

* tag 'core-process-v5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
  waitid: Add support for waiting for the current process group
  tests: add pidfd poll tests
  tests: move common definitions and functions into pidfd.h
  pidfd: add pidfd_wait tests
  pidfd: add P_PIDFD to waitid()

5 years agof2fs: fix to add missing F2FS_IO_ALIGNED() condition
Chao Yu [Wed, 28 Aug 2019 09:33:38 +0000 (17:33 +0800)]
f2fs: fix to add missing F2FS_IO_ALIGNED() condition

In f2fs_allocate_data_block(), we will reset fio.retry for IO
alignment feature instead of IO serialization feature.

In addition, spread F2FS_IO_ALIGNED() to check IO alignment
feature status explicitly.

Signed-off-by: Chao Yu <[email protected]>
Signed-off-by: Jaegeuk Kim <[email protected]>
5 years agof2fs: fix to fallback to buffered IO in IO aligned mode
Chao Yu [Wed, 28 Aug 2019 09:33:37 +0000 (17:33 +0800)]
f2fs: fix to fallback to buffered IO in IO aligned mode

In LFS mode, we allow OPU for direct IO. However, we didn't consider
the IO alignment feature, so direct IO can trigger unaligned IO. Let's
just fall back to buffered IO to keep correct IO alignment semantics
in all places.

Fixes: f847c699cff3 ("f2fs: allow out-place-update for direct IO in LFS mode")
Signed-off-by: Chao Yu <[email protected]>
Signed-off-by: Jaegeuk Kim <[email protected]>
5 years agof2fs: fix to handle error path correctly in f2fs_map_blocks
Chao Yu [Wed, 28 Aug 2019 09:33:36 +0000 (17:33 +0800)]
f2fs: fix to handle error path correctly in f2fs_map_blocks

In f2fs_map_blocks(), we should bail out once __allocate_data_block()
fails.

Fixes: f847c699cff3 ("f2fs: allow out-place-update for direct IO in LFS mode")
Signed-off-by: Chao Yu <[email protected]>
Signed-off-by: Jaegeuk Kim <[email protected]>
5 years agof2fs: fix extent corrupotion during directIO in LFS mode
Chao Yu [Wed, 28 Aug 2019 09:33:35 +0000 (17:33 +0800)]
f2fs: fix extent corrupotion during directIO in LFS mode

In LFS mode, por_fsstress testcase reports a bug as below:

[ASSERT] (fsck_chk_inode_blk: 931)  --> ino: 0x12fe has wrong ext: [pgofs:142, blk:215424, len:16]

Since commit f847c699cff3 ("f2fs: allow out-place-update for direct
IO in LFS mode"), we allow OPU mode for direct IO. However, we
failed to update the extent cache in __allocate_data_block(), which
left the extent field inconsistent with the physical block address.
Fix it.

Fixes: f847c699cff3 ("f2fs: allow out-place-update for direct IO in LFS mode")
Signed-off-by: Chao Yu <[email protected]>
Signed-off-by: Jaegeuk Kim <[email protected]>
5 years agof2fs: check all the data segments against all node ones
Surbhi Palande [Fri, 23 Aug 2019 22:40:45 +0000 (15:40 -0700)]
f2fs: check all the data segments against all node ones

As a part of the sanity checking while mounting, distinct segment number
assignment to data and node segments is verified. Fix a small bug in
this verification between node and data segments: we need to check all
the data segments against all the node segments.
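
Checking every node segment against every data segment amounts to a full
nested loop. The sketch below uses plain arrays and a hypothetical function
name rather than the f2fs checkpoint structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Return true if no node segment number equals any data segment number. */
bool segments_are_distinct(const unsigned int *node_segno, size_t n_node,
                           const unsigned int *data_segno, size_t n_data)
{
    assert(node_segno != NULL && data_segno != NULL);
    for (size_t i = 0; i < n_node; i++) {
        /* The inner loop always starts at 0 so that no pair is skipped. */
        for (size_t j = 0; j < n_data; j++) {
            if (node_segno[i] == data_segno[j])
                return false;
        }
    }
    return true;
}
```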

Fixes: 042be0f849e5f ("f2fs: fix to do sanity check with current segment number")
Signed-off-by: Surbhi Palande <[email protected]>
Reviewed-by: Chao Yu <[email protected]>
Signed-off-by: Jaegeuk Kim <[email protected]>
5 years agof2fs: Add a small clarification to CONFIG_FS_F2FS_FS_SECURITY
Lockywolf [Sun, 25 Aug 2019 09:28:38 +0000 (17:28 +0800)]
f2fs: Add a small clarification to CONFIG_FS_F2FS_FS_SECURITY

Signed-off-by: Lockywolf <[email protected]>
Signed-off-by: Jaegeuk Kim <[email protected]>
5 years agof2fs: fix inode rwsem regression
Goldwyn Rodrigues [Wed, 11 Sep 2019 16:45:17 +0000 (11:45 -0500)]
f2fs: fix inode rwsem regression

This is similar to 942491c9e6d6 ("xfs: fix AIM7 regression")
Apparently our current rwsem code doesn't like doing the trylock, then
lock for real scheme.  So change our read/write methods to just do the
trylock for the RWF_NOWAIT case.

We don't need a check for IOCB_NOWAIT and !direct-IO because it
is checked in generic_write_checks().
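
The locking scheme described above (trylock only in the nowait case,
otherwise take the lock directly) can be sketched with a toy userspace
lock. The lock type and helpers here are illustrative stand-ins built on
C11 atomics, not the kernel rwsem API.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy exclusive lock standing in for the inode lock. */
struct toy_lock {
    atomic_flag held;
};

bool toy_trylock(struct toy_lock *l)
{
    return !atomic_flag_test_and_set(&l->held);
}

void toy_lock_blocking(struct toy_lock *l)
{
    while (atomic_flag_test_and_set(&l->held))
        ; /* spin until the holder releases the lock */
}

void toy_unlock(struct toy_lock *l)
{
    atomic_flag_clear(&l->held);
}

/* Take the lock for IO: never block when nowait is requested. */
int lock_for_io(struct toy_lock *l, bool nowait)
{
    assert(l != NULL);
    if (nowait) {
        if (!toy_trylock(l))
            return -EAGAIN;
    } else {
        toy_lock_blocking(l); /* no trylock attempt before the real lock */
    }
    return 0;
}
```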

Fixes: b91050a80cec ("f2fs: add nowait aio support")
Signed-off-by: Goldwyn Rodrigues <[email protected]>
Reviewed-by: Chao Yu <[email protected]>
Signed-off-by: Jaegeuk Kim <[email protected]>
5 years agof2fs: fix to avoid accessing uninitialized field of inode page in is_alive()
Chao Yu [Tue, 10 Sep 2019 01:14:16 +0000 (09:14 +0800)]
f2fs: fix to avoid accessing uninitialized field of inode page in is_alive()

If an inode is newly created, its inode page may not yet be synchronized
with the inode cache, so fields like .i_inline or .i_extra_isize could be
wrong. In the call path below, we may access such wrong fields, which
results in failing to migrate a valid target block.

Thread A                                        Thread B
- f2fs_create
 - f2fs_add_link
  - f2fs_add_dentry
   - f2fs_init_inode_metadata
    - f2fs_add_inline_entry
     - f2fs_new_inode_page
     - f2fs_put_page
     : inode page wasn't updated with inode cache
                                                - gc_data_segment
                                                 - is_alive
                                                  - f2fs_get_node_page
                                                  - datablock_addr
                                                   - offset_in_addr
                                                   : access uninitialized fields

Fixes: 7a2af766af15 ("f2fs: enhance on-disk inode structure scalability")
Signed-off-by: Chao Yu <[email protected]>
Signed-off-by: Jaegeuk Kim <[email protected]>
5 years agof2fs: avoid infinite GC loop due to stale atomic files
Jaegeuk Kim [Mon, 9 Sep 2019 12:10:59 +0000 (13:10 +0100)]
f2fs: avoid infinite GC loop due to stale atomic files

If committing atomic pages fails during f2fs_do_sync_file(), we can
end up with committed pages while atomic_file is still set, like:

- inmem:    0, atomic IO:    4 (Max.   10), volatile IO:    0 (Max.    0)

If GC selects this block, we can get an infinite loop like this:

f2fs_submit_page_bio: dev = (253,7), ino = 2, page_index = 0x2359a8, oldaddr = 0x2359a8, newaddr = 0x2359a8, rw = READ(), type = COLD_DATA
f2fs_submit_read_bio: dev = (253,7)/(253,7), rw = READ(), DATA, sector = 18533696, size = 4096
f2fs_get_victim: dev = (253,7), type = No TYPE, policy = (Foreground GC, LFS-mode, Greedy), victim = 4355, cost = 1, ofs_unit = 1, pre_victim_secno = 4355, prefree = 0, free = 234
f2fs_iget: dev = (253,7), ino = 6247, pino = 5845, i_mode = 0x81b0, i_size = 319488, i_nlink = 1, i_blocks = 624, i_advise = 0x2c
f2fs_submit_page_bio: dev = (253,7), ino = 2, page_index = 0x2359a8, oldaddr = 0x2359a8, newaddr = 0x2359a8, rw = READ(), type = COLD_DATA
f2fs_submit_read_bio: dev = (253,7)/(253,7), rw = READ(), DATA, sector = 18533696, size = 4096
f2fs_get_victim: dev = (253,7), type = No TYPE, policy = (Foreground GC, LFS-mode, Greedy), victim = 4355, cost = 1, ofs_unit = 1, pre_victim_secno = 4355, prefree = 0, free = 234
f2fs_iget: dev = (253,7), ino = 6247, pino = 5845, i_mode = 0x81b0, i_size = 319488, i_nlink = 1, i_blocks = 624, i_advise = 0x2c

In that moment, we can observe:

[Before]
Try to move 5084219 blocks (BG: 384508)
  - data blocks : 4962373 (274483)
  - node blocks : 121846 (110025)
Skipped : atomic write 4534686 (10)

[After]
Try to move 5088973 blocks (BG: 384508)
  - data blocks : 4967127 (274483)
  - node blocks : 121846 (110025)
Skipped : atomic write 4539440 (10)

So, refactor atomic_write flow like this:
1. start_atomic_write
 - add inmem_list and set atomic_file

2. write()
 - register it in inmem_pages

3. commit_atomic_write
 - if no error, f2fs_drop_inmem_pages()
 - f2fs_commit_inmem_pages() failed
   : __revoked_inmem_pages() was done
 - f2fs_do_sync_file failed
   : abort_atomic_write later

4. abort_atomic_write
 - f2fs_drop_inmem_pages

5. f2fs_drop_inmem_pages
 - clear atomic_file
 - remove inmem_list

Based on this change, when GC fails to move block in atomic_file,
f2fs_drop_inmem_pages_all() can call f2fs_drop_inmem_pages().

Reviewed-by: Chao Yu <[email protected]>
Signed-off-by: Jaegeuk Kim <[email protected]>
5 years agonet: phylink: clarify where phylink should be used
Russell King [Sat, 14 Sep 2019 09:44:04 +0000 (10:44 +0100)]
net: phylink: clarify where phylink should be used

Update the phylink documentation to make it clear that phylink is
designed to be used on the MAC facing side of the link, rather than
between a SFP and PHY.

Signed-off-by: Russell King <[email protected]>
Reviewed-by: Andrew Lunn <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
5 years agoMerge branch 'bnxt_en-error-recovery-follow-up-patches'
David S. Miller [Mon, 16 Sep 2019 14:44:28 +0000 (16:44 +0200)]
Merge branch 'bnxt_en-error-recovery-follow-up-patches'

Michael Chan says:

====================
bnxt_en: error recovery follow-up patches.

A follow-up patchset for the recently added health and error recovery
feature.  The first fix is to prevent .ndo_set_rx_mode() from proceeding
when reset is in progress.  The 2nd fix is for the firmware coredump
command.  The 3rd and 4th patches update the error recovery process
slightly to add a state that polls and waits for the firmware to be down.
====================

Signed-off-by: David S. Miller <[email protected]>
5 years agobnxt_en: Add a new BNXT_FW_RESET_STATE_POLL_FW_DOWN state.
Vasundhara Volam [Sat, 14 Sep 2019 04:01:41 +0000 (00:01 -0400)]
bnxt_en: Add a new BNXT_FW_RESET_STATE_POLL_FW_DOWN state.

This new state is required when firmware indicates that the error
recovery process requires polling for firmware state to be completely
down before initiating reset.  For example, firmware may take some
time to collect the crash dump before it is down and ready to be
reset.

Signed-off-by: Vasundhara Volam <[email protected]>
Signed-off-by: Michael Chan <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
5 years agobnxt_en: Update firmware interface spec. to 1.10.0.100.
Michael Chan [Sat, 14 Sep 2019 04:01:40 +0000 (00:01 -0400)]
bnxt_en: Update firmware interface spec. to 1.10.0.100.

Some error recovery updates to the spec., among other minor changes.

Signed-off-by: Michael Chan <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
5 years agobnxt_en: Increase timeout for HWRM_DBG_COREDUMP_XX commands
Vasundhara Volam [Sat, 14 Sep 2019 04:01:39 +0000 (00:01 -0400)]
bnxt_en: Increase timeout for HWRM_DBG_COREDUMP_XX commands

Firmware coredump messages take much longer than standard messages,
so increase the timeout accordingly.

Fixes: 6c5657d085ae ("bnxt_en: Add support for ethtool get dump.")
Signed-off-by: Vasundhara Volam <[email protected]>
Signed-off-by: Michael Chan <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
5 years agobnxt_en: Don't proceed in .ndo_set_rx_mode() when device is not in open state.
Michael Chan [Sat, 14 Sep 2019 04:01:38 +0000 (00:01 -0400)]
bnxt_en: Don't proceed in .ndo_set_rx_mode() when device is not in open state.

Check the BNXT_STATE_OPEN flag instead of netif_running() in
bnxt_set_rx_mode().  If the driver is going through any reset, such
as firmware reset or even TX timeout, it may not be ready to set the RX
mode and may crash.  The new rx mode settings will be picked up when
the device is opened again later.

Fixes: 230d1f0de754 ("bnxt_en: Handle firmware reset.")
Signed-off-by: Michael Chan <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
5 years agonull_blk: format pr_* logs with pr_fmt
André Almeida [Mon, 16 Sep 2019 14:07:59 +0000 (11:07 -0300)]
null_blk: format pr_* logs with pr_fmt

Instead of writing "null_blk: " at the beginning of each
pr_err/info/warn log message, format messages using pr_fmt() macro.
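
The pr_fmt() idea can be imitated in userspace: a macro prepends a fixed
prefix to every format string at compile time, so call sites no longer
spell it out. The snprintf-based wrapper and the message text below are
made up for illustration, and `##__VA_ARGS__` is a GNU extension.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Prefix applied to every message formatted through pr_to_buf(). */
#define pr_fmt(fmt) "null_blk: " fmt

/* Userspace stand-in for pr_err() and friends: format into a buffer. */
#define pr_to_buf(buf, size, fmt, ...) \
    snprintf((buf), (size), pr_fmt(fmt), ##__VA_ARGS__)

/* The call site passes only the message; the prefix comes from pr_fmt(). */
int format_invalid_queue_mode(char *buf, size_t size, int mode)
{
    assert(buf != NULL && size > 0);
    return pr_to_buf(buf, size, "invalid queue mode %d\n", mode);
}
```

Because adjacent string literals are concatenated by the compiler, the
prefix costs nothing at run time.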

Reviewed-by: Chaitanya Kulkarni <[email protected]>
Signed-off-by: André Almeida <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
5 years agonull_blk: match the type of parameter nr_devices
André Almeida [Mon, 16 Sep 2019 14:07:58 +0000 (11:07 -0300)]
null_blk: match the type of parameter nr_devices

Since the variable nr_devices is an unsigned int, the module_param()
should also use this type. Change the type so they can match.

Fixes: f7c4ce890dd2 ("null_blk: validate the number of devices")
Reviewed-by: Chaitanya Kulkarni <[email protected]>
Signed-off-by: André Almeida <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
5 years agonull_blk: do not fail the module load with zero devices
André Almeida [Mon, 16 Sep 2019 14:07:57 +0000 (11:07 -0300)]
null_blk: do not fail the module load with zero devices

The module load should fail only if there is something wrong with the
configuration or if an error prevents it from working properly. The module
should be loadable with (nr_devices == 0), since this will not
trigger errors or leave it in a malfunctioning state. Preventing loading
with zero devices also breaks applications that configure this module using
the configfs API. Remove the nr_devices check to fix this.

Fixes: f7c4ce890dd2 ("null_blk: validate the number of devices")
Reviewed-by: Chaitanya Kulkarni <[email protected]>
Signed-off-by: André Almeida <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
5 years agoARM: dts: dir685: Drop spi-cpol from the display
Linus Walleij [Sun, 15 Sep 2019 13:54:44 +0000 (15:54 +0200)]
ARM: dts: dir685: Drop spi-cpol from the display

The D-Link DIR-685 had its clock polarity set as active
low using the special SPI "spi-cpol" property.

This is not correct: the datasheet clearly states:
"Fix SCL to GND level when not in use" which is
indicative that this line is active high.

After a recent fix making the GPIO-based SPI driver
force the clock line de-asserted at the beginning of
each SPI transaction this reared its ugly head: now
de-asserted was taken to mean the line should be
driven high, but it should be driven low.

Fix this up in the DTS file and the display works again.

Link: https://lore.kernel.org/r/[email protected]
Cc: Mark Brown <[email protected]>
Fixes: 2922d1cc1696 ("spi: gpio: Add SPI_MASTER_GPIO_SS flag")
Signed-off-by: Linus Walleij <[email protected]>
Signed-off-by: Arnd Bergmann <[email protected]>
5 years agotcp: Add snd_wnd to TCP_INFO
Thomas Higdon [Fri, 13 Sep 2019 23:23:35 +0000 (23:23 +0000)]
tcp: Add snd_wnd to TCP_INFO

Neal Cardwell mentioned that snd_wnd would be useful for diagnosing TCP
performance problems --
> (1) Usually when we're diagnosing TCP performance problems, we do so
> from the sender, since the sender makes most of the
> performance-critical decisions (cwnd, pacing, TSO size, TSQ, etc).
> From the sender-side the thing that would be most useful is to see
> tp->snd_wnd, the receive window that the receiver has advertised to
> the sender.

The new field also serves as the additional __u32 that fills the
would-be hole caused by the addition of the tcpi_rcv_ooopack field.
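
The "would-be hole" can be illustrated with a minimal user-space sketch.
The two structs below are hypothetical stand-ins, not the real struct
tcp_info: they only assume a layout in which a 64-bit member is followed
by one or two 32-bit members. On ABIs where a 64-bit integer is 8-byte
aligned, a single trailing 32-bit field leaves 4 bytes of tail padding,
and a second 32-bit field occupies exactly that space.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical struct tail with one new 32-bit counter (assumption,
 * not the real struct tcp_info). */
struct info_one_u32 {
	uint64_t bytes_retrans; /* 64-bit member, 8-byte aligned on most 64-bit ABIs */
	uint32_t rcv_ooopack;   /* single 32-bit counter */
	/* the compiler may add 4 bytes of tail padding here */
};

/* Same hypothetical tail with a second 32-bit field appended. */
struct info_two_u32 {
	uint64_t bytes_retrans;
	uint32_t rcv_ooopack;
	uint32_t snd_wnd;       /* occupies the space that would be padding */
};

/* Bytes between the end of the last member and the end of the struct. */
static inline size_t tail_padding_one_u32(void)
{
	return sizeof(struct info_one_u32)
	       - offsetof(struct info_one_u32, rcv_ooopack)
	       - sizeof(uint32_t);
}

static inline size_t tail_padding_two_u32(void)
{
	return sizeof(struct info_two_u32)
	       - offsetof(struct info_two_u32, snd_wnd)
	       - sizeof(uint32_t);
}

/* 8 + 4 + 4 bytes: the two-field variant has no padding at all. */
static_assert(sizeof(struct info_two_u32) == 16, "no hole expected");
```

With 8-byte alignment both structs have the same size, so the second
field costs nothing in this sketch; tools such as pahole report the same
information for real kernel structures.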

Signed-off-by: Thomas Higdon <[email protected]>
Acked-by: Yuchung Cheng <[email protected]>
Acked-by: Neal Cardwell <[email protected]>
Acked-by: Soheil Hassas Yeganeh <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
5 years agotcp: Add TCP_INFO counter for packets received out-of-order
Thomas Higdon [Fri, 13 Sep 2019 23:23:34 +0000 (23:23 +0000)]
tcp: Add TCP_INFO counter for packets received out-of-order

For receive-heavy cases on the server-side, we want to track the
connection quality for individual client IPs. This counter, similar to
the existing system-wide TCPOFOQueue counter in /proc/net/netstat,
tracks out-of-order packet reception. Providing this counter in
TCP_INFO makes it possible to understand to what degree receive-heavy
sockets are experiencing out-of-order delivery and packet drops
indicating congestion.

Please note that this is similar to the counter in NetBSD TCP_INFO, and
has the same name.

Also note that we avoid increasing the size of the tcp_sock struct by
taking advantage of a hole.

Signed-off-by: Thomas Higdon <[email protected]>
Acked-by: Neal Cardwell <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
5 years agonet: mdio: switch to using gpiod_get_optional()
Dmitry Torokhov [Fri, 13 Sep 2019 22:55:47 +0000 (15:55 -0700)]
net: mdio: switch to using gpiod_get_optional()

The MDIO device reset line is optional and now that gpiod_get_optional()
returns a proper value when GPIO support is compiled out, there is no
reason to use fwnode_get_named_gpiod(), which I plan to hide away.

Let's switch to using more standard gpiod_get_optional() and
gpiod_set_consumer_name() to keep the nice "PHY reset" label.

Also, there is no reason to fetch the reset GPIO only when we have an
OF node; gpiolib can fetch GPIO data from other firmware as well.

Signed-off-by: Dmitry Torokhov <[email protected]>
Reviewed-by: Andrew Lunn <[email protected]>
Reviewed-by: Andy Shevchenko <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
5 years agodm: introduce DM_GET_TARGET_VERSION
Mikulas Patocka [Mon, 16 Sep 2019 09:55:42 +0000 (05:55 -0400)]
dm: introduce DM_GET_TARGET_VERSION

This commit introduces a new ioctl DM_GET_TARGET_VERSION. It will load a
target that is specified in the "name" entry in the parameter structure
and return its version.

This functionality is intended to be used by cryptsetup, so that it can
query kernel capabilities before activating the device.

Signed-off-by: Mikulas Patocka <[email protected]>
Signed-off-by: Mike Snitzer <[email protected]>
5 years agoMerge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
David S. Miller [Mon, 16 Sep 2019 14:02:03 +0000 (16:02 +0200)]
Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Daniel Borkmann says:

====================
pull-request: bpf-next 2019-09-16

The following pull-request contains BPF updates for your *net-next* tree.

The main changes are:

1) Now that the initial BPF backend for gcc has been merged upstream, enable
   BPF kselftest suite for bpf-gcc. Also fix a BE issue with access to
   bpf_sysctl.file_pos, from Ilya.

2) Follow-up fix for link-vmlinux.sh to remove bash-specific extensions
   related to recent work on exposing BTF info through sysfs, from Andrii.

3) AF_XDP zero copy fixes for i40e and ixgbe driver which caused umem
   headroom to be added twice, from Ciara.

4) Refactoring work to convert sock opt tests into test_progs framework
   in BPF kselftests, from Stanislav.

5) Fix a general protection fault in dev_map_hash_update_elem(), from Toke.

6) Cleanup to use BPF_PROG_RUN() macro in KCM, from Sami.
====================

Signed-off-by: David S. Miller <[email protected]>
5 years agoRDMA/bnxt_re: Fix spelling mistake "missin_resp" -> "missing_resp"
Colin Ian King [Wed, 11 Sep 2019 09:28:56 +0000 (10:28 +0100)]
RDMA/bnxt_re: Fix spelling mistake "missin_resp" -> "missing_resp"

There is a spelling mistake in a literal string, fix it.

Fixes: 89f81008baac ("RDMA/bnxt_re: expose detailed stats retrieved from HW")
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Colin Ian King <[email protected]>
Acked-by: Selvin Xavier <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
5 years agoRDMA/hns: Package operations of rq inline buffer into separate functions
Lijun Ou [Thu, 29 Aug 2019 08:41:42 +0000 (16:41 +0800)]
RDMA/hns: Package operations of rq inline buffer into separate functions

Package the code that allocates and frees the rq inline buffer in
hns_roce_create_qp_common() into separate functions to reduce its complexity.

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Lijun Ou <[email protected]>
Signed-off-by: Weihang Li <[email protected]>
Signed-off-by: Doug Ledford <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
5 years agoRDMA/hns: Optimize cmd init and mode selection for hip08
Yixian Liu [Thu, 29 Aug 2019 08:41:41 +0000 (16:41 +0800)]
RDMA/hns: Optimize cmd init and mode selection for hip08

There are two modes for the mailbox command (cmd) queue: event mode and
poll mode. Each mode uses its own semaphore, event_sem or poll_sem, to
guard against contention for the cmd queue resources. During
cmd init, both semaphores are initialized and poll mode is selected.
Thus, there is no need to up poll_sem again in cmd_use_polling.

Furthermore, there is no need to down the semaphore of the other mode
while switching mode. This patch decouples the switch between event mode
and poll mode of cmd.

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Yixian Liu <[email protected]>
Signed-off-by: Weihang Li <[email protected]>
Signed-off-by: Doug Ledford <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
5 years agoPCI: dwc: Add validation that PCIe core is set to correct mode
Jonathan Chocron [Thu, 12 Sep 2019 13:02:38 +0000 (16:02 +0300)]
PCI: dwc: Add validation that PCIe core is set to correct mode

Some PCIe controllers can be set to either Host or EP according to some
early boot FW. To make sure there is no discrepancy (e.g. FW configured
the port to EP mode while the DT specifies it as a host bridge or vice
versa), a check has been added for each mode.

Signed-off-by: Jonathan Chocron <[email protected]>
Signed-off-by: Lorenzo Pieralisi <[email protected]>
Reviewed-by: Andrew Murray <[email protected]>
Acked-by: Gustavo Pimentel <[email protected]>
5 years agoPCI: dwc: al: Add Amazon Annapurna Labs PCIe controller driver
Jonathan Chocron [Thu, 12 Sep 2019 13:02:37 +0000 (16:02 +0300)]
PCI: dwc: al: Add Amazon Annapurna Labs PCIe controller driver

This driver is DT based and utilizes the DesignWare APIs.

It allows using a smaller ECAM range for a larger bus range:
usually each bus uses 1MB of address space, but the driver can
reuse that space for a larger number of buses. This is achieved by using
a HW mechanism which allows changing the BUS part of the "final" outgoing
config transaction. There are 2 HW regs: one is basically a
bitmask determining which bus bits to take from the AXI transaction
itself, and the other holds the complementary part programmed by the
driver.
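
The bus-remapping idea can be modelled in a few lines of user-space C.
This is only a sketch under stated assumptions: the function and struct
names (ecam_offset, final_bus, split_bus, bus_split) are hypothetical and
do not come from the driver, and the two HW registers are modelled as
plain 8-bit values. The only fixed facts used are the standard ECAM
layout (bus in bits 27:20, device in 19:15, function in 14:12, register
in 11:0) and the mask-plus-complement scheme described above.

```c
#include <assert.h>
#include <stdint.h>

/* Standard ECAM offset: 1 MiB per bus, 32 devices, 8 functions, 4 KiB each. */
static inline uint32_t ecam_offset(uint8_t bus, uint8_t dev, uint8_t fn,
				   uint16_t reg)
{
	assert(dev < 32 && fn < 8 && reg < 4096);
	return ((uint32_t)bus << 20) | ((uint32_t)dev << 15) |
	       ((uint32_t)fn << 12) | reg;
}

/* Hypothetical model of the HW: bits set in axi_mask come from the bus
 * number encoded in the AXI address, the remaining bits come from the
 * value programmed by the driver. */
static inline uint8_t final_bus(uint8_t axi_bus, uint8_t axi_mask,
				uint8_t programmed)
{
	return (uint8_t)((axi_bus & axi_mask) | (programmed & ~axi_mask));
}

/* What a driver would have to set up before accessing target_bus. */
struct bus_split {
	uint8_t programmed; /* value for the "complementary part" register */
	uint8_t window_bus; /* bus index to use inside the small ECAM window */
};

static inline struct bus_split split_bus(uint8_t target_bus, uint8_t axi_mask)
{
	struct bus_split s;

	s.programmed = (uint8_t)(target_bus & ~axi_mask);
	s.window_bus = (uint8_t)(target_bus & axi_mask);
	return s;
}
```

For example, with a 2 MiB window (axi_mask of 0x01), reaching bus 5 in
this model means programming 4 into the complementary register and
accessing the window at ecam_offset(1, dev, fn, reg); the outgoing
transaction then carries bus number 5.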

All link initializations are handled by the boot FW.

Signed-off-by: Jonathan Chocron <[email protected]>
Signed-off-by: Lorenzo Pieralisi <[email protected]>
Reviewed-by: Gustavo Pimentel <[email protected]>
Reviewed-by: Andrew Murray <[email protected]>
5 years agodt-bindings: PCI: Add Amazon's Annapurna Labs PCIe host bridge binding
Jonathan Chocron [Thu, 12 Sep 2019 13:02:36 +0000 (16:02 +0300)]
dt-bindings: PCI: Add Amazon's Annapurna Labs PCIe host bridge binding

Document Amazon's Annapurna Labs PCIe host bridge.

Signed-off-by: Jonathan Chocron <[email protected]>
Signed-off-by: Lorenzo Pieralisi <[email protected]>
Reviewed-by: Andrew Murray <[email protected]>
Reviewed-by: Rob Herring <[email protected]>
5 years agoPCI: Add quirk to disable MSI-X support for Amazon's Annapurna Labs Root Port
Jonathan Chocron [Thu, 12 Sep 2019 13:00:42 +0000 (16:00 +0300)]
PCI: Add quirk to disable MSI-X support for Amazon's Annapurna Labs Root Port

The Root Port (identified by [1c36:0031]) doesn't support MSI-X. On some
platforms it is configured to not advertise the capability at all, while
on others it (mistakenly) does. This causes a panic during
initialization by the pcieport driver, since it tries to configure the
MSI-X capability. Specifically, when trying to access the MSI-X table
a "non-existing addr" exception occurs.

Example stacktrace snippet:

  SError Interrupt on CPU2, code 0xbf000000 -- SError
  CPU: 2 PID: 1 Comm: swapper/0 Not tainted 5.2.0-rc1-Jonny-14847-ge76f1d4a1828-dirty #33
  Hardware name: Annapurna Labs Alpine V3 EVP (DT)
  pstate: 80000005 (Nzcv daif -PAN -UAO)
  pc : __pci_enable_msix_range+0x4e4/0x608
  lr : __pci_enable_msix_range+0x498/0x608
  sp : ffffff80117db700
  x29: ffffff80117db700 x28: 0000000000000001
  x27: 0000000000000001 x26: 0000000000000000
  x25: ffffffd3e9d8c0b0 x24: 0000000000000000
  x23: 0000000000000000 x22: 0000000000000000
  x21: 0000000000000001 x20: 0000000000000000
  x19: ffffffd3e9d8c000 x18: ffffffffffffffff
  x17: 0000000000000000 x16: 0000000000000000
  x15: ffffff80116496c8 x14: ffffffd3e9844503
  x13: ffffffd3e9844502 x12: 0000000000000038
  x11: ffffffffffffff00 x10: 0000000000000040
  x9 : ffffff801165e270 x8 : ffffff801165e268
  x7 : 0000000000000002 x6 : 00000000000000b2
  x5 : ffffffd3e9d8c2c0 x4 : 0000000000000000
  x3 : 0000000000000000 x2 : 0000000000000000
  x1 : 0000000000000000 x0 : ffffffd3e9844680
  Kernel panic - not syncing: Asynchronous SError Interrupt
  CPU: 2 PID: 1 Comm: swapper/0 Not tainted 5.2.0-rc1-Jonny-14847-ge76f1d4a1828-dirty #33
  Hardware name: Annapurna Labs Alpine V3 EVP (DT)
  Call trace:
   dump_backtrace+0x0/0x140
   show_stack+0x14/0x20
   dump_stack+0xa8/0xcc
   panic+0x140/0x334
   nmi_panic+0x6c/0x70
   arm64_serror_panic+0x74/0x88
   __pte_error+0x0/0x28
   el1_error+0x84/0xf8
   __pci_enable_msix_range+0x4e4/0x608
   pci_alloc_irq_vectors_affinity+0xdc/0x150
   pcie_port_device_register+0x2b8/0x4e0
   pcie_portdrv_probe+0x34/0xf0

Notice that this quirk also disables MSI (which may work, but has not
been tested and has no current use case), since currently there is no
standard way to disable only MSI-X.

Signed-off-by: Jonathan Chocron <[email protected]>
Signed-off-by: Lorenzo Pieralisi <[email protected]>
Reviewed-by: Gustavo Pimentel <[email protected]>
Reviewed-by: Andrew Murray <[email protected]>
5 years agoPCI/VPD: Prevent VPD access for Amazon's Annapurna Labs Root Port
Jonathan Chocron [Thu, 12 Sep 2019 13:00:41 +0000 (16:00 +0300)]
PCI/VPD: Prevent VPD access for Amazon's Annapurna Labs Root Port

The Amazon Annapurna Labs PCIe Root Port exposes the VPD capability,
but there is no actual support for it.

Trying to access the VPD (for example, as part of lspci -vv or when
reading the vpd sysfs file) results in the following warning print:

  pcieport 0001:00:00.0: VPD access failed.  This is likely a firmware bug on this device.  Contact the card vendor for a firmware update

Signed-off-by: Jonathan Chocron <[email protected]>
Signed-off-by: Lorenzo Pieralisi <[email protected]>
Reviewed-by: Gustavo Pimentel <[email protected]>
Reviewed-by: Andrew Murray <[email protected]>
Acked-by: Bjorn Helgaas <[email protected]>