spi: bcm2835: Speed up RX-only DMA transfers by zero-filling TX FIFO
The BCM2835 SPI driver currently sets the SPI_CONTROLLER_MUST_TX flag.
When performing an RX-only transfer, this flag causes the SPI core to
allocate and DMA-map a dummy buffer which is copied to the TX FIFO.
The dummy buffer is necessary because the chip is not capable of
automatically clocking out null bytes.
Avoid the overhead induced by the dummy buffer by preallocating a
reusable DMA transaction which fills the TX FIFO by cyclically copying
from the zero page. The transaction requires very little CPU time to
submit and generates no interrupts while running. Specifics are
provided in kerneldoc comments.
[Nathan Chancellor contributed a DMA mapping fixup for an early version
of this commit, hence his Signed-off-by.]
spi: bcm2835: Speed up TX-only DMA transfers by clearing RX FIFO
The BCM2835 SPI driver currently sets the SPI_CONTROLLER_MUST_RX flag.
When performing a TX-only transfer, this flag causes the SPI core to
allocate and DMA-map a dummy buffer into which the RX FIFO contents are
copied. The dummy buffer is necessary because the chip is not capable
of disabling the receiver or automatically throwing away received data.
Not reading the RX FIFO isn't an option either since transmission is
halted once it's full.
Avoid the overhead induced by the dummy buffer by preallocating a
reusable DMA transaction which cyclically clears the RX FIFO. The
transaction requires very little CPU time to submit and generates no
interrupts while running. Specifics are provided in kerneldoc comments.
With a ks8851 Ethernet chip attached to the SPI controller, I am seeing
a 30 us reduction in ping time with this commit (1.819 ms vs. 1.849 ms,
average of 100,000 packets) as well as a 2% reduction in CPU time
(75:08 vs. 76:39 for transmission of 5 GByte over the SPI bus).
The commit uses the TX DMA interrupt to signal completion of a transfer.
This interrupt is raised once all bytes have been written to the
TX FIFO and it is then necessary to busy-wait for the TX FIFO to become
empty before the transfer can be finalized. As an alternative approach,
I have explored using the SPI controller's DONE interrupt to detect
completion. This interrupt is signaled when the TX FIFO becomes empty,
avoiding the need to busy-wait. However latency deteriorates compared
to the present commit and surprisingly, CPU time is slightly higher as
well:
It turns out that in 45% of the cases, no busy-waiting is needed at all
and in 76% of the cases, less than 10 busy-wait iterations are
sufficient for the TX FIFO to drain. This was measured on an RT kernel.
On a vanilla kernel, wakeup latency is worse and thus fewer iterations
are needed. The measurements were made with an SPI clock of 20 MHz,
they may differ slightly for slower or faster clock speeds.
Previously we always used the RX DMA interrupt to signal completion of a
transfer. Using the TX DMA interrupt now introduces a race condition:
TX DMA is always started before RX DMA so that bytes are already clocked
out while RX DMA is still being set up. But if a TX-only transfer is
very short, then the TX DMA interrupt may occur before RX DMA is set up.
If the interrupt happens to occur on the same CPU, setup of RX DMA may
even be delayed until after the interrupt was handled.
I've solved this by having the TX DMA callback clear the RX FIFO while
busy-waiting for the TX FIFO to drain, thus avoiding a dependency on
setup of RX DMA. Additionally, I am using a lock-free mechanism with
two flags, tx_dma_active and rx_dma_active plus memory barriers to
terminate RX DMA either by the TX DMA callback or immediately after
setting it up, whichever wins the race. I've explored an alternative
approach which temporarily disables the TX DMA callback until RX DMA
has been set up (using tasklet_disable(), local_bh_disable() or
local_irq_save()), but the performance was minimally worse.
[Nathan Chancellor contributed a DMA mapping fixup for an early version
of this commit, hence his Signed-off-by.]
dmaengine: bcm2835: Avoid accessing memory when copying zeroes
The BCM2835 DMA controller is capable of synthesizing zeroes instead of
copying them from a source address. The feature is enabled by setting
the SRC_IGNORE bit in the Transfer Information field of a Control Block:
"Do not perform source reads.
In addition, destination writes will zero all the write strobes.
This is used for fast cache fill operations."
https://www.raspberrypi.org/app/uploads/2012/02/BCM2835-ARM-Peripherals.pdf
The feature is only available on 8 of the 16 channels. The others are
so-called "lite" channels with a limited feature set and performance.
Enable the feature if a cyclic transaction copies from the zero page.
This reduces traffic on the memory bus.
A forthcoming use case is the BCM2835 SPI driver, which will cyclically
copy from the zero page to the TX FIFO. The idea to use SRC_IGNORE was
taken from an ancient GitHub conversation between Martin and Noralf:
https://github.com/msperl/spi-bcm2835/issues/13#issuecomment-98180451
spi: bcm2835: Cache CS register value for ->prepare_message()
The BCM2835 SPI driver needs to set up the clock polarity in its
->prepare_message() hook before spi_transfer_one_message() asserts chip
select to avoid a gratuitous clock signal edge (cf. commit acace73df2c1
("spi: bcm2835: set up spi-mode before asserting cs-gpio")).
Precalculate the CS register value (which selects the clock polarity)
once in ->setup() and use that cached value in ->prepare_message() and
->transfer_one(). This avoids one MMIO read per message and one per
transfer, yielding a small latency improvement. Additionally, a
forthcoming commit will use the precalculated value to derive the
register value for clearing the RX FIFO, which will eliminate the need
for an RX dummy buffer when performing TX-only DMA transfers.
spi: Guarantee cacheline alignment of driver-private data
__spi_alloc_controller() uses a single allocation to accommodate struct
spi_controller and the driver-private data, but places the latter behind
the former. This order does not guarantee cacheline alignment of the
driver-private data. (It does guarantee cacheline alignment of struct
spi_controller but the structure doesn't make any use of that property.)
Round up struct spi_controller to cacheline size. A forthcoming commit
leverages this to grant DMA access to driver-private data of the BCM2835
SPI master.
An alternative, less economical approach would be to use two allocations.
A third approach consists of reversing the order to conserve memory.
But Mark Brown is concerned that it may result in a performance penalty
on architectures that don't like unaligned accesses.
The DMA engine API requires DMA drivers to explicitly allow that
descriptors are prepared once and reused multiple times. Only a
single driver makes use of this functionality so far (pxa_dma.c,
to speed up pxa_camera.c).
We're about to add another use case for reusable descriptors in
the BCM2835 SPI driver, so allow that in the BCM2835 DMA driver.
dmaengine: bcm2835: Allow cyclic transactions without interrupt
The BCM2835 DMA driver currently requests an interrupt from the
controller regardless whether or not the client has passed in the
DMA_PREP_INTERRUPT flag. This causes unnecessary overhead for cyclic
transactions which do not need an interrupt after each period.
We're about to add such a use case, namely cyclic clearing of the SPI
controller's RX FIFO, so amend the DMA driver to request an interrupt
only if DMA_PREP_INTERRUPT was passed in. Ignore the period_len for
such transactions and set it to the buffer length to make the driver's
calculations work.
The BCM2835 SPI driver uses a flag to keep track of whether a DMA
transfer is in progress.
The flag is used to avoid terminating DMA channels multiple times if a
transfer finishes orderly while simultaneously the SPI core invokes the
->handle_err() callback because the transfer took too long. However
terminating DMA channels multiple times is perfectly fine, so the flag
is unnecessary for this particular purpose.
The flag is also used to avoid invoking bcm2835_spi_undo_prologue()
multiple times under this race condition. However multiple *concurrent*
invocations can no longer happen since commit 2527704d8411 ("spi:
bcm2835: Synchronize with callback on DMA termination") because the
->handle_err() callback now uses the _sync() variant when terminating
DMA channels.
The only raison d'être of the flag is therefore that
bcm2835_spi_undo_prologue() cannot cope with multiple *sequential*
invocations. Achieve that by setting tx_prologue to 0 at the end of
the function. Subsequent invocations thus become no-ops.
With that, the dma_pending flag becomes unnecessary, so drop it.
Vladimir Oltean [Thu, 5 Sep 2019 01:01:11 +0000 (04:01 +0300)]
spi: Use an abbreviated pointer to ctlr->cur_msg in __spi_pump_messages
This helps a bit with line fitting now (the list_first_entry call) as
well as during the next patch which needs to iterate through all
transfers of ctlr->cur_msg so it timestamps them.
spi: npcm-fiu: remove set but not used variable 'retlen'
drivers/spi/spi-npcm-fiu.c: In function npcm_fiu_read:
drivers/spi/spi-npcm-fiu.c:472:9: warning:
variable retlen set but not used [-Wunused-but-set-variable]
Vladimir Oltean [Tue, 3 Sep 2019 10:57:08 +0000 (13:57 +0300)]
spi: spi-fsl-dspi: Fix race condition in TCFQ/EOQ interrupt
When the driver is working in TCFQ/EOQ mode (i.e. interacts with the SPI
controller's FIFOs directly) the following sequence of operations
happens:
- The first byte of the tx buffer gets pushed to the TX FIFO (dspi->len
gets decremented). This triggers the train of interrupts that handle
the rest of the bytes.
- The dspi_interrupt handles a TX confirmation event. It reads the newly
available byte from the RX FIFO, checks the dspi->len exit condition,
and if there's more to be done, it kicks off the next interrupt in the
train by writing the next byte to the TX FIFO.
Now the problem is that the wait queue is woken up one byte too early,
because dspi->len becomes 0 as soon as the byte has been pushed into the
TX FIFO. Its interrupt has not yet been processed and the RX byte has
not been put from the FIFO into the buffer.
Depending on the timing of the wait queue wakeup vs the handling of the
last dspi_interrupt, it can happen that the main SPI message pump thread
has already returned back into the spi_device driver. When the rx buffer
is on stack (which it can be, because in this mode, the DSPI doesn't do
DMA), the last interrupt will perform a memory write into an rx buffer
that has been freed. This manifests as stack corruption.
The solution is to only wake up the wait queue when dspi_rxtx says so,
i.e. after it has processed the last TX confirmation interrupt and
collected the last RX byte.
Introduce new polling mode for short size transfer. Either the estimated
transfer time is estimated to exceed 200us, or polling loop actually exceeds
200us, it switches to irq mode.
The actual device name of the SPI controller being registered on EP93xx is
"spi0" (as seen by gpiod_find_lookup_table()). This patch fixes all
relevant lookup tables and the following failure (seen on EDB9302):
ep93xx-spi ep93xx-spi.0: failed to register SPI master
ep93xx-spi: probe of ep93xx-spi.0 failed with error -22
The FIU supports single, dual or quad communication interface.
the FIU controller can operate in following modes:
- User Mode Access(UMA): provides flash access by using an
indirect address/data mechanism.
- direct rd/wr mode: maps the flash memory into the core
address space.
- SPI-X mode: used for an expansion bus to an ASIC or CPLD.
Linus Walleij [Sun, 4 Aug 2019 00:38:52 +0000 (02:38 +0200)]
spi: bcm2835: Convert to use CS GPIO descriptors
This converts the BCM2835 SPI master driver to use GPIO
descriptors for chip select handling.
The BCM2835 driver was relying on the core to drive the
CS high/low so very small changes were needed for this
part. If it managed to request the CS from the device tree
node, all is pretty straight forward.
However for native GPIOs this driver has a quite unorthodox
loopback to request some GPIOs from the SoC GPIO chip by
looking it up from the device tree using gpiochip_find()
and then offseting hard into its numberspace. This has
been augmented a bit by using gpiochip_request_own_desc()
but this code really needs to be verified. If "native CS"
is actually an SoC GPIO, why is it even done this way?
Should this GPIO not just be defined in the device tree
like any other CS GPIO? I'm confused.
Linus Walleij [Sun, 4 Aug 2019 00:35:39 +0000 (02:35 +0200)]
spi: fsl: Convert to use CS GPIO descriptors
This converts the Freescale SPI master driver to use GPIO
descriptors for chip select handling.
The Freescale (fsl) driver has a lot of quirks to look up
"gpios" rather than "cs-gpios" from the device tree.
After the prior patch that will make gpiolib return the
GPIO descriptor for "gpios" in response to a request for
"cs-gpios", this code can be cut down quite a bit.
The driver has custom handling of chip select rather
than using the core (which may be possible but not
done in this patch) so it still needs to refer directly
to spi->cs_gpiod to set the chip select.
Vladimir Oltean [Thu, 22 Aug 2019 21:15:13 +0000 (00:15 +0300)]
spi: spi-fsl-dspi: Use poll mode in case the platform IRQ is missing
On platforms like LS1021A which use TCFQ mode, an interrupt needs to be
processed after each byte is TXed/RXed. I tried to make the DSPI
implementation on this SoC operate in other, more efficient modes (EOQ,
DMA) but it looks like it simply isn't possible.
Therefore allow the driver to operate in poll mode, to ease a bit of
this absurd amount of IRQ load generated in TCFQ mode. Doing so reduces
both the net time it takes to transmit a SPI message, as well as the
inter-frame jitter that occurs while doing so.
Vladimir Oltean [Thu, 22 Aug 2019 21:15:12 +0000 (00:15 +0300)]
spi: spi-fsl-dspi: Remove impossible to reach error check
dspi->devtype_data is under the total control of the driver. Therefore,
a bad value is a driver bug and checking it at runtime (and during an
ISR, at that!) is pointless.
The second "else if" check is only for clarity (instead of a broader
"else") in case other transfer modes are added in the future. But the
printing is dead code and can be removed.
Vladimir Oltean [Thu, 22 Aug 2019 21:15:11 +0000 (00:15 +0300)]
spi: spi-fsl-dspi: Exit the ISR with IRQ_NONE when it's not ours
The DSPI interrupt can be shared between two controllers at least on the
LX2160A. In that case, the driver for one controller might misbehave and
consume the other's interrupt. Fix this by actually checking if any of
the bits in the status register have been asserted.
Vladimir Oltean [Thu, 22 Aug 2019 21:15:10 +0000 (00:15 +0300)]
spi: spi-fsl-dspi: Reduce indentation level in dspi_interrupt
If the entire function depends on the SPI status register having the
interrupt bits asserted, then just check it and exit early if those bits
aren't set (such as in the case of the shared IRQ being triggered for
the other peripheral). Cosmetic patch.
Vladimir Oltean [Thu, 22 Aug 2019 21:24:50 +0000 (00:24 +0300)]
spi: spi-fsl-dspi: Exit the ISR with IRQ_NONE when it's not ours
The DSPI interrupt can be shared between two controllers at least on the
LX2160A. In that case, the driver for one controller might misbehave and
consume the other's interrupt. Fix this by actually checking if any of
the bits in the status register have been asserted.
Ashish Kumar [Tue, 13 Aug 2019 10:23:09 +0000 (15:53 +0530)]
spi: spi-fsl-qspi: Add ls2080a compatibility string to bindings
There are 2 version of QSPI-IP, according to which controller registers sets
can be big endian or little endian.There are some other minor changes like
RX fifo depth etc.
The big endian version uses driver compatible "fsl,ls1021a-qspi" and
little endian version uses driver compatible "fsl,ls2080a-qspi"
The two functions are loosely coupled through dspi->waitq, but
logically, dspi_transfer_one_message depends on dspi_interrupt in order
to complete. Move its definition above it so the I/O functions are
grouped closer together.
Vladimir Oltean [Sun, 18 Aug 2019 18:01:09 +0000 (21:01 +0300)]
spi: spi-fsl-dspi: Remove pointless assignment of master->transfer to NULL
Introduced in commit 9298bc727385 ("spi: spi-fsl-dspi: Remove
spi-bitbang") for less than obvious reasons, this assignment is
confusing and serves no purpose.
Vladimir Oltean [Sun, 18 Aug 2019 18:01:06 +0000 (21:01 +0300)]
spi: spi-fsl-dspi: Change usage pattern of SPI_MCR_* and SPI_CTAR_* macros
These are macros that accept 0 or 1 as argument (a boolean value). Their
use encourages the abuse of complex ternary operations inside their
argument list, which detracts from the code readability. Replace these
with simple if-else statements.
Jarkko Nikula [Mon, 12 Aug 2019 10:13:44 +0000 (13:13 +0300)]
spi: dw-pci: Add support for Intel Elkhart Lake PSE SPI
Add support for Intel(R) Programmable Services Engine (Intel(R) PSE) SPI
controller in Intel Elkhart Lake when interface is assigned to the host
processor.
Linus Walleij [Thu, 8 Aug 2019 15:03:21 +0000 (17:03 +0200)]
spi: Rename of_spi_register_master() function
Rename this function to of_spi_get_gpio_numbers() as this
is what the function does, it does not register a master,
it is called in the path of registering a master so the
name is logical in a convoluted way, but it is better to
follow Rusty Russell's ABI level no 7:
"The obvious use is (probably) the correct one"
spi: bcm-qspi: Fix BSPI QUAD and DUAL mode support when using flex mode
Fix data transfer width settings based on DT field 'spi-rx-bus-width'
to configure BSPI in single, dual or quad mode by using data width
and not the command width.
spi: atmel: add tracing to custom .transfer_one_message callback
Driver specific implementations for .transfer_one_message need to call
the tracing stuff themself. This is necessary to make spi tracing
actually useful.
Stephen Boyd [Tue, 30 Jul 2019 18:15:41 +0000 (11:15 -0700)]
spi: Remove dev_err() usage after platform_get_irq()
We don't need dev_err() messages when platform_get_irq() fails now that
platform_get_irq() prints an error message itself when something goes
wrong. Let's remove these prints with a simple semantic patch.
Peter Zijlstra [Thu, 1 Aug 2019 11:13:53 +0000 (13:13 +0200)]
spi: Reduce kthread priority
The SPI thingies request FIFO-99 by default, reduce this to FIFO-50.
FIFO-99 is the very highest priority available to SCHED_FIFO and
it not a suitable default; it would indicate the SPI work is the
most important work on the machine.
Baolin Wang [Fri, 26 Jul 2019 07:20:52 +0000 (15:20 +0800)]
spi: sprd: adi: Change hwlock to be optional
Now Spreadtrum ADI controller supplies multiple master accessing channel
to support multiple subsystems accessing, instead of using a hardware
spinlock to synchronize between the multiple subsystems.
To keep backward compatibility, we should change the hardware spinlock
to be optional. Moreover change to use of_hwspin_lock_get_id() function
which return -ENOENT error number to indicate no hwlock support.
spi: sprd: adi: Add a reset reason for watchdog mode
When the system was rebooted by watchdog, now we did not save the watchdog
reset mode which will make system enter a incorrect mode after rebooting.
Thus we should set the watchdog reset mode as default when opening the
watchdog configuration, that means if the system was rebooted by other
reason through the restart_handler(), then we will clear the default
watchdog reset mode to save the correct reset mode.
Commit 6935224da248 ("spi: bcm2835: enable support of 3-wire mode")
added 3-wire support to the BCM2835 SPI driver by setting the REN bit
(Read Enable) in the CS register when receiving data. The REN bit puts
the transmitter in high-impedance state. The driver recognizes that
data is to be received by checking whether the rx_buf of a transfer is
non-NULL.
Commit 3ecd37edaa2a ("spi: bcm2835: enable dma modes for transfers
meeting certain conditions") subsequently broke 3-wire support because
it set the SPI_MASTER_MUST_RX flag which causes spi_map_msg() to replace
rx_buf with a dummy buffer if it is NULL. As a result, rx_buf is
*always* non-NULL if DMA is enabled.
Reinstate 3-wire support by not only checking whether rx_buf is non-NULL,
but also checking that it is not the dummy buffer.