Trond Myklebust [Mon, 23 Feb 2015 23:51:32 +0000 (18:51 -0500)]
NFS: Don't require a filehandle to refresh the inode in nfs_prime_dcache()
If the server does not return a valid set of attributes that we can
use to either create a file or refresh the inode, then there is no
value in calling nfs_prime_dcache().
However if we're just refreshing the inode using the attributes that
the server returned, then it shouldn't matter whether or not we have
a filehandle, as long as we check the fsid+fileid combination.
Trond Myklebust [Mon, 23 Feb 2015 21:15:00 +0000 (16:15 -0500)]
NFSv3: Use the readdir fileid as the mounted-on-fileid
When we call readdirplus, set the fileid normally returned by readdir
as the mounted-on-fileid, since that is commonly the case if there is
a mountpoint. To ensure that we get it right, we only set the flag if
the readdir fileid differs from the one returned in the readdirplus
attributes.
This again means that we can avoid the issues described in commit 2ef47eb1aee17 ("NFS: Fix use of nfs_attr_use_mounted_on_fileid()"),
which only fixed NFSv4.
Trond Myklebust [Sun, 22 Feb 2015 21:35:36 +0000 (16:35 -0500)]
NFS: Don't invalidate a submounted dentry in nfs_prime_dcache()
If we're traversing a directory which contains a submounted filesystem,
or one that has a referral, the NFS server that is processing the READDIR
request will often return information for the underlying (mounted-on)
directory. It may, or may not, also return filehandle information.
If this happens, and the lookup in nfs_prime_dcache() returns the
dentry for the submounted directory, the filehandle comparison will
fail, and we call d_invalidate(). Post-commit 8ed936b5671bf
("vfs: Lazily remove mounts on unlinked files and directories."), this
means the entire subtree is unmounted.
The following minimal patch addresses this problem by punting on
the invalidation if there is a submount.
Kudos to Neil Brown <[email protected]> for having tracked down this
issue (see link).
Trond Myklebust [Thu, 26 Feb 2015 22:57:14 +0000 (17:57 -0500)]
NFS: Remove size hack in nfs_inode_attrs_need_update()
Prior to this patch, we used to always OK attribute updates that extended
the file size on the assumption that we might be performing writeback.
Now that we have attribute barriers to protect the writeback related updates,
we should remove this hack, as it can cause truncate() operations to
apparently be reverted if/when a readahead or getattr RPC call races
with our on-the-wire SETATTR.
Trond Myklebust [Thu, 26 Feb 2015 21:09:04 +0000 (16:09 -0500)]
NFS: Add attribute update barriers to nfs_setattr_update_inode()
Ensure that other operations which raced with our setattr RPC call
cannot revert the file attribute changes that were made on the server.
To do so, we artificially bump the attribute generation counter on
the inode so that all calls to nfs_fattr_init() that precede ours
will be dropped.
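The barrier idea above can be sketched as a small userspace model. This is only an illustration under stated assumptions: the `model_*` names, the single global counter, and the plain `<=` comparison are hypothetical stand-ins, not the kernel's actual data structures or comparison logic.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace model; none of these names are the kernel's. */
static uint64_t model_gencount;           /* global attribute generation counter */

struct model_fattr { uint64_t gencount; uint64_t size; };
struct model_inode { uint64_t attr_gencount; uint64_t size; };

/* Stamp a reply buffer at the moment an RPC call is set up. */
static void model_fattr_init(struct model_fattr *f)
{
	f->gencount = ++model_gencount;
	f->size = 0;
}

/* Barrier: move the inode past every buffer that has been stamped so far. */
static void model_attr_barrier(struct model_inode *inode)
{
	inode->attr_gencount = ++model_gencount;
}

/* Apply an update only if its stamp is newer than the inode's generation. */
static bool model_refresh_inode(struct model_inode *inode,
				const struct model_fattr *f)
{
	if (f->gencount <= inode->attr_gencount)
		return false;             /* stale reply: drop it */
	inode->size = f->size;
	inode->attr_gencount = f->gencount;
	return true;
}
```

In this model, a readahead reply that was stamped before the setattr completed is dropped, while a reply stamped afterwards is still applied.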
The motivation for the patch came from Chuck Lever's reports of readaheads
racing with truncate operations and causing the file size to be reverted.
Trond Myklebust [Sat, 28 Feb 2015 03:54:19 +0000 (22:54 -0500)]
NFS: Ensure that buffered writes wait for O_DIRECT writes to complete
The O_DIRECT code will grab the inode->i_mutex and flush out buffered
writes, before scheduling a read or a write. However there is no
equivalent in the buffered write code to wait for O_DIRECT to complete.
Fixes a reported issue in xfstests generic/133, when first performing an
O_DIRECT write followed by a buffered write.
Set the internal device state to disabled after hardware reset in the stop flow.
This covers cases where the driver was not brought to the disabled state because
of an error, and in the stop flow we do not wish to retry the reset.
Ian Abbott [Fri, 27 Feb 2015 16:04:42 +0000 (16:04 +0000)]
staging: comedi: adv_pci1710: fix AI INSN_READ for non-zero channel
Reading of analog input channels by the `INSN_READ` comedi instruction
is broken for all except channel 0. `pci171x_ai_insn_read()` calls
`pci171x_ai_read_sample()` with the wrong value for the third parameter.
It is supposed to be the current index in a channel list (which is
always of length 1 in this case, so the index should be 0), but instead
it is passing the actual channel number. `pci171x_ai_read_sample()`
checks the channel number encoded in the raw sample value read from the
hardware matches the channel number stored in the specified index of the
previously set up channel list and returns `-ENODATA` if it doesn't
match. Since the index should always be 0 in this case, the match will
fail unless the channel number is also 0. Fix it by passing 0 as the
channel index.
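The index-versus-channel mix-up can be illustrated with a minimal userspace sketch. It assumes a hypothetical sample layout (channel number in the bits above a 12-bit data field); the `model_*` names and `MODEL_CHAN_SHIFT` are illustrative, not the driver's.

```c
#include <errno.h>

#define MODEL_CHAN_SHIFT 12   /* hypothetical: channel sits above 12 data bits */

/* Compare the sample's channel with chanlist[cur]; return data through *val. */
static int model_read_sample(const unsigned int *chanlist, unsigned int cur,
			     unsigned int raw, unsigned int *val)
{
	unsigned int chan = raw >> MODEL_CHAN_SHIFT;

	if (chan != chanlist[cur])
		return -ENODATA;
	*val = raw & 0x0fff;
	return 0;
}

/* Single-channel read: the list has exactly one entry, so the index is 0. */
static int model_insn_read(unsigned int chan, unsigned int raw,
			   unsigned int *val)
{
	unsigned int chanlist[1] = { chan };

	/* Passing 'chan' here instead of 0 would index past the list. */
	return model_read_sample(chanlist, 0, raw, val);
}
```

With the index fixed at 0, a read of channel 3 succeeds whenever the raw sample carries channel 3, and still reports -ENODATA on a genuine mismatch.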
Note that when the bug first appeared, it was `pci171x_ai_dropout()`
that was called with the wrong parameter value. `pci171x_ai_dropout()`
got replaced with `pci171x_ai_read_sample()` in commit 7fd2dae2500d
("staging: comedi: adv_pci1710: introduce pci171x_ai_read_sample()").
Andrey Ryabinin [Fri, 27 Feb 2015 17:44:21 +0000 (20:44 +0300)]
android: binder: fix binder mmap failures
binder_update_page_range() initializes only addr and size
fields in 'struct vm_struct tmp_area;' and passes it to
map_vm_area().
Before 71394fe50146 ("mm: vmalloc: add flag preventing guard hole allocation")
this was harmless because map_vm_area() didn't use any other fields
in vm_struct except addr and size.
Now get_vm_area_size() (used in map_vm_area()) reads vm_struct's
flags to determine whether vm area has guard hole or not.
binder_update_page_range() doesn't initialize the flags field, so
this causes the following binder mmap failures:
-----------[ cut here ]------------
WARNING: CPU: 0 PID: 1971 at mm/vmalloc.c:130
vmap_page_range_noflush+0x119/0x144()
CPU: 0 PID: 1971 Comm: healthd Not tainted 4.0.0-rc1-00399-g7da3fdc-dirty #157
Hardware name: ARM-Versatile Express
[<c001246d>] (unwind_backtrace) from [<c000f7f9>] (show_stack+0x11/0x14)
[<c000f7f9>] (show_stack) from [<c049a221>] (dump_stack+0x59/0x7c)
[<c049a221>] (dump_stack) from [<c001cf21>] (warn_slowpath_common+0x55/0x84)
[<c001cf21>] (warn_slowpath_common) from [<c001cfe3>]
(warn_slowpath_null+0x17/0x1c)
[<c001cfe3>] (warn_slowpath_null) from [<c00c66c5>]
(vmap_page_range_noflush+0x119/0x144)
[<c00c66c5>] (vmap_page_range_noflush) from [<c00c716b>] (map_vm_area+0x27/0x48)
[<c00c716b>] (map_vm_area) from [<c038ddaf>]
(binder_update_page_range+0x12f/0x27c)
[<c038ddaf>] (binder_update_page_range) from [<c038e857>]
(binder_mmap+0xbf/0x1ac)
[<c038e857>] (binder_mmap) from [<c00c2dc7>] (mmap_region+0x2eb/0x4d4)
[<c00c2dc7>] (mmap_region) from [<c00c3197>] (do_mmap_pgoff+0x1e7/0x250)
[<c00c3197>] (do_mmap_pgoff) from [<c00b35b5>] (vm_mmap_pgoff+0x45/0x60)
[<c00b35b5>] (vm_mmap_pgoff) from [<c00c1f39>] (SyS_mmap_pgoff+0x5d/0x80)
[<c00c1f39>] (SyS_mmap_pgoff) from [<c000ce81>] (ret_fast_syscall+0x1/0x5c)
---[ end trace 48c2c4b9a1349e54 ]---
binder: 1982: binder_alloc_buf failed to map page at f0e00000 in kernel
binder: binder_mmap: 1982 b6bde000-b6cdc000 alloc small buf failed -12
Use map_kernel_range_noflush() instead of map_vm_area(), as this is a better
API for binder's purposes and it allows us to get rid of 'vm_struct tmp_area' altogether.
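Why an uninitialized flags field matters can be shown with a small userspace model. The sketch assumes that the size helper subtracts one guard page unless a "no guard" flag is set; the `model_*` names, the page size, and the flag value are hypothetical.

```c
#include <stddef.h>

#define MODEL_PAGE_SIZE   4096UL
#define MODEL_VM_NO_GUARD 0x40UL   /* hypothetical flag value */

struct model_vm_struct {
	void *addr;
	size_t size;
	unsigned long flags;
};

/* Usable size: a trailing guard page is subtracted unless the flag is set. */
static size_t model_get_vm_area_size(const struct model_vm_struct *area)
{
	if (!(area->flags & MODEL_VM_NO_GUARD))
		return area->size - MODEL_PAGE_SIZE;
	return area->size;
}
```

If `flags` is left as stack garbage, the result depends on whether the garbage happens to contain the flag bit, so the same two-page area can be reported as one page or two.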
Linus Torvalds [Sun, 1 Mar 2015 20:22:44 +0000 (12:22 -0800)]
Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Ingo Molnar:
"A CR4-shadow 32-bit init fix, plus two typo fixes"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86: Init per-cpu shadow copy of CR4 on 32-bit CPUs too
x86/platform/intel-mid: Fix trivial printk message typo in intel_mid_arch_setup()
x86/cpu/intel: Fix trivial typo in intel_tlb_table[]
Linus Torvalds [Sun, 1 Mar 2015 19:56:13 +0000 (11:56 -0800)]
Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
"Two kprobes fixes and a handful of tooling fixes"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf tools: Make sparc64 arch point to sparc
perf symbols: Define EM_AARCH64 for older OSes
perf top: Fix SIGBUS on sparc64
perf tools: Fix probing for PERF_FLAG_FD_CLOEXEC flag
perf tools: Fix pthread_attr_setaffinity_np build error
perf tools: Define _GNU_SOURCE on pthread_attr_setaffinity_np feature check
perf bench: Fix order of arguments to memcpy_alloc_mem
kprobes/x86: Check for invalid ftrace location in __recover_probed_insn()
kprobes/x86: Use 5-byte NOP when the code might be modified by ftrace
David S. Miller [Sun, 1 Mar 2015 19:02:24 +0000 (14:02 -0500)]
Merge branch 'bcmgenet_systemport_stats'
Florian Fainelli says:
====================
net: bcmgenet and systemport statistics fixes
These two patches fix a similar problem in the GENET and SYSTEMPORT drivers
for software maintained statistics used to track DMA mapping and SKB
re-allocation failures.
====================
Commit 60b4ea1781fd ("net: systemport: log RX buffer allocation and RX/TX DMA
failures") added a few software maintained statistics using
BCM_SYSPORT_STAT_MIB_RX and BCM_SYSPORT_STAT_MIB_TX. These statistics are read
from the hardware MIB counters, such that bcm_sysport_update_mib_counters() was
trying to read from a non-existing MIB offset for these counters.
Fix this by introducing a special type: BCM_SYSPORT_STAT_SOFT, similar to
BCM_SYSPORT_STAT_NETDEV, such that bcm_sysport_get_ethtool_stats will read from
the software mib.
Fixes: 60b4ea1781fd ("net: systemport: log RX buffer allocation and RX/TX DMA failures")
Signed-off-by: Florian Fainelli <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
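The idea of a separate "soft" statistic type can be sketched as follows. This is a userspace model under assumptions: the `model_*` types, the array sizes, and the `hw_reads` counter (which stands in for register accesses) are all hypothetical and not taken from either driver.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model; the names below are not the drivers'. */
enum model_stat_type {
	MODEL_STAT_NETDEV,   /* taken from the net_device counters */
	MODEL_STAT_MIB_RX,   /* read from a hardware MIB register */
	MODEL_STAT_SOFT,     /* maintained by the driver in memory */
};

struct model_stat {
	enum model_stat_type type;
	size_t index;
};

struct model_priv {
	uint32_t hw_mib[4];      /* stands in for hardware MIB registers */
	uint32_t soft_mib[4];    /* software-maintained counters */
	unsigned int hw_reads;   /* counts simulated register reads */
};

/* Fetch one statistic; only hardware-backed entries touch the registers. */
static uint32_t model_get_stat(struct model_priv *priv,
			       const struct model_stat *s)
{
	switch (s->type) {
	case MODEL_STAT_MIB_RX:
		priv->hw_reads++;
		return priv->hw_mib[s->index];
	case MODEL_STAT_SOFT:
		return priv->soft_mib[s->index];
	default:
		return 0;   /* netdev counters are out of scope for this sketch */
	}
}
```

Tagging the software counters with their own type means the hardware refresh path never sees them, which is the property the fix relies on.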
Commit 44c8bc3ce39f ("net: bcmgenet: log RX buffer allocation and RX/TX dma
failures") added a few software maintained statistics using
BCMGENET_STAT_MIB_RX and BCMGENET_STAT_MIB_TX. These statistics are read from
the hardware MIB counters, such that bcmgenet_update_mib_counters() was trying
to read from a non-existing MIB offset for these counters.
Fix this by introducing a special type: BCMGENET_STAT_SOFT, similar to
BCMGENET_STAT_NETDEV, such that bcmgenet_get_ethtool_stats will read from the
software mib.
Fixes: 44c8bc3ce39f ("net: bcmgenet: log RX buffer allocation and RX/TX dma failures")
Signed-off-by: Florian Fainelli <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
Arvid Brodin [Fri, 27 Feb 2015 20:26:03 +0000 (21:26 +0100)]
net/hsr: Fix NULL pointer dereference and refcnt bugs when deleting a HSR interface.
To repeat:
$ sudo ip link del hsr0
BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
IP: [<ffffffff8187f495>] hsr_del_port+0x15/0xa0
etc...
Bug description:
As part of the hsr master device destruction, hsr_del_port() is called for each of
the hsr ports. At each such call, the master device is updated regarding features
and mtu. When the master device is freed before the slave interfaces, master will
be NULL in hsr_del_port(), which led to a NULL pointer dereference.
Additionally, dev_put() was called on the master device itself in hsr_del_port(),
causing a refcnt error.
A third bug in the same code path was that the rtnl lock was not taken before
hsr_del_port() was called as part of hsr_dev_destroy().
The reporter (Nicolas Dichtel) also said: "hsr_netdev_notify() supposes that the
port will always be available when the notification is for an hsr interface. It's
wrong. For example, netdev_wait_allrefs() may resend NETDEV_UNREGISTER.". As a
precaution against this, a check for port == NULL was added in hsr_dev_notify().
Reported-by: Nicolas Dichtel <[email protected]>
Fixes: 51f3c605318b056a ("net/hsr: Move slave init to hsr_slave.c.")
Signed-off-by: Arvid Brodin <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
Vaishali Thakkar [Fri, 27 Feb 2015 18:50:59 +0000 (00:20 +0530)]
net: pasemi: Use setup_timer and mod_timer
Use the timer API functions setup_timer and mod_timer instead
of structure assignments, as they are the standard way to set
the timer and to update the expire field of an active timer,
respectively.
This is done using Coccinelle; the semantic patch used for
this is as follows:
Vaishali Thakkar [Fri, 27 Feb 2015 18:42:34 +0000 (00:12 +0530)]
net: stmmac: Use setup_timer and mod_timer
Use the timer API functions setup_timer and mod_timer instead
of structure assignments, as they are the standard way to set
the timer and to update the expire field of an active timer,
respectively.
This is done using Coccinelle; the semantic patch used for
this is as follows:
Vaishali Thakkar [Fri, 27 Feb 2015 18:32:34 +0000 (00:02 +0530)]
net: 8390: axnet_cs: Use setup_timer and mod_timer
Use the timer API functions setup_timer and mod_timer instead
of structure assignments, as they are the standard way to set
the timer and to update the expire field of an active timer,
respectively.
This is done using Coccinelle; the semantic patch used for
this is as follows:
Vaishali Thakkar [Fri, 27 Feb 2015 18:23:03 +0000 (23:53 +0530)]
net: 8390: pcnet_cs: Use setup_timer and mod_timer
Use the timer API functions setup_timer and mod_timer instead
of structure assignments, as they are the standard way to set
the timer and to update the expire field of an active timer,
respectively.
This is done using Coccinelle; the semantic patch used for
this is as follows:
Vaishali Thakkar [Fri, 27 Feb 2015 18:10:24 +0000 (23:40 +0530)]
net: smc91c92_cs: Use setup_timer and mod_timer
Use the timer API functions setup_timer and mod_timer instead
of structure assignments, as they are the standard way to set
the timer and to update the expire field of an active timer,
respectively.
This is done using Coccinelle; the semantic patch used for
this is as follows:
Setting a dev_pm_ops suspend/resume pair but not a set of
hibernation functions means those pm functions will not be
called upon hibernation.
Fix this by using SIMPLE_DEV_PM_OPS, which appropriately
assigns the suspend and hibernation handlers, and move the
cpsw_suspend/resume callbacks under CONFIG_PM_SLEEP
to avoid build warnings.
Setting a dev_pm_ops suspend_late/resume_early pair but not a
set of hibernation functions means those pm functions will
not be called upon hibernation.
Fix this by using SET_LATE_SYSTEM_SLEEP_PM_OPS, which appropriately
assigns the suspend and hibernation handlers and move
davinci_mdio_x callbacks under CONFIG_PM_SLEEP to avoid build warnings.
For the received packet stream, the offset of 'RX_SEQ_START' is located after
the offset of 'RX_NUMBER_MIDI', although the current macros and proc output
use the wrong offsets.
Fortunately, this bug doesn't affect streaming functionality because
these macros are not used.
locking/rtmutex: Set state back to running on error
The "usual" path is:
- rt_mutex_slowlock()
- set_current_state()
- task_blocks_on_rt_mutex() (ret 0)
- __rt_mutex_slowlock()
- sleep or not but do return with __set_current_state(TASK_RUNNING)
- back to caller.
In the early error case where task_blocks_on_rt_mutex() returns
-EDEADLK we never change the task's state back to RUNNING. I
assume this is intended. Without this change after ww_mutex
using rt_mutex the selftest passes but later I get plenty of:
Eric Dumazet [Fri, 27 Feb 2015 17:42:50 +0000 (09:42 -0800)]
net: do not use rcu in rtnl_dump_ifinfo()
We made a failed attempt in the past to only use rcu in rtnl dump
operations (commit e67f88dd12f6 "net: dont hold rtnl mutex during
netlink dump callbacks")
Now that dumps are holding RTNL anyway, there is no need to also
use rcu locking, as it forbids any scheduling ability, like
GFP_KERNEL allocations that controlling path should use instead
of GFP_ATOMIC whenever possible.
This should fix the following splat Cong Wang reported:
[ INFO: suspicious RCU usage. ]
3.19.0+ #805 Tainted: G W
include/linux/rcupdate.h:538 Illegal context switch in RCU read-side critical section!
Commit 740c7f31c094703c ("sh_eth: Ensure DMA engines are stopped before
freeing buffers") added a call to sh_eth_reset() to the
sh_eth_set_ringparam() and sh_eth_close() paths.
However, setting the software reset bit(s) in the EDMR register resets
the MAC Address Registers to zero. Hence after kexec, the new kernel
doesn't detect a valid MAC address and assigns a random MAC address,
breaking DHCP.
Set the MAC address again after the reset in sh_eth_dev_exit() to fix
this.
Tested on r8a7740/armadillo (GETHER) and r8a7791/koelsch (FAST_RCAR).
Fixes: 740c7f31c094703c ("sh_eth: Ensure DMA engines are stopped before freeing buffers")
Signed-off-by: Geert Uytterhoeven <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
Jaedon Shin [Sat, 28 Feb 2015 02:48:26 +0000 (11:48 +0900)]
net: bcmgenet: fix throughtput regression
This patch adds bcmgenet_tx_poll for the tx_rings. This can reduce the
interrupt load and lets the network stack transmit on time. It also
separates the completion of tx_ring16 from bcmgenet_poll.
The interrupt-driven bcmgenet_tx_reclaim of tx_ring[{0,1,2,3}] reclaims
no more than a certain number of TxBDs, so transmitted skbs are
reclaimed too slowly. This degraded xmit performance after
605ad7f ("tcp: refine TSO autosizing").
David S. Miller [Sun, 1 Mar 2015 04:33:53 +0000 (23:33 -0500)]
Merge tag 'mac80211-for-davem-2015-02-27' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211
Johannes Berg says:
====================
A few patches have accumulated, among them the fix for Linus's
four-way-handshake problem. The others are various small fixes
for problems all over, nothing really stands out.
====================
cpuidle / sleep: Do sanity checks in cpuidle_enter_freeze() too
Modify cpuidle_enter_freeze() to do the sanity checks done by
cpuidle_select() to avoid crashing the suspend-to-idle code
path in case something is missing.
Fixes: 381063133246 (PM / sleep: Re-implement suspend-to-idle handling)
Original-by: Lorenzo Pieralisi <[email protected]>
Signed-off-by: Rafael J. Wysocki <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
idle / sleep: Avoid excessive disabling and enabling interrupts
Disabling interrupts at the end of cpuidle_enter_freeze() is not
useful, because its caller, cpuidle_idle_call(), re-enables them
right away after invoking it.
To avoid that unnecessary back and forth dance with interrupts,
make cpuidle_enter_freeze() enable interrupts after calling
enter_freeze_proper() and drop the local_irq_disable() at its
end, so that all of the code paths in it end up with interrupts
enabled. Then, cpuidle_idle_call() will not need to re-enable
interrupts after calling cpuidle_enter_freeze() any more, because
the latter will return with interrupts enabled, in analogy with
cpuidle_enter().
When applicable, verify that the caller has permission to the underlying
network namespace for a newly created network device.
Similar checks exist for the network namespace a network device will
be created in.
Fixes: 317f4810e45e ("rtnl: allow to create device with IFLA_LINK_NETNSID set")
Signed-off-by: "Eric W. Biederman" <[email protected]>
Acked-by: Nicolas Dichtel <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
When applicable, verify that the caller has permission to create a
network device in another network namespace. This check is already
present when moving a network device between network namespaces in
setlink so all that is needed is to duplicate that check in newlink.
This change almost backports cleanly, but there are context conflicts
as the code that follows was added in v4.0-rc1
Fixes: b51642f6d77b net: Enable a userns root rtnl calls that are safe for unprivilged users
Signed-off-by: "Eric W. Biederman" <[email protected]>
Acked-by: Nicolas Dichtel <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
drivers: net: cpsw: Set SECURE for dual_emac ucast
Prior to this patch, sending a packet with the source MAC address of one
of the CPSW interfaces to one of the CPSW slave ports while it's configured in
dual_emac mode would update the port_num field of the VLAN/Unicast Address
Table Entry. This would cause it to discard all incoming traffic addressed to
that MAC address, essentially rendering the port useless until the ALE table is
cleared (by starting and stopping the interface or rebooting.)
For example, if eth0 has a MAC address of 90:59:af:8f:43:e9 it will have
an ALE table entry:
If you configure another device with the same MAC address and connect it
to the first CPSW slave port and send some traffic the ALE table entry
becomes:
Linus Torvalds [Sat, 28 Feb 2015 18:36:48 +0000 (10:36 -0800)]
Merge branch 'drm-fixes' of git://people.freedesktop.org/~airlied/linux
Pull drm fixes from Dave Airlie:
"Just general fixes: radeon, i915, atmel, tegra, amdkfd and one core
fix"
* 'drm-fixes' of git://people.freedesktop.org/~airlied/linux: (28 commits)
drm: atmel-hlcdc: remove clock polarity from crtc driver
drm/radeon: only enable DP audio if the monitor supports it
drm/radeon: fix atom aux payload size check for writes (v2)
drm/radeon: fix 1 RB harvest config setup for TN/RL
drm/radeon: enable SRBM timeout interrupt on EG/NI
drm/radeon: enable SRBM timeout interrupt on SI
drm/radeon: enable SRBM timeout interrupt on CIK v2
drm/radeon: dump full IB if we hit a packet error
drm/radeon: disable mclk switching with 120hz+ monitors
drm/radeon: use drm_mode_vrefresh() rather than mode->vrefresh
drm/radeon: enable native backlight control on old macs
drm/i915: Fix frontbuffer false positve.
drm/i915: Align initial plane backing objects correctly
drm/i915: avoid processing spurious/shared interrupts in low-power states
drm/i915: Check obj->vma_list under the struct_mutex
drm/i915: Fix a use after free, and unbalanced refcounting
drm: atmel-hlcdc: remove useless pm_runtime_put_sync in probe
drm: atmel-hlcdc: reset layer A2Q and UPDATE bits when disabling it
drm: Fix deadlock due to getconnector locking changes
drm/i915: Dell Chromebook 11 has PWM backlight
...
Linus Torvalds [Sat, 28 Feb 2015 18:06:33 +0000 (10:06 -0800)]
Merge tag 'xfs-for-linus-4.0-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs
Pull xfs fixes from Dave Chinner:
"These are fixes for regressions/bugs introduced in the 4.0 merge cycle
and problems discovered during the merge window that need to be pushed
back to stable kernels ASAP.
This contains:
- ensure quota type is reset in on-disk dquots
- fix missing partial EOF block data flush on truncate extension
- fix transaction leak in error handling for new pnfs block layout
support
- add missing target_ip check to RENAME_EXCHANGE"
* tag 'xfs-for-linus-4.0-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs:
xfs: cancel failed transaction in xfs_fs_commit_blocks()
xfs: Ensure we have target_ip for RENAME_EXCHANGE
xfs: ensure truncate forces zeroed blocks to disk
xfs: Fix quota type in quota structures when reusing quota file
Dan Carpenter [Thu, 26 Feb 2015 16:56:56 +0000 (19:56 +0300)]
niu: fix error handling in niu_class_to_ethflow()
There is a discrepancy here because niu_class_to_ethflow() returns
zero on failure and one on success, but the caller expects zero on
success and negative on failure.
The problem means that we allow the user to pass classes and flow_types
which we don't want. I've looked at it a bit and I don't see it as a
very serious bug.
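One way to remove such a mismatch is to make the helper follow the caller's convention. The sketch below is a userspace illustration only: the `model_*` names and the class and flow constants are hypothetical, and it is not claimed to be how the actual patch resolves the discrepancy.

```c
#include <errno.h>

/* Hypothetical class and flow identifiers, for illustration only. */
enum { MODEL_CLASS_TCP = 8, MODEL_CLASS_UDP = 9 };
enum { MODEL_FLOW_TCP = 1, MODEL_FLOW_UDP = 2 };

/* Returns 0 and stores the flow type on success, -EINVAL otherwise. */
static int model_class_to_flow(int cls, int *flow)
{
	switch (cls) {
	case MODEL_CLASS_TCP:
		*flow = MODEL_FLOW_TCP;
		return 0;
	case MODEL_CLASS_UDP:
		*flow = MODEL_FLOW_UDP;
		return 0;
	default:
		return -EINVAL;
	}
}

/* Caller: zero means success, negative means failure. */
static int model_get_flow(int cls, int *flow)
{
	int ret = model_class_to_flow(cls, flow);

	if (ret < 0)
		return ret;   /* unknown classes are now rejected */
	return 0;
}
```

With both sides agreeing on "0 on success, negative errno on failure", an unknown class can no longer slip through as a success.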
Core mm expects __PAGETABLE_{PUD,PMD}_FOLDED to be defined if these page
table levels are folded. Usually, these defines are provided by
<asm-generic/pgtable-nopmd.h> and <asm-generic/pgtable-nopud.h>.
But some architectures fold page table levels in a custom way. They
need to define these macros themselves. This patch adds the missing defines.
The patch fixes mm->nr_pmds underflow and eliminates dead __pmd_alloc()
and __pud_alloc() on architectures without these page table levels.
Historically, !__GFP_FS allocations were not allowed to invoke the OOM
killer once reclaim had failed, but nevertheless kept looping in the
allocator.
Commit 9879de7373fc ("mm: page_alloc: embed OOM killing naturally into
allocation slowpath"), which should have been a simple cleanup patch,
accidentally changed the behavior to aborting the allocation at that
point. This creates problems with filesystem callers (?) that currently
rely on the allocator waiting for other tasks to intervene.
Revert the behavior as it shouldn't have been changed as part of a
cleanup patch.
Joonsoo Kim [Fri, 27 Feb 2015 23:52:01 +0000 (15:52 -0800)]
zram: use proper type to update max_used_pages
max_used_pages is defined as atomic_long_t, so we need to use unsigned
long to hold its temporary value rather than int, which is smaller
than unsigned long on a 64-bit system.
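The width mismatch can be illustrated with fixed-width types. In this sketch, `int32_t` stands in for a 32-bit int and `int64_t` for the 64-bit value of the counter; the `model_*` helpers are hypothetical and the real update is not a plain comparison like this one.

```c
#include <stdint.h>

/*
 * Hypothetical model: int32_t stands in for a 32-bit int and int64_t for
 * the 64-bit value held in an atomic_long_t counter.
 */

/* Returns nonzero if a page count would fit in a 32-bit int temporary. */
static int model_fits_in_int32(int64_t pages)
{
	return pages >= INT32_MIN && pages <= INT32_MAX;
}

/* Track the maximum using a temporary as wide as the counter itself. */
static int64_t model_update_max(int64_t max_used, int64_t pages)
{
	int64_t cur = max_used;

	if (pages > cur)
		cur = pages;
	return cur;
}
```

A count of three billion pages does not fit the narrow temporary, but passes through the wide one unchanged.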
Joshua Kinard [Fri, 27 Feb 2015 23:51:59 +0000 (15:51 -0800)]
drivers/rtc/rtc-ds1685.c: fix conditional in ds1685_rtc_sysfs_time_regs_{show,store}
Fix a conditional statement checking for NULL in both
ds1685_rtc_sysfs_time_regs_show and ds1685_rtc_sysfs_time_regs_store
that was using a logical AND when it should be using a logical OR so
that we fail out of the function properly if the condition ever
evaluates to true.
Fixes: aaaf5fbf56f1 ("rtc: add driver for DS1685 family of real time clocks")
Signed-off-by: Joshua Kinard <[email protected]>
Reported-by: Dan Carpenter <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
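The difference between the two operators in a NULL check can be shown in a few lines. The struct names, the function names, and the -ENODEV return value below are hypothetical; only the AND-versus-OR contrast is taken from the description above.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for the two pointers being validated. */
struct model_rtc { int unused; };
struct model_regs_info { int reg; };

/* Buggy form: bails out only when BOTH pointers are NULL. */
static int model_check_and(const struct model_rtc *rtc,
			   const struct model_regs_info *info)
{
	if (!rtc && !info)
		return -ENODEV;
	return 0;
}

/* Fixed form: bails out when EITHER pointer is NULL. */
static int model_check_or(const struct model_rtc *rtc,
			  const struct model_regs_info *info)
{
	if (!rtc || !info)
		return -ENODEV;
	return 0;
}
```

With one valid and one NULL pointer, the AND form reports success and the code after it would go on to dereference NULL; the OR form fails out as intended.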
Ryusuke Konishi [Fri, 27 Feb 2015 23:51:56 +0000 (15:51 -0800)]
nilfs2: fix potential memory overrun on inode
Each inode of nilfs2 stores a root node of a b-tree, and it turned out to
have a memory overrun issue:
Each b-tree node of nilfs2 stores a set of key-value pairs and the number
of them (in "bn_nchildren" member of nilfs_btree_node struct), as well as
a few other "bn_*" members.
Since the value of "bn_nchildren" is used for operations on the key-values
within the b-tree node, it can cause memory access overrun if a large
number is incorrectly set to "bn_nchildren".
For instance, nilfs_btree_node_lookup() function determines the range of
binary search with it, and too large "bn_nchildren" leads
nilfs_btree_node_get_key() in that function to overrun.
As for intermediate b-tree nodes, this is prevented by a sanity check
performed when each node is read from a drive, however, no sanity check
has been done for root nodes stored in inodes.
This patch fixes the issue by adding the missing sanity check for b-tree
root nodes, so that it is called when on-memory inodes are read from the
ifile, the inode metadata file.
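The kind of check described above can be sketched in userspace. The node layout, the capacity constant, and the `model_*` function names are hypothetical; the sketch only shows why the child count has to be validated before a lookup uses it as a search bound.

```c
#include <errno.h>
#include <stdint.h>

#define MODEL_ROOT_NCHILDREN_MAX 6   /* hypothetical capacity of a root node */

struct model_btree_node {
	uint16_t nchildren;
	uint64_t keys[MODEL_ROOT_NCHILDREN_MAX];
};

/* Sanity check to run when the root node is read in. */
static int model_read_root(const struct model_btree_node *n)
{
	if (n->nchildren > MODEL_ROOT_NCHILDREN_MAX)
		return -EINVAL;
	return 0;
}

/* Binary search bounded by nchildren; assumes the node passed the check. */
static int model_lookup(const struct model_btree_node *n, uint64_t key)
{
	int lo = 0, hi = (int)n->nchildren - 1;

	while (lo <= hi) {
		int mid = lo + (hi - lo) / 2;

		if (n->keys[mid] == key)
			return mid;
		if (n->keys[mid] < key)
			lo = mid + 1;
		else
			hi = mid - 1;
	}
	return -ENOENT;
}
```

Rejecting an oversized count at read time means the search never computes an index outside the key array.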
This got lost during the initial merge process: Python requires an
__init__.py script, even if empty, in order to accept a directory as
package. Add it, this time as a non-empty file.
rtc: ds1685: remove superfluous checks for out-of-range u8 values
drivers/rtc/rtc-ds1685.c: In function `ds1685_rtc_read_alarm':
drivers/rtc/rtc-ds1685.c:402: warning: comparison is always true due to limited range of data type
drivers/rtc/rtc-ds1685.c:409: warning: comparison is always true due to limited range of data type
drivers/rtc/rtc-ds1685.c:416: warning: comparison is always true due to limited range of data type
drivers/rtc/rtc-ds1685.c: In function `ds1685_rtc_set_alarm':
drivers/rtc/rtc-ds1685.c:475: warning: comparison is always true due to limited range of data type
drivers/rtc/rtc-ds1685.c:478: warning: comparison is always true due to limited range of data type
drivers/rtc/rtc-ds1685.c:481: warning: comparison is always true due to limited range of data type
u8 cannot contain a value larger than 0xff, hence drop the checks.
Wrapping the checks in unlikely() indicated some sense of humor, though ;-)
The newly added ds1685 driver causes a build error when enabled without
CONFIG_RTC_INTF_DEV:
drivers/rtc/rtc-ds1685.c:919:22: error: 'ds1685_rtc_alarm_irq_enable' undeclared here (not in a function)
.alarm_irq_enable = ds1685_rtc_alarm_irq_enable,
Apparently the driver was incorrectly changed to reflect the interface
change from 16380c153a69c ("RTC: Convert rtc drivers to use the
alarm_irq_enable method"), which removed the respective #ifdef from all
other rtc drivers.
This does the same change that was merged for the other drivers before and
removes the #ifdef, allowing the interrupts to be enabled through the
in-kernel rtc interface independent of the existence of /dev/rtc.
Michal Hocko [Fri, 27 Feb 2015 23:51:46 +0000 (15:51 -0800)]
memcg: fix low limit calculation
A memcg is considered low limited even when the current usage is equal to
the low limit. This leads to interesting side effects e.g.
groups/hierarchies with no memory accounted are considered protected and
so the reclaim will emit MEMCG_LOW event when encountering them.
Another and much bigger issue was reported by Joonsoo Kim. He has hit a
NULL ptr dereference with the legacy cgroup API, which doesn't even have
the low limit exposed. The limit is 0 by default but the initial check fails
for a memcg with 0 consumption, and parent_mem_cgroup() would return NULL if
use_hierarchy is 0, so page_counter_read would try to dereference NULL.
I suppose that the current implementation is just an oversight because the
documentation in Documentation/cgroups/unified-hierarchy.txt says:
"The memory.low boundary on the other hand is a top-down allocated
reserve. A cgroup enjoys reclaim protection when it and all its
ancestors are below their low boundaries"
Fix the usage and the low limit comparison in mem_cgroup_low accordingly.
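The documented rule, "protected only while it and all its ancestors are below their low boundaries", can be modelled as below. This is a simplified userspace sketch: `struct model_memcg` and `model_memcg_low()` are hypothetical, and the walk simply stops at a NULL parent rather than reproducing the kernel's handling of the root group.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a memory cgroup with usage and low boundary. */
struct model_memcg {
	unsigned long usage;
	unsigned long low;
	struct model_memcg *parent;   /* NULL at the top or without hierarchy */
};

/* Protected only if this group and every ancestor is strictly below low. */
static bool model_memcg_low(const struct model_memcg *memcg)
{
	for (; memcg; memcg = memcg->parent)
		if (memcg->usage >= memcg->low)
			return false;
	return true;
}
```

A group with zero usage and a zero low limit is not protected in this model, and the loop never touches a NULL parent.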
Joonsoo Kim [Fri, 27 Feb 2015 23:51:43 +0000 (15:51 -0800)]
mm/nommu: fix memory leak
Maxime reported the following memory leak regression due to commit dbc8358c7237 ("mm/nommu: use alloc_pages_exact() rather than its own
implementation").
On v3.19, I am facing a memory leak. Each time I run a command one page
is lost. Here an example with busybox's free command:
/ # free
total used free shared buffers cached
Mem: 7928 1972 5956 0 0 492
-/+ buffers/cache: 1480 6448
/ # free
total used free shared buffers cached
Mem: 7928 1976 5952 0 0 492
-/+ buffers/cache: 1484 6444
/ # free
total used free shared buffers cached
Mem: 7928 1980 5948 0 0 492
-/+ buffers/cache: 1488 6440
/ # free
total used free shared buffers cached
Mem: 7928 1984 5944 0 0 492
-/+ buffers/cache: 1492 6436
/ # free
total used free shared buffers cached
Mem: 7928 1988 5940 0 0 492
-/+ buffers/cache: 1496 6432
At some point, the system fails to satisfy 256KB allocations:
This problem happens because in some cases we allocate a high-order page
through __get_free_pages() in do_mmap_private() and then try to free
individual pages, rather than the high-order page, in free_page_series().
In this case, pages whose refcount is not 0 are not returned to the
page allocator, so memory leaks.
To fix the problem, this patch changes __get_free_pages() to
alloc_pages_exact(), since alloc_pages_exact() returns
physically-contiguous pages with each page refcounted.
The following patch updates the Ocfs2 documentation in MAINTAINERS,
ocfs2.txt, and dlmfs.txt. I added our new official web page, changed
the location of our tools git tree and removed the link to Joel's
ancient kernel git tree - Andrew has handled our patches for a while
now.
Arnd Bergmann [Wed, 25 Feb 2015 15:31:57 +0000 (16:31 +0100)]
net: smc91x: use run-time configuration on all ARM machines
The smc91x driver traditionally gets configured at compile-time
for whichever hardware it runs on. This no longer works on
ARM as we continue to move to building all-in-one kernels.
Most ARM configurations with this driver already use run-time
configuration through DT or through platform_data, but a
few have not been converted yet.
I've checked all ARM boards that use this driver in their
legacy board files, and converted the ones that were using
compile-time configuration in smc91x.h to behave like the
other ones and provide the interrupt polarity along with
the MMIO configuration (width, stride) at platform device
creation time.
In particular, these combinations were previously selectable
in Kconfig but in fact broken:
- sa1100 assabet plus pleb
- msm combined with any other armv6/v7 platform
- pxa-idp combined with any non-DMA pxa variant
- LogicPD PXA270 combined with any other pxa
- nomadik combined with any other armv4/v5 platform,
e.g. versatile.
None of these seem critical enough to warrant a backport
to stable, but it would be nice to clean this up for good.
Signed-off-by: Arnd Bergmann <[email protected]>
----
I would like the patch to get merged through netdev, after
Robert and/or Linus have verified it on at least some hardware.
There are a few other non-ARM platforms using this driver,
I could do the same patch for those if we want to take
it further.
During the attach of this driver a couple commands are sent to the hardware
with usb_bulk_msg() to read the firmware version information. This information
is then dumped as dev_info() kernel messages. These messages are just added
noise and don't affect the operation of the driver.
For simplicity, remove the messages as well as the then unused functions
vmk80xx_read_eeprom() and vmk80xx_check_data_link().
This also fixes an issue reported by coverity about an out-of-bounds write
in vmk80xx_read_eeprom().
staging: comedi: comedi_isadma: fix "stalled" detect in comedi_isadma_disable_on_sample()
The "stalled" variable in this function is used to detect if the DMA operation
is stalled while trying to disable DMA on a full comedi sample. The reset
of this variable should only occur when the remaining bytes of the DMA
transfer do not equal the remaining bytes from the last check.
Merge tag 'iio-fixes-for-4.0b' of git://git.kernel.org/pub/scm/linux/kernel/git/jic23/iio into staging-linus
Jonathan writes:
Second round of IIO fixes for the 4.0 cycle (or round one part two really!)
These are fixes for patches in the recent merge window and are in a separate
branch to avoid rebasing the main fixes-togreg branch.
* jsa1212 - select missing REGMAP_I2C
* ssp_common - build warning fix for PM functions when PM not in use.
* ak8975 - the addition of a utility library for this driver (as part of
adding new device support) led to a dependency not being enforced
for the original driver (I2C and GPIOLIB).
Merge tag 'iio-fixes-for-4.0a' of git://git.kernel.org/pub/scm/linux/kernel/git/jic23/iio into staging-linus
Jonathan writes:
First round of fixes for IIO in the 4.0 cycle. Note a followup
set dependent on patches in the recent merge window will follow shortly.
* dht11 - fix a read off the end of an array, add some locking to prevent
the read function being interrupted and make sure gpio/irq lines
are not enabled for irqs during output.
* iadc - timeout should be in jiffies not msecs
* mpu6050 - avoid a null id from ACPI enumeration being dereferenced.
* mxs-lradc - fix up some interaction issues between the touchscreen driver
and iio driver. Mostly about making sure that the adc driver
only affects channels that are not being used for the
touchscreen.
* ad2s1200 - sign extension fix for a result of c type promotion.
* adis16400 - sign extension fix for a result of c type promotion.
* mcp3422 - scale table was transposed.
* ad5686 - use _optional regulator get to avoid a dummy reg being allocated
which would cause the driver to fail to initialize.
* gp2ap020a00f - select REGMAP_I2C
* si7020 - revert an incorrect cleanup and then fix the issue that made
that cleanup seem like a good idea.
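The two sign-extension entries above (ad2s1200, adis16400) refer to a common C pitfall: operands narrower than int are promoted to int before a shift, so a shift-left/shift-right pair on a 16-bit variable does not sign-extend a narrower field. A minimal user-space sketch, assuming a 12-bit two's-complement reading; sign_extend_field() and broken_extend12() are hypothetical names for illustration (the former is modelled on the kernel's sign_extend32() helper) and neither is taken from the drivers:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical user-space stand-in for a sign-extension helper:
 * treat bit `index` of `value` as the sign bit and extend it to 32 bits.
 */
static int32_t sign_extend_field(uint32_t value, unsigned int index)
{
    uint32_t sign = (uint32_t)1 << index;
    uint32_t mask = (sign << 1) - 1; /* index == 31 wraps to all ones */

    assert(index < 32);
    value &= mask;
    /* Flip the sign bit, then subtract it: maps 0xFFF -> -1 for index 11. */
    return (int32_t)((value ^ sign) - sign);
}

/*
 * Broken pattern: `raw` is promoted to int before the shifts, so bit 11
 * never reaches the sign bit of the wider type and the two shifts cancel.
 * Only call this with non-negative input (left-shifting a negative int
 * is undefined behaviour in C).
 */
static int16_t broken_extend12(int16_t raw)
{
    return (int16_t)((raw << 4) >> 4);
}
```

With a raw 12-bit reading of 0x0FFF, the helper yields -1 while the shift pair leaves the value at 4095, which is the class of bug these entries describe.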
Arnd Bergmann [Wed, 28 Jan 2015 13:58:44 +0000 (14:58 +0100)]
iio: ak8975: fix AK09911 dependencies
ak8975 depends on I2C and GPIOLIB, so any symbol that selects
ak8975 must have the same dependency, or we get build errors:
drivers/iio/magnetometer/ak8975.c: In function 'ak8975_who_i_am':
drivers/iio/magnetometer/ak8975.c:393:2: error: implicit declaration of function 'i2c_smbus_read_i2c_block_data' [-Werror=implicit-function-declaration]
ret = i2c_smbus_read_i2c_block_data(client, AK09912_REG_WIA1,
^
drivers/iio/magnetometer/ak8975.c: In function 'ak8975_set_mode':
drivers/iio/magnetometer/ak8975.c:431:2: error: implicit declaration of function 'i2c_smbus_write_byte_data' [-Werror=implicit-function-declaration]
ret = i2c_smbus_write_byte_data(data->client,
Signed-off-by: Arnd Bergmann <[email protected]>
Fixes: 57e73a423b1e85 ("iio: ak8975: add ak09911 and ak09912 support")
Signed-off-by: Jonathan Cameron <[email protected]>
Steven Rostedt [Fri, 27 Feb 2015 19:50:19 +0000 (14:50 -0500)]
x86: Init per-cpu shadow copy of CR4 on 32-bit CPUs too
Commit:
1e02ce4cccdc ("x86: Store a per-cpu shadow copy of CR4")
added a shadow CR4 such that reads and writes that do not
modify the CR4 execute much faster than always reading the
register itself.
The change modified cpu_init() in common.c, so that the
shadow CR4 gets initialized before anything uses it.
Unfortunately, there are two cpu_init()s in common.c. There's
one for 64-bit and one for 32-bit. The commit only added
the shadow init to the 64-bit path, but the 32-bit path
needs the init too.
It is possible that _ART/_TRT tables are missing or have errors.
Ignore those failures, as INT3400 thermal zone is still required
for _OSC or mode switch.
Enable the Intel Powerclamp driver on the Atom* Processor C2000 Product
Family for Microservers (Avoton). Avoton, an SoC for micro-servers,
has package C-states which can be used for idle injection.
Brian Norris [Wed, 18 Feb 2015 02:18:36 +0000 (18:18 -0800)]
tools/thermal: tmon: silence 'set but not used' warnings
gcc complains about the 'cols' variable being unused. This is
unavoidable, given the ncurses getmaxyx() macro-based API, which wants
to assign to a variable directly, even when we're not going to use it.
Warning:
gcc -O1 -Wall -Wshadow -W -Wformat -Wimplicit-function-declaration -Wimplicit-int -fstack-protector -D VERSION=\"1.0\" -c -o tui.o tui.c
tui.c: In function ‘show_dialogue’:
tui.c:288:12: warning: variable ‘cols’ set but not used [-Wunused-but-set-variable]
int rows, cols;
^
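The warning above arises whenever a macro assigns to a variable that is never read afterwards. One common way to silence it is a cast to void; this is only a sketch of that general technique and is not necessarily what this commit does. To stay self-contained without linking ncurses, FAKE_GETMAXYX and struct fake_win below are hypothetical stand-ins for getmaxyx() and WINDOW:

```c
#include <assert.h>

/* Hypothetical stand-in for an ncurses WINDOW. */
struct fake_win {
    int maxy;
    int maxx;
};

/*
 * Hypothetical stand-in for getmaxyx(win, y, x): like the real macro,
 * it assigns to both output variables directly.
 */
#define FAKE_GETMAXYX(win, y, x) ((y) = (win)->maxy, (x) = (win)->maxx)

/* Only the row count is needed, but the macro forces us to pass `cols`. */
static int dialogue_rows(const struct fake_win *win)
{
    int rows, cols;

    assert(win != 0);
    FAKE_GETMAXYX(win, rows, cols);
    (void)cols; /* counts as a use, so gcc no longer warns */
    return rows;
}
```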
Brian Norris [Wed, 18 Feb 2015 02:18:35 +0000 (18:18 -0800)]
tools/thermal: tmon: use pkg-config to determine library dependencies
Some distros (e.g., Arch Linux) don't package the tinfo library
separately from ncurses, so don't unconditionally include it. Instead,
use pkg-config.
The $(STATIC) ugliness is to handle the reported build case from commit 6b533269fb25 ("tools/thermal: tmon: fix compilation errors when building
statically"), where a developer wants to be able to build with:
make LDFLAGS=-static
which requires an additional pkg-config flag.
Finally, support a lowest common denominator fallback (-lpanel
-lncurses) for build systems that don't have pkg-config entries for
ncurses.
The number of rows in the dialog varies according to the number of cooling
devices. However, some of the windowing computations were assuming a
fixed number of rows. This computation is OK when we have between 4 and
9 cooling devices (and they wrap to the next column), but with fewer
devices, we end up printing off the end of the window.
This unifies the row computation into a single function and uses that
throughout the TUI code. This also accounts for increasing the number of
rows when there are more than 9 total cooling devices.
Brian Norris [Wed, 18 Feb 2015 02:18:29 +0000 (18:18 -0800)]
tools/thermal: tmon: add --target-temp parameter
If we launch in daemon mode (--daemon), we don't have the ncurses UI,
but we might want to set the target temperature still. For example,
someone might stick the following in their boot script:
Linus Torvalds [Sat, 28 Feb 2015 00:18:33 +0000 (16:18 -0800)]
Merge tag 'fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc
Pull ARM SoC fixes from Arnd Bergmann:
"The arm-soc bug fixes this time around are mostly for the omap
platform, coming from a pull request from Tony Lindgren and are almost
entirely fixing dts files.
The other two changes enable support for the shmobile platform in
generic armv7 kernels and change some properties in the ARM64
reference board dts files"
* tag 'fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc:
ARM: multi_v7_defconfig: Enable shmobile platforms
arm64: Add L2 cache topology to ARM Ltd boards/models
ARM: dts: am335x-bone*: usb0 is hardwired for peripheral
ARM: dts: dra7x-evm: beagle-x15: Fix USB Host
ARM: omap2plus_defconfig: Fix SATA boot
ARM: omap2plus_defconfig: Enable OMAP NAND BCH driver
ARM: dts: dra7: Correct the dma controller's property names
ARM: dts: omap5: Correct the dma controller's property names
ARM: dts: omap4: Correct the dma controller's property names
ARM: dts: omap3: Correct the dma controller's property names
ARM: dts: omap2: Correct the dma controller's property names
ARM: dts: am437x-idk: fix sleep pinctrl state
ARM: omap2plus_defconfig: enable TPS62362 regulator
ARM: dts: am437x-idk: fix TPS62362 i2c bus
ARM: dts: n900: Fix offset for smc91x ethernet
ARM: dts: n900: fix i2c bus numbering
ARM: dts: Fix USB dts configuration for dm816x
ARM: dts: OMAP5: Fix SATA PHY node
ARM: dts: DRA7: Fix SATA PHY node
Linus Torvalds [Sat, 28 Feb 2015 00:09:37 +0000 (16:09 -0800)]
Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 fixes from Catalin Marinas:
"Various arm64 fixes:
- ftrace branch generation fix
- branch instruction encoding fix
- include files, guards and unused prototypes clean-up
- minor VDSO ABI fix (clock_getres)
- PSCI functions moved to .S to avoid compilation error with gcc 5
- pte_modify fix to not ignore the mapping type
- crypto: AES interleaved increased to 4x (for performance reasons)
- text patching fix for modules
- swiotlb increased back to 64MB
- copy_siginfo_to_user32() fix for big endian"
* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
arm64: cpuidle: add asm/proc-fns.h inclusion
arm64: compat Fix siginfo_t -> compat_siginfo_t conversion on big endian
arm64: Increase the swiotlb buffer size 64MB
arm64: Fix text patching logic when using fixmap
arm64: crypto: increase AES interleave to 4x
arm64: enable PTE type bit in the mask for pte_modify
arm64: mm: remove unused functions and variable protoypes
arm64: psci: move psci firmware calls out of line
arm64: vdso: minor ABI fix for clock_getres
arm64: guard asm/assembler.h against multiple inclusions
arm64: insn: fix compare-and-branch encodings
arm64: ftrace: fix ftrace_modify_graph_caller for branch replace
Linus Torvalds [Sat, 28 Feb 2015 00:08:45 +0000 (16:08 -0800)]
Merge tag 'renesas-sh-drivers-for-v4.0' of git://git.kernel.org/pub/scm/linux/kernel/git/horms/renesas
Pull SH driver fix from Simon Horman:
"Disable PM runtime for multi-platform r8a7740 with genpd"
* tag 'renesas-sh-drivers-for-v4.0' of git://git.kernel.org/pub/scm/linux/kernel/git/horms/renesas:
drivers: sh: Disable PM runtime for multi-platform r8a7740 with genpd
Joachim Nilsson [Wed, 25 Feb 2015 15:15:02 +0000 (16:15 +0100)]
PCI: versatile: Update for list_for_each_entry() API change
In Linux 4.0-rc1 the ARM Versatile PCI build fails due to what
appears to be an API update. This patch is a very simple correction,
merely posted as a heads-up to the maintainers. Hopefully a better
fix can be forwarded to Linus.
[ arnd: the patch actually looks correct, so let's take this version ]
David S. Miller [Fri, 27 Feb 2015 22:48:17 +0000 (17:48 -0500)]
Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net
Jeff Kirsher says:
====================
Intel Wired LAN Driver Updates 2015-02-26
This series contains fixes for i40e and i40evf only.
Alexey Khoroshilov found a possible leak of 'cmd_buf' when copy_from_user()
failed in i40e_dbg_command_write(), so resolved by calling kfree().
Shannon provides a fix to ensure the shift and bitwise precedences do not
work backwards for us by adding parens. Fixed the driver to prevent
stray interrupts, and system log messages from unhandled interrupts,
by combining the ICR0 shutdown with the standard
interrupt shutdown and adding the interrupt clearing to the PCI shutdown
path. Fixed an issue where an NVM write times out before a transaction
can complete, so Shannon added logic to make another attempt by
reacquiring the semaphore and retrying the write; if that one retry fails,
we give up. Adds checks to pointers before their use to ensure
we do not try to dereference NULL pointers when returning values from the
AdminQ calls.
Akeem adds a check to bail out if the device is already down when checking
for Tx hang subtask.
Anjali fixes TSO with more than 8 frags per segment issue. The hardware
has some limitations which the driver needs to adhere to:
1) no more than 8 descriptors per packet on the wire
2) no header can span more than 3 descriptors
If one of these events happens, the hardware will generate an internal
error and freeze the Tx queue, so Anjali fixes this by linearizing the skb
to avoid these situations. Fixed an issue where the per Traffic Class
queue count was higher than queues enabled, which will fix a warning
with multiple function mode where systems regularly have more cores than
vectors. Fixed TCP/IPv6 over VXLAN Tx checksum offload, where we were
checking the outer protocol flags and deciding the flow for the inner
header.
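The two hardware limits listed above can be expressed as a simple predicate. This is a minimal sketch, assuming the caller has already counted the descriptors; the function and macro names are hypothetical and the driver's real check may be structured quite differently:

```c
#include <assert.h>
#include <stdbool.h>

/* Limits quoted in the text above; the macro names are hypothetical. */
#define MAX_DESCS_PER_PACKET 8
#define MAX_DESCS_PER_HEADER 3

/*
 * Return true when a packet would violate either limit and so should be
 * linearized (copied into one contiguous buffer) before transmission.
 */
static bool tx_needs_linearize(unsigned int descs_per_packet,
                               unsigned int descs_for_header)
{
    /* The header descriptors are a subset of the packet's descriptors. */
    assert(descs_for_header <= descs_per_packet);

    return descs_per_packet > MAX_DESCS_PER_PACKET ||
           descs_for_header > MAX_DESCS_PER_HEADER;
}
```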
Jesse fixes a race condition in the transmit hang detection. Before, we
were having issues with false Tx hang detection; now the driver is more
direct in its checks for forward progress, directly checking the head
write back address and tail register when determining progress. This
avoids Tx hangs where the software gets behind, because we are directly
checking hardware state when determining a hang state.
Neerav fixes the transmit ring Qset handle when DCB reconfigures. The issue
occurred when DCB was reconfigured to a single traffic class (TC) and the
driver did not reset the Tx ring Qset handle to correct the mapping, which
caused Tx queue disable timeouts. Also, as part of the DCB reconfiguration
flow, if the Tx queue disable times out, issue a PF reset to do some level
of recovery.
Mitch stops flow director on shutdown because, in some cases, the hardware
would continue to try to access the FDIR ring after entering D3Hot state,
which would cause either PCIe errors or NMIs, depending upon the system
configuration.
* NOTE * I have verified that this series of patches for net will not cause
any merge issues when you sync up your net tree with your net-next tree.
====================
Lendacky, Thomas [Wed, 25 Feb 2015 19:50:12 +0000 (13:50 -0600)]
amd-xgbe: Request IRQs only after driver is fully setup
It is possible that the hardware may not have been properly shut down
before this driver gets control, through use by firmware, for example.
Until the driver is loaded, interrupts associated with the hardware
could go pending. When the IRQs are requested, napi support has not
been initialized yet, but the ISR will get control and schedule napi
processing resulting in a kernel panic because the poll routine has not
been set.
Adjust the code so that the driver is fully ready to handle and process
interrupts as soon as the IRQs are requested. This involves requesting
and freeing IRQs during start and stop processing and ordering the napi
add and delete calls appropriately.
Also adjust the powerup and powerdown routines to match the start and
stop routines with regard to the ordering of tasks, including napi
related calls.
Trond Myklebust [Fri, 27 Feb 2015 22:04:17 +0000 (17:04 -0500)]
NFSv4: nfs4_open_recover_helper() must set share access
The share access mode is now specified as an argument in the nfs4_opendata,
and so nfs4_open_recover_helper() needs to call nfs4_map_atomic_open_share()
in order to set it.
Fixes: 6ae373394c42 ("NFSv4.1: Ask for no delegation on OPEN if using O_DIRECT")
Signed-off-by: Trond Myklebust <[email protected]>
David S. Miller [Fri, 27 Feb 2015 21:06:21 +0000 (16:06 -0500)]
Merge branch 'rhashtable'
Daniel Borkmann says:
====================
rhashtable updates
As discussed, I'm sending out rhashtable fixups for -net.
I have a couple more patches I was working on last week pending,
i.e. to get rid of ht->nelems and ht->shift atomic operations, which
speeds up pure insertions/deletions, e.g. on my laptop I have 2 threads,
inserting 7M entries each, that will reduce insertion time from ~1,450 ms
to 865 ms (performance should even be better after removing the
grow/shrink indirections). I guess that however is rather something
for net-next.
====================
Daniel Borkmann [Wed, 25 Feb 2015 15:31:54 +0000 (16:31 +0100)]
rhashtable: remove indirection for grow/shrink decision functions
Currently, all real users of rhashtable default their grow and shrink
decision functions to rht_grow_above_75() and rht_shrink_below_30(),
so that there's currently no need to have this explicitly selectable.
It can/should be generic and private inside rhashtable until a real
use case pops up. Since we can make this private, we save ourselves this
additional indirection layer and can improve insertion/deletion time
as well.
Daniel Borkmann [Wed, 25 Feb 2015 15:31:53 +0000 (16:31 +0100)]
rhashtable: unconditionally grow when max_shift is not specified
While commit c0c09bfdc415 ("rhashtable: avoid unnecessary wakeup for
worker queue") rightfully moved part of the decision making of
whether we should expand or shrink from the expand/shrink functions
themselves into insert/delete functions in order to avoid unnecessary
worker wake-ups, it however introduced a regression by doing so.
Before that change, if no max_shift was specified (= 0) on rhashtable
initialization, rhashtable_expand() would just grow unconditionally
and let the available memory be the limiting factor. After that
change, if no max_shift was specified, there would be _no_ expansion
step at all.
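The difference between the regressed and the intended behaviour can be shown with a small user-space model. This is a simplified sketch under stated assumptions, not the kernel code: struct table_model and all function names are hypothetical, and the 75% threshold is taken from the rht_grow_above_75() name mentioned in this series:

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the table state: it has 1 << shift buckets. */
struct table_model {
    size_t nelems;
    unsigned int shift;
    unsigned int max_shift; /* 0 means no upper limit was specified */
};

static size_t table_size(const struct table_model *t)
{
    assert(t->shift < sizeof(size_t) * CHAR_BIT);
    return (size_t)1 << t->shift;
}

/* Regressed decision: with max_shift == 0 the second test is never true. */
static bool grow_regressed(const struct table_model *t)
{
    return t->nelems > table_size(t) / 4 * 3 &&
           t->shift < t->max_shift;
}

/* Intended decision: an unspecified max_shift places no cap on growth. */
static bool grow_fixed(const struct table_model *t)
{
    return t->nelems > table_size(t) / 4 * 3 &&
           (t->max_shift == 0 || t->shift < t->max_shift);
}
```

For a table with 8 buckets holding 7 elements and no max_shift, the regressed check refuses to grow while the fixed check allows it, leaving available memory as the only limit.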
Given that netlink and tipc have a max_shift specified, it was not
visible there, but Josh Hunt reported that nft, which starts out
with a default element hint of 3 if not otherwise provided, would
slow e.g. inserts down tremendously as it cannot grow larger to
relax table occupancy.
Given that the test case verifies shrinks/expands manually, we also
must remove the pointers to the helper functions to explicitly avoid
parallel resizing on insertions/deletions. test_bucket_stats() and
test_rht_lookup() could also be wrapped around rhashtable mutex to
explicitly synchronize a walk from resizing, but I think that defeats
the actual test case which intended to have explicit test steps,
i.e. 1) inserts, 2) expands, 3) shrinks, 4) deletions, with object
verification after each stage.
The 2 that we use for copy_to_iter comes from sizeof(u16),
it used to be that way before the iov iter update.
Fix it up, making it obvious the size of stack access
is right.
Recent iterator-related changes in vhost made it
harder to follow the logic fixing up the header.
In fact, the fixup always happens at the same
offset: sizeof(virtio_net_hdr): sometimes the
fixup iterator is updated by copy_to_iter,
sometimes by iov_iter_advance.
Dan Carpenter [Wed, 25 Feb 2015 13:36:12 +0000 (16:36 +0300)]
rocker: silence shift wrapping warning
"val" is declared as a u64 so static checkers complain that this shift
can wrap. I don't have the hardware but it probably doesn't have over
31 ports. Still we may as well silence the warning even if it's not a
real bug.
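The class of warning described here comes from shifting a plain int constant and storing the result in a 64-bit variable: the shift is evaluated in 32 bits, so it misbehaves once the shift count reaches 31 or more. A minimal sketch of the usual remedy, widening the constant before the shift; port_bit() is a hypothetical helper and not code from the driver:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Build a single-port bitmask in 64 bits. Casting the constant to
 * uint64_t first means the shift itself is performed in 64 bits,
 * so port numbers of 31 and above are handled correctly.
 */
static uint64_t port_bit(unsigned int port)
{
    assert(port < 64);
    return (uint64_t)1 << port;
}
```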