| CVE |
Vendors |
Products |
Updated |
CVSS v3.1 |
| In the Linux kernel, the following vulnerability has been resolved:
io_uring/net: don't overflow multishot recv
Don't allow overflowing multishot recv CQEs, it might get out of
hand, hurt performance, and in the worst case scenario OOM the task. |
| In the Linux kernel, the following vulnerability has been resolved:
iio: core: Prevent invalid memory access when there is no parent
Commit 813665564b3d ("iio: core: Convert to use firmware node handle
instead of OF node") switched the kind of nodes to use for label
retrieval in device registration. Probably an unwanted change in that
commit was that if the device has no parent then NULL pointer is
accessed. This is what happens in the stock IIO dummy driver when a
new entry is created in configfs:
# mkdir /sys/kernel/config/iio/devices/dummy/foo
BUG: kernel NULL pointer dereference, address: ...
...
Call Trace:
__iio_device_register
iio_dummy_probe
Since there seems to be no reason to make a parent device of an IIO
dummy device mandatory, let’s prevent the invalid memory access in
__iio_device_register when the parent device is NULL. With this
change, the IIO dummy driver works fine with configfs. |
| In the Linux kernel, the following vulnerability has been resolved:
thermal: of: fix double-free on unregistration
Since commit 3d439b1a2ad3 ("thermal/core: Alloc-copy-free the thermal
zone parameters structure"), thermal_zone_device_register() allocates
a copy of the tzp argument and frees it when unregistering, so
thermal_of_zone_register() now ends up leaking its original tzp and
double-freeing the tzp copy. Fix this by locating tzp on stack instead. |
| In the Linux kernel, the following vulnerability has been resolved:
ping: Fix potentail NULL deref for /proc/net/icmp.
After commit dbca1596bbb0 ("ping: convert to RCU lookups, get rid
of rwlock"), we use RCU for ping sockets, but we should use spinlock
for /proc/net/icmp to avoid a potential NULL deref mentioned in
the previous patch.
Let's go back to using spinlock there.
Note we can convert ping sockets to use hlist instead of hlist_nulls
because we do not use SLAB_TYPESAFE_BY_RCU for ping sockets. |
| In the Linux kernel, the following vulnerability has been resolved:
vmci_host: fix a race condition in vmci_host_poll() causing GPF
During fuzzing, a general protection fault is observed in
vmci_host_poll().
general protection fault, probably for non-canonical address 0xdffffc0000000019: 0000 [#1] PREEMPT SMP KASAN
KASAN: null-ptr-deref in range [0x00000000000000c8-0x00000000000000cf]
RIP: 0010:__lock_acquire+0xf3/0x5e00 kernel/locking/lockdep.c:4926
<- omitting registers ->
Call Trace:
<TASK>
lock_acquire+0x1a4/0x4a0 kernel/locking/lockdep.c:5672
__raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline]
_raw_spin_lock_irqsave+0xb3/0x100 kernel/locking/spinlock.c:162
add_wait_queue+0x3d/0x260 kernel/sched/wait.c:22
poll_wait include/linux/poll.h:49 [inline]
vmci_host_poll+0xf8/0x2b0 drivers/misc/vmw_vmci/vmci_host.c:174
vfs_poll include/linux/poll.h:88 [inline]
do_pollfd fs/select.c:873 [inline]
do_poll fs/select.c:921 [inline]
do_sys_poll+0xc7c/0x1aa0 fs/select.c:1015
__do_sys_ppoll fs/select.c:1121 [inline]
__se_sys_ppoll+0x2cc/0x330 fs/select.c:1101
do_syscall_x64 arch/x86/entry/common.c:51 [inline]
do_syscall_64+0x4e/0xa0 arch/x86/entry/common.c:82
entry_SYSCALL_64_after_hwframe+0x46/0xb0
Example thread interleaving that causes the general protection fault
is as follows:
CPU1 (vmci_host_poll) CPU2 (vmci_host_do_init_context)
----- -----
// Read uninitialized context
context = vmci_host_dev->context;
// Initialize context
vmci_host_dev->context = vmci_ctx_create();
vmci_host_dev->ct_type = VMCIOBJ_CONTEXT;
if (vmci_host_dev->ct_type == VMCIOBJ_CONTEXT) {
// Dereferencing the wrong pointer
poll_wait(..., &context->host_context);
}
In this scenario, vmci_host_poll() reads vmci_host_dev->context first,
and then reads vmci_host_dev->ct_type to check that
vmci_host_dev->context is initialized. However, since these two reads
are not atomically executed, there is a chance of a race condition as
described above.
To fix this race condition, read vmci_host_dev->context after checking
the value of vmci_host_dev->ct_type so that vmci_host_poll() always
reads an initialized context. |
| In the Linux kernel, the following vulnerability has been resolved:
KVM: Destroy target device if coalesced MMIO unregistration fails
Destroy and free the target coalesced MMIO device if unregistering said
device fails. As clearly noted in the code, kvm_io_bus_unregister_dev()
does not destroy the target device.
BUG: memory leak
unreferenced object 0xffff888112a54880 (size 64):
comm "syz-executor.2", pid 5258, jiffies 4297861402 (age 14.129s)
hex dump (first 32 bytes):
38 c7 67 15 00 c9 ff ff 38 c7 67 15 00 c9 ff ff 8.g.....8.g.....
e0 c7 e1 83 ff ff ff ff 00 30 67 15 00 c9 ff ff .........0g.....
backtrace:
[<0000000006995a8a>] kmalloc include/linux/slab.h:556 [inline]
[<0000000006995a8a>] kzalloc include/linux/slab.h:690 [inline]
[<0000000006995a8a>] kvm_vm_ioctl_register_coalesced_mmio+0x8e/0x3d0 arch/x86/kvm/../../../virt/kvm/coalesced_mmio.c:150
[<00000000022550c2>] kvm_vm_ioctl+0x47d/0x1600 arch/x86/kvm/../../../virt/kvm/kvm_main.c:3323
[<000000008a75102f>] vfs_ioctl fs/ioctl.c:46 [inline]
[<000000008a75102f>] file_ioctl fs/ioctl.c:509 [inline]
[<000000008a75102f>] do_vfs_ioctl+0xbab/0x1160 fs/ioctl.c:696
[<0000000080e3f669>] ksys_ioctl+0x76/0xa0 fs/ioctl.c:713
[<0000000059ef4888>] __do_sys_ioctl fs/ioctl.c:720 [inline]
[<0000000059ef4888>] __se_sys_ioctl fs/ioctl.c:718 [inline]
[<0000000059ef4888>] __x64_sys_ioctl+0x6f/0xb0 fs/ioctl.c:718
[<000000006444fa05>] do_syscall_64+0x9f/0x4e0 arch/x86/entry/common.c:290
[<000000009a4ed50b>] entry_SYSCALL_64_after_hwframe+0x49/0xbe
BUG: leak checking failed |
| In the Linux kernel, the following vulnerability has been resolved:
wifi: rsi: Do not configure WoWlan in shutdown hook if not enabled
In case WoWlan was never configured during the operation of the system,
the hw->wiphy->wowlan_config will be NULL. rsi_config_wowlan() checks
whether wowlan_config is non-NULL and if it is not, then WARNs about it.
The warning is valid, as during normal operation the rsi_config_wowlan()
should only ever be called with non-NULL wowlan_config. In shutdown this
rsi_config_wowlan() should only ever be called if WoWlan was configured
before by the user.
Add checks for non-NULL wowlan_config into the shutdown hook. While at it,
check whether the wiphy is also non-NULL before accessing wowlan_config .
Drop the single-use wowlan_config variable, just inline it into function
call. |
| In the Linux kernel, the following vulnerability has been resolved:
bpf: fix a memory leak in the LRU and LRU_PERCPU hash maps
The LRU and LRU_PERCPU maps allocate a new element on update before locking the
target hash table bucket. Right after that the maps try to lock the bucket.
If this fails, then maps return -EBUSY to the caller without releasing the
allocated element. This makes the element untracked: it doesn't belong to
either of free lists, and it doesn't belong to the hash table, so can't be
re-used; this eventually leads to the permanent -ENOMEM on LRU map updates,
which is unexpected. Fix this by returning the element to the local free list
if bucket locking fails. |
| In the Linux kernel, the following vulnerability has been resolved:
opp: Fix use-after-free in lazy_opp_tables after probe deferral
When dev_pm_opp_of_find_icc_paths() in _allocate_opp_table() returns
-EPROBE_DEFER, the opp_table is freed again, to wait until all the
interconnect paths are available.
However, if the OPP table is using required-opps then it may already
have been added to the global lazy_opp_tables list. The error path
does not remove the opp_table from the list again.
This can cause crashes later when the provider of the required-opps
is added, since we will iterate over OPP tables that have already been
freed. E.g.:
Unable to handle kernel NULL pointer dereference when read
CPU: 0 PID: 7 Comm: kworker/0:0 Not tainted 6.4.0-rc3
PC is at _of_add_opp_table_v2 (include/linux/of.h:949
drivers/opp/of.c:98 drivers/opp/of.c:344 drivers/opp/of.c:404
drivers/opp/of.c:1032) -> lazy_link_required_opp_table()
Fix this by calling _of_clear_opp_table() to remove the opp_table from
the list and clear other allocated resources. While at it, also add the
missing mutex_destroy() calls in the error path. |
| In the Linux kernel, the following vulnerability has been resolved:
ice: prevent NULL pointer deref during reload
Calling ethtool during reload can lead to call trace, because VSI isn't
configured for some time, but netdev is alive.
To fix it add rtnl lock for VSI deconfig and config. Set ::num_q_vectors
to 0 after freeing and add a check for ::tx/rx_rings in ring related
ethtool ops.
Add proper unroll of filters in ice_start_eth().
Reproduction:
$watch -n 0.1 -d 'ethtool -g enp24s0f0np0'
$devlink dev reload pci/0000:18:00.0 action driver_reinit
Call trace before fix:
[66303.926205] BUG: kernel NULL pointer dereference, address: 0000000000000000
[66303.926259] #PF: supervisor read access in kernel mode
[66303.926286] #PF: error_code(0x0000) - not-present page
[66303.926311] PGD 0 P4D 0
[66303.926332] Oops: 0000 [#1] PREEMPT SMP PTI
[66303.926358] CPU: 4 PID: 933821 Comm: ethtool Kdump: loaded Tainted: G OE 6.4.0-rc5+ #1
[66303.926400] Hardware name: Intel Corporation S2600WFT/S2600WFT, BIOS SE5C620.86B.00.01.0014.070920180847 07/09/2018
[66303.926446] RIP: 0010:ice_get_ringparam+0x22/0x50 [ice]
[66303.926649] Code: 90 90 90 90 90 90 90 90 f3 0f 1e fa 0f 1f 44 00 00 48 8b 87 c0 09 00 00 c7 46 04 e0 1f 00 00 c7 46 10 e0 1f 00 00 48 8b 50 20 <48> 8b 12 0f b7 52 3a 89 56 14 48 8b 40 28 48 8b 00 0f b7 40 58 48
[66303.926722] RSP: 0018:ffffad40472f39c8 EFLAGS: 00010246
[66303.926749] RAX: ffff98a8ada05828 RBX: ffff98a8c46dd060 RCX: ffffad40472f3b48
[66303.926781] RDX: 0000000000000000 RSI: ffff98a8c46dd068 RDI: ffff98a8b23c4000
[66303.926811] RBP: ffffad40472f3b48 R08: 00000000000337b0 R09: 0000000000000000
[66303.926843] R10: 0000000000000001 R11: 0000000000000100 R12: ffff98a8b23c4000
[66303.926874] R13: ffff98a8c46dd060 R14: 000000000000000f R15: ffffad40472f3a50
[66303.926906] FS: 00007f6397966740(0000) GS:ffff98b390900000(0000) knlGS:0000000000000000
[66303.926941] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[66303.926967] CR2: 0000000000000000 CR3: 000000011ac20002 CR4: 00000000007706e0
[66303.926999] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[66303.927029] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[66303.927060] PKRU: 55555554
[66303.927075] Call Trace:
[66303.927094] <TASK>
[66303.927111] ? __die+0x23/0x70
[66303.927140] ? page_fault_oops+0x171/0x4e0
[66303.927176] ? exc_page_fault+0x7f/0x180
[66303.927209] ? asm_exc_page_fault+0x26/0x30
[66303.927244] ? ice_get_ringparam+0x22/0x50 [ice]
[66303.927433] rings_prepare_data+0x62/0x80
[66303.927469] ethnl_default_doit+0xe2/0x350
[66303.927501] genl_family_rcv_msg_doit.isra.0+0xe3/0x140
[66303.927538] genl_rcv_msg+0x1b1/0x2c0
[66303.927561] ? __pfx_ethnl_default_doit+0x10/0x10
[66303.927590] ? __pfx_genl_rcv_msg+0x10/0x10
[66303.927615] netlink_rcv_skb+0x58/0x110
[66303.927644] genl_rcv+0x28/0x40
[66303.927665] netlink_unicast+0x19e/0x290
[66303.927691] netlink_sendmsg+0x254/0x4d0
[66303.927717] sock_sendmsg+0x93/0xa0
[66303.927743] __sys_sendto+0x126/0x170
[66303.927780] __x64_sys_sendto+0x24/0x30
[66303.928593] do_syscall_64+0x5d/0x90
[66303.929370] ? __count_memcg_events+0x60/0xa0
[66303.930146] ? count_memcg_events.constprop.0+0x1a/0x30
[66303.930920] ? handle_mm_fault+0x9e/0x350
[66303.931688] ? do_user_addr_fault+0x258/0x740
[66303.932452] ? exc_page_fault+0x7f/0x180
[66303.933193] entry_SYSCALL_64_after_hwframe+0x72/0xdc |
| In the Linux kernel, the following vulnerability has been resolved:
wifi: rtl8xxxu: Fix memory leaks with RTL8723BU, RTL8192EU
The wifi + bluetooth combo chip RTL8723BU can leak memory (especially?)
when it's connected to a bluetooth audio device. The busy bluetooth
traffic generates lots of C2H (card to host) messages, which are not
freed correctly.
To fix this, move the dev_kfree_skb() call in rtl8xxxu_c2hcmd_callback()
inside the loop where skb_dequeue() is called.
The RTL8192EU leaks memory because the C2H messages are added to the
queue and left there forever. (This was fine in the past because it
probably wasn't sending any C2H messages until commit e542e66b7c2e
("wifi: rtl8xxxu: gen2: Turn on the rate control"). Since that commit
it sends a C2H message when the TX rate changes.)
To fix this, delete the check for rf_paths > 1 and the goto. Let the
function process the C2H messages from RTL8192EU like the ones from
the other chips.
Theoretically the RTL8188FU could also leak like RTL8723BU, but it
most likely doesn't send C2H messages frequently enough.
This change was tested with RTL8723BU by Erhard F. I tested it with
RTL8188FU and RTL8192EU. |
| In the Linux kernel, the following vulnerability has been resolved:
ionic: remove WARN_ON to prevent panic_on_warn
Remove unnecessary early code development check and the WARN_ON
that it uses. The irq alloc and free paths have long been
cleaned up and this check shouldn't have stuck around so long. |
| In the Linux kernel, the following vulnerability has been resolved:
netfilter: nf_tables: fix underflow in chain reference counter
Set element addition error path decrements reference counter on chains
twice: once on element release and again via nft_data_release().
Then, d6b478666ffa ("netfilter: nf_tables: fix underflow in object
reference counter") incorrectly fixed this by removing the stateful
object reference count decrement.
Restore the stateful object decrement as in b91d90368837 ("netfilter:
nf_tables: fix leaking object reference count") and let
nft_data_release() decrement the chain reference counter, so this is
done only once. |
| In the Linux kernel, the following vulnerability has been resolved:
ice: fix wrong fallback logic for FDIR
When adding a FDIR filter, if ice_vc_fdir_set_irq_ctx returns failure,
the inserted fdir entry will not be removed and if ice_vc_fdir_write_fltr
returns failure, the fdir context info for irq handler will not be cleared
which may lead to inconsistent or memory leak issue. This patch refines
failure cases to resolve this issue. |
| In the Linux kernel, the following vulnerability has been resolved:
net: ipv4: fix one memleak in __inet_del_ifa()
I got the below warning when do fuzzing test:
unregister_netdevice: waiting for bond0 to become free. Usage count = 2
It can be repoduced via:
ip link add bond0 type bond
sysctl -w net.ipv4.conf.bond0.promote_secondaries=1
ip addr add 4.117.174.103/0 scope 0x40 dev bond0
ip addr add 192.168.100.111/255.255.255.254 scope 0 dev bond0
ip addr add 0.0.0.4/0 scope 0x40 secondary dev bond0
ip addr del 4.117.174.103/0 scope 0x40 dev bond0
ip link delete bond0 type bond
In this reproduction test case, an incorrect 'last_prim' is found in
__inet_del_ifa(), as a result, the secondary address(0.0.0.4/0 scope 0x40)
is lost. The memory of the secondary address is leaked and the reference of
in_device and net_device is leaked.
Fix this problem:
Look for 'last_prim' starting at location of the deleted IP and inserting
the promoted IP into the location of 'last_prim'. |
| In the Linux kernel, the following vulnerability has been resolved:
dmaengine: sf-pdma: pdma_desc memory leak fix
Commit b2cc5c465c2c ("dmaengine: sf-pdma: Add multithread support for a
DMA channel") changed sf_pdma_prep_dma_memcpy() to unconditionally
allocate a new sf_pdma_desc each time it is called.
The driver previously recycled descs, by checking the in_use flag, only
allocating additional descs if the existing one was in use. This logic
was removed in commit b2cc5c465c2c ("dmaengine: sf-pdma: Add multithread
support for a DMA channel"), but sf_pdma_free_desc() was not changed to
handle the new behaviour.
As a result, each time sf_pdma_prep_dma_memcpy() is called, the previous
descriptor is leaked, over time leading to memory starvation:
unreferenced object 0xffffffe008447300 (size 192):
comm "irq/39-mchp_dsc", pid 343, jiffies 4294906910 (age 981.200s)
hex dump (first 32 bytes):
00 00 00 ff 00 00 00 00 b8 c1 00 00 00 00 00 00 ................
00 00 70 08 10 00 00 00 00 00 00 c0 00 00 00 00 ..p.............
backtrace:
[<00000000064a04f4>] kmemleak_alloc+0x1e/0x28
[<00000000018927a7>] kmem_cache_alloc+0x11e/0x178
[<000000002aea8d16>] sf_pdma_prep_dma_memcpy+0x40/0x112
Add the missing kfree() to sf_pdma_free_desc(), and remove the redundant
in_use flag. |
| In the Linux kernel, the following vulnerability has been resolved:
can: j1939: j1939_tp_tx_dat_new(): fix out-of-bounds memory access
In the j1939_tp_tx_dat_new() function, an out-of-bounds memory access
could occur during the memcpy() operation if the size of skb->cb is
larger than the size of struct j1939_sk_buff_cb. This is because the
memcpy() operation uses the size of skb->cb, leading to a read beyond
the struct j1939_sk_buff_cb.
Updated the memcpy() operation to use the size of struct
j1939_sk_buff_cb instead of the size of skb->cb. This ensures that the
memcpy() operation only reads the memory within the bounds of struct
j1939_sk_buff_cb, preventing out-of-bounds memory access.
Additionally, add a BUILD_BUG_ON() to check that the size of skb->cb
is greater than or equal to the size of struct j1939_sk_buff_cb. This
ensures that the skb->cb buffer is large enough to hold the
j1939_sk_buff_cb structure.
[mkl: rephrase commit message] |
| In the Linux kernel, the following vulnerability has been resolved:
virtio_vdpa: build affinity masks conditionally
We try to build affinity mask via create_affinity_masks()
unconditionally which may lead several issues:
- the affinity mask is not used for parent without affinity support
(only VDUSE support the affinity now)
- the logic of create_affinity_masks() might not work for devices
other than block. For example it's not rare in the networking device
where the number of queues could exceed the number of CPUs. Such
case breaks the current affinity logic which is based on
group_cpus_evenly() who assumes the number of CPUs are not less than
the number of groups. This can trigger a warning[1]:
if (ret >= 0)
WARN_ON(nr_present + nr_others < numgrps);
Fixing this by only build the affinity masks only when
- Driver passes affinity descriptor, driver like virtio-blk can make
sure to limit the number of queues when it exceeds the number of CPUs
- Parent support affinity setting config ops
This help to avoid the warning. More optimizations could be done on
top.
[1]
[ 682.146655] WARNING: CPU: 6 PID: 1550 at lib/group_cpus.c:400 group_cpus_evenly+0x1aa/0x1c0
[ 682.146668] CPU: 6 PID: 1550 Comm: vdpa Not tainted 6.5.0-rc5jason+ #79
[ 682.146671] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
[ 682.146673] RIP: 0010:group_cpus_evenly+0x1aa/0x1c0
[ 682.146676] Code: 4c 89 e0 5b 5d 41 5c 41 5d 41 5e c3 cc cc cc cc e8 1b c4 74 ff 48 89 ef e8 13 ac 98 ff 4c 89 e7 45 31 e4 e8 08 ac 98 ff eb c2 <0f> 0b eb b6 e8 fd 05 c3 00 45 31 e4 eb e5 cc cc cc cc cc cc cc cc
[ 682.146679] RSP: 0018:ffffc9000215f498 EFLAGS: 00010293
[ 682.146682] RAX: 000000000001f1e0 RBX: 0000000000000041 RCX: 0000000000000000
[ 682.146684] RDX: ffff888109922058 RSI: 0000000000000041 RDI: 0000000000000030
[ 682.146686] RBP: ffff888109922058 R08: ffffc9000215f498 R09: ffffc9000215f4a0
[ 682.146687] R10: 00000000000198d0 R11: 0000000000000030 R12: ffff888107e02800
[ 682.146689] R13: 0000000000000030 R14: 0000000000000030 R15: 0000000000000041
[ 682.146692] FS: 00007fef52315740(0000) GS:ffff888237380000(0000) knlGS:0000000000000000
[ 682.146695] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 682.146696] CR2: 00007fef52509000 CR3: 0000000110dbc004 CR4: 0000000000370ee0
[ 682.146698] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 682.146700] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 682.146701] Call Trace:
[ 682.146703] <TASK>
[ 682.146705] ? __warn+0x7b/0x130
[ 682.146709] ? group_cpus_evenly+0x1aa/0x1c0
[ 682.146712] ? report_bug+0x1c8/0x1e0
[ 682.146717] ? handle_bug+0x3c/0x70
[ 682.146721] ? exc_invalid_op+0x14/0x70
[ 682.146723] ? asm_exc_invalid_op+0x16/0x20
[ 682.146727] ? group_cpus_evenly+0x1aa/0x1c0
[ 682.146729] ? group_cpus_evenly+0x15c/0x1c0
[ 682.146731] create_affinity_masks+0xaf/0x1a0
[ 682.146735] virtio_vdpa_find_vqs+0x83/0x1d0
[ 682.146738] ? __pfx_default_calc_sets+0x10/0x10
[ 682.146742] virtnet_find_vqs+0x1f0/0x370
[ 682.146747] virtnet_probe+0x501/0xcd0
[ 682.146749] ? vp_modern_get_status+0x12/0x20
[ 682.146751] ? get_cap_addr.isra.0+0x10/0xc0
[ 682.146754] virtio_dev_probe+0x1af/0x260
[ 682.146759] really_probe+0x1a5/0x410 |
| In the Linux kernel, the following vulnerability has been resolved:
vdpa_sim: fix possible memory leak in vdpasim_net_init() and vdpasim_blk_init()
Inject fault while probing module, if device_register() fails in
vdpasim_net_init() or vdpasim_blk_init(), but the refcount of kobject is
not decreased to 0, the name allocated in dev_set_name() is leaked.
Fix this by calling put_device(), so that name can be freed in
callback function kobject_cleanup().
(vdpa_sim_net)
unreferenced object 0xffff88807eebc370 (size 16):
comm "modprobe", pid 3848, jiffies 4362982860 (age 18.153s)
hex dump (first 16 bytes):
76 64 70 61 73 69 6d 5f 6e 65 74 00 6b 6b 6b a5 vdpasim_net.kkk.
backtrace:
[<ffffffff8174f19e>] __kmalloc_node_track_caller+0x4e/0x150
[<ffffffff81731d53>] kstrdup+0x33/0x60
[<ffffffff83a5d421>] kobject_set_name_vargs+0x41/0x110
[<ffffffff82d87aab>] dev_set_name+0xab/0xe0
[<ffffffff82d91a23>] device_add+0xe3/0x1a80
[<ffffffffa0270013>] 0xffffffffa0270013
[<ffffffff81001c27>] do_one_initcall+0x87/0x2e0
[<ffffffff813739cb>] do_init_module+0x1ab/0x640
[<ffffffff81379d20>] load_module+0x5d00/0x77f0
[<ffffffff8137bc40>] __do_sys_finit_module+0x110/0x1b0
[<ffffffff83c4d505>] do_syscall_64+0x35/0x80
[<ffffffff83e0006a>] entry_SYSCALL_64_after_hwframe+0x46/0xb0
(vdpa_sim_blk)
unreferenced object 0xffff8881070c1250 (size 16):
comm "modprobe", pid 6844, jiffies 4364069319 (age 17.572s)
hex dump (first 16 bytes):
76 64 70 61 73 69 6d 5f 62 6c 6b 00 6b 6b 6b a5 vdpasim_blk.kkk.
backtrace:
[<ffffffff8174f19e>] __kmalloc_node_track_caller+0x4e/0x150
[<ffffffff81731d53>] kstrdup+0x33/0x60
[<ffffffff83a5d421>] kobject_set_name_vargs+0x41/0x110
[<ffffffff82d87aab>] dev_set_name+0xab/0xe0
[<ffffffff82d91a23>] device_add+0xe3/0x1a80
[<ffffffffa0220013>] 0xffffffffa0220013
[<ffffffff81001c27>] do_one_initcall+0x87/0x2e0
[<ffffffff813739cb>] do_init_module+0x1ab/0x640
[<ffffffff81379d20>] load_module+0x5d00/0x77f0
[<ffffffff8137bc40>] __do_sys_finit_module+0x110/0x1b0
[<ffffffff83c4d505>] do_syscall_64+0x35/0x80
[<ffffffff83e0006a>] entry_SYSCALL_64_after_hwframe+0x46/0xb0 |
| In the Linux kernel, the following vulnerability has been resolved:
btrfs: fix racy bitfield write in btrfs_clear_space_info_full()
From the memory-barriers.txt document regarding memory barrier ordering
guarantees:
(*) These guarantees do not apply to bitfields, because compilers often
generate code to modify these using non-atomic read-modify-write
sequences. Do not attempt to use bitfields to synchronize parallel
algorithms.
(*) Even in cases where bitfields are protected by locks, all fields
in a given bitfield must be protected by one lock. If two fields
in a given bitfield are protected by different locks, the compiler's
non-atomic read-modify-write sequences can cause an update to one
field to corrupt the value of an adjacent field.
btrfs_space_info has a bitfield sharing an underlying word consisting of
the fields full, chunk_alloc, and flush:
struct btrfs_space_info {
struct btrfs_fs_info * fs_info; /* 0 8 */
struct btrfs_space_info * parent; /* 8 8 */
...
int clamp; /* 172 4 */
unsigned int full:1; /* 176: 0 4 */
unsigned int chunk_alloc:1; /* 176: 1 4 */
unsigned int flush:1; /* 176: 2 4 */
...
Therefore, to be safe from parallel read-modify-writes losing a write to
one of the bitfield members protected by a lock, all writes to all the
bitfields must use the lock. They almost universally do, except for
btrfs_clear_space_info_full() which iterates over the space_infos and
writes out found->full = 0 without a lock.
Imagine that we have one thread completing a transaction in which we
finished deleting a block_group and are thus calling
btrfs_clear_space_info_full() while simultaneously the data reclaim
ticket infrastructure is running do_async_reclaim_data_space():
T1 T2
btrfs_commit_transaction
btrfs_clear_space_info_full
data_sinfo->full = 0
READ: full:0, chunk_alloc:0, flush:1
do_async_reclaim_data_space(data_sinfo)
spin_lock(&space_info->lock);
if(list_empty(tickets))
space_info->flush = 0;
READ: full: 0, chunk_alloc:0, flush:1
MOD/WRITE: full: 0, chunk_alloc:0, flush:0
spin_unlock(&space_info->lock);
return;
MOD/WRITE: full:0, chunk_alloc:0, flush:1
and now data_sinfo->flush is 1 but the reclaim worker has exited. This
breaks the invariant that flush is 0 iff there is no work queued or
running. Once this invariant is violated, future allocations that go
into __reserve_bytes() will add tickets to space_info->tickets but will
see space_info->flush is set to 1 and not queue the work. After this,
they will block forever on the resulting ticket, as it is now impossible
to kick the worker again.
I also confirmed by looking at the assembly of the affected kernel that
it is doing RMW operations. For example, to set the flush (3rd) bit to 0,
the assembly is:
andb $0xfb,0x60(%rbx)
and similarly for setting the full (1st) bit to 0:
andb $0xfe,-0x20(%rax)
So I think this is really a bug on practical systems. I have observed
a number of systems in this exact state, but am currently unable to
reproduce it.
Rather than leaving this footgun lying around for the future, take
advantage of the fact that there is room in the struct anyway, and that
it is already quite large and simply change the three bitfield members to
bools. This avoids writes to space_info->full having any effect on
---truncated--- |