date: 2022-02-10 17:25:08
categories: linux
firstPublishDate: 2022-02-10 17:25:08
Last night my backup server panicked during a `zfs recv` command. I'm running Debian bullseye:
Linux 5.10.0-11-amd64 #1 SMP Debian 5.10.92-1 (2022-01-18) x86_64 GNU/Linux
Source: zfs-linux
Version: 2.0.3-9
Feb 10 00:20:41 kernel: [5032630.764473] VERIFY3(sa.sa_magic == SA_MAGIC) failed (2576474 == 3100762)
Feb 10 00:20:41 kernel: [5032630.764474] VERIFY3(sa.sa_magic == SA_MAGIC) failed (2576474 == 3100762)
Feb 10 00:20:41 kernel: [5032630.764476] PANIC at zfs_quota.c:89:zpl_get_file_info()
Feb 10 00:20:41 kernel: [5032630.764480] PANIC at zfs_quota.c:89:zpl_get_file_info()
Feb 10 00:20:41 kernel: [5032630.764482] Showing stack for process 2333
Feb 10 00:20:41 kernel: [5032630.764482] Showing stack for process 2327
Feb 10 00:20:41 kernel: [5032630.764484] CPU: 3 PID: 2327 Comm: dp_sync_taskq Tainted: P OE 5.10.0-9-amd64 #1 Debian 5.10.70-1
Feb 10 00:20:41 kernel: [5032630.764485] Hardware name: ASUS System Product Name/ROG STRIX B460-I GAMING, BIOS 0211 03/13/2020
Feb 10 00:20:41 kernel: [5032630.764486] Call Trace:
Feb 10 00:20:41 kernel: [5032630.764496] dump_stack+0x6b/0x83
Feb 10 00:20:41 kernel: [5032630.764508] spl_panic+0xd4/0xfc [spl]
Feb 10 00:20:41 kernel: [5032630.764617] ? dbuf_sync_leaf+0x44f/0x590 [zfs]
Feb 10 00:20:41 kernel: [5032630.764667] zpl_get_file_info+0x9b/0x220 [zfs]
Feb 10 00:20:41 kernel: [5032630.764687] dmu_objset_userquota_get_ids+0x11d/0x490 [zfs]
Feb 10 00:20:41 kernel: [5032630.764687] VERIFY3(sa.sa_magic == SA_MAGIC) failed (2576474 == 3100762)
Feb 10 00:20:41 kernel: [5032630.764688] PANIC at zfs_quota.c:89:zpl_get_file_info()
Feb 10 00:20:41 kernel: [5032630.764689] Showing stack for process 2328
Feb 10 00:20:41 kernel: [5032630.764710] dnode_sync+0x11a/0xa30 [zfs]
Feb 10 00:20:41 kernel: [5032630.764715] ? cpumask_next+0x17/0x20
Feb 10 00:20:41 kernel: [5032630.764717] ? __update_idle_core+0x55/0xa0
Feb 10 00:20:41 kernel: [5032630.764721] ? __switch_to_asm+0x42/0x70
Feb 10 00:20:41 kernel: [5032630.764724] ? __switch_to+0x114/0x460
Feb 10 00:20:41 kernel: [5032630.764728] ? _cond_resched+0x16/0x40
Feb 10 00:20:41 kernel: [5032630.764746] sync_dnodes_task+0x58/0x130 [zfs]
Feb 10 00:20:41 kernel: [5032630.764754] taskq_thread+0x2da/0x520 [spl]
Feb 10 00:20:41 kernel: [5032630.764757] ? wake_up_q+0xa0/0xa0
Feb 10 00:20:41 kernel: [5032630.764761] ? taskq_thread_spawn+0x50/0x50 [spl]
Feb 10 00:20:41 kernel: [5032630.764763] kthread+0x11b/0x140
Feb 10 00:20:41 kernel: [5032630.764765] ? __kthread_bind_mask+0x60/0x60
Feb 10 00:20:41 kernel: [5032630.764766] ret_from_fork+0x1f/0x30
Feb 10 00:20:41 kernel: [5032630.764769] CPU: 4 PID: 2328 Comm: dp_sync_taskq Tainted: P OE 5.10.0-9-amd64 #1 Debian 5.10.70-1
Feb 10 00:20:41 kernel: [5032630.764773] Hardware name: ASUS System Product Name/ROG STRIX B460-I GAMING, BIOS 0211 03/13/2020
Feb 10 00:20:41 kernel: [5032630.764774] Call Trace:
Feb 10 00:20:41 kernel: [5032630.764779] dump_stack+0x6b/0x83
Feb 10 00:20:41 kernel: [5032630.764784] spl_panic+0xd4/0xfc [spl]
Feb 10 00:20:41 kernel: [5032630.764818] ? zio_free_sync+0xda/0xf0 [zfs]
Feb 10 00:20:41 kernel: [5032630.764821] ? _cond_resched+0x16/0x40
Feb 10 00:20:41 kernel: [5032630.764825] ? irq_exit_rcu+0x3e/0xc0
Feb 10 00:20:41 kernel: [5032630.764847] ? dbuf_sync_leaf+0x44f/0x590 [zfs]
Feb 10 00:20:41 kernel: [5032630.764880] zpl_get_file_info+0x9b/0x220 [zfs]
Feb 10 00:20:41 kernel: [5032630.764904] dmu_objset_userquota_get_ids+0x11d/0x490 [zfs]
Feb 10 00:20:41 kernel: [5032630.764930] dnode_sync+0x11a/0xa30 [zfs]
Feb 10 00:20:41 kernel: [5032630.764932] ? __switch_to_asm+0x42/0x70
Feb 10 00:20:41 kernel: [5032630.764934] ? __switch_to+0x114/0x460
Feb 10 00:20:41 kernel: [5032630.764935] ? _cond_resched+0x16/0x40
Feb 10 00:20:41 kernel: [5032630.764959] sync_dnodes_task+0x58/0x130 [zfs]
Feb 10 00:20:41 kernel: [5032630.764963] taskq_thread+0x2da/0x520 [spl]
Feb 10 00:20:41 kernel: [5032630.764965] ? wake_up_q+0xa0/0xa0
Feb 10 00:20:41 kernel: [5032630.764969] ? taskq_thread_spawn+0x50/0x50 [spl]
Feb 10 00:20:41 kernel: [5032630.764971] kthread+0x11b/0x140
Feb 10 00:20:41 kernel: [5032630.764972] ? __kthread_bind_mask+0x60/0x60
Feb 10 00:20:41 kernel: [5032630.764974] ret_from_fork+0x1f/0x30
Feb 10 00:20:41 kernel: [5032630.764977] CPU: 1 PID: 2333 Comm: dp_sync_taskq Tainted: P OE 5.10.0-9-amd64 #1 Debian 5.10.70-1
Feb 10 00:20:41 kernel: [5032630.764979] Hardware name: ASUS System Product Name/ROG STRIX B460-I GAMING, BIOS 0211 03/13/2020
Feb 10 00:20:41 kernel: [5032630.764980] Call Trace:
Feb 10 00:20:41 kernel: [5032630.764982] dump_stack+0x6b/0x83
Feb 10 00:20:41 kernel: [5032630.764986] spl_panic+0xd4/0xfc [spl]
Feb 10 00:20:41 kernel: [5032630.764989] ? __wake_up_common+0x80/0x180
Feb 10 00:20:41 kernel: [5032630.765010] ? dbuf_sync_leaf+0x44f/0x590 [zfs]
Feb 10 00:20:41 kernel: [5032630.765039] zpl_get_file_info+0x9b/0x220 [zfs]
Feb 10 00:20:41 kernel: [5032630.765057] dmu_objset_userquota_get_ids+0x11d/0x490 [zfs]
Feb 10 00:20:41 kernel: [5032630.765077] dnode_sync+0x11a/0xa30 [zfs]
Feb 10 00:20:41 kernel: [5032630.765079] ? __switch_to_asm+0x42/0x70
Feb 10 00:20:41 kernel: [5032630.765081] ? __switch_to+0x114/0x460
Feb 10 00:20:41 kernel: [5032630.765082] ? _cond_resched+0x16/0x40
Feb 10 00:20:41 kernel: [5032630.765100] sync_dnodes_task+0x58/0x130 [zfs]
Feb 10 00:20:41 kernel: [5032630.765104] taskq_thread+0x2da/0x520 [spl]
Feb 10 00:20:41 kernel: [5032630.765105] ? wake_up_q+0xa0/0xa0
Feb 10 00:20:41 kernel: [5032630.765109] ? taskq_thread_spawn+0x50/0x50 [spl]
Feb 10 00:20:41 kernel: [5032630.765110] kthread+0x11b/0x140
Feb 10 00:20:41 kernel: [5032630.765111] ? __kthread_bind_mask+0x60/0x60
Feb 10 00:20:41 kernel: [5032630.765113] ret_from_fork+0x1f/0x30
I found 2 issues with similar stack traces:
https://github.com/openzfs/zfs/issues/11433
https://github.com/openzfs/zfs/issues/12659
After the panic, sshd had many defunct processes with PPID 1 and I couldn't ssh into the backup server. The shutdown command did not work (thanks, systemd), so I pressed the power button. The shutdown command must always work.
root@backup ~ 02-10 08:40> shutdown
Failed to set wall message, ignoring: Connection timed out
Failed to call ScheduleShutdown in logind, no action will be taken: Connection timed out
root@backup ~ 02-10 08:41> echo $?
1
I think that in my case the panic happened because my Intel NUC, which runs `zfs send`, is sending garbage data. I fried the CPU and the RAM, and I get regular bit flips. This hardware is completely unreliable and I see checksum errors in the zfs datasets.
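A quick sanity check on the bit-flip theory: the VERIFY3 line prints the magic value it read (2576474) next to the one it expected (3100762). The Python sketch below XORs the two and lists the bit positions that differ. `flipped_bits` is just a helper name I made up for this check, not anything from ZFS.

```python
def flipped_bits(observed, expected):
    """Return the bit positions (0 = least significant) where two integers differ."""
    diff = observed ^ expected
    return [bit for bit in range(diff.bit_length()) if (diff >> bit) & 1]


# Values copied from the VERIFY3 line in the kernel log above.
observed = 2576474  # sa.sa_magic as read during the receive
expected = 3100762  # SA_MAGIC

print(hex(observed), hex(expected))      # 0x27505a 0x2f505a
print(flipped_bits(observed, expected))  # [19]
```

The two values differ in exactly one bit (bit 19). That is consistent with a single bit flip, although it does not prove where the flip happened or that nothing else is involved.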
hashtags: #zfs