ZFS kernel panic

date: 2022-02-10 17:25:08

categories: linux

firstPublishDate: 2022-02-10 17:25:08

Last night my backup server panicked during a `zfs recv` command. I'm running Debian bullseye:

Linux 5.10.0-11-amd64 #1 SMP Debian 5.10.92-1 (2022-01-18) x86_64 GNU/Linux
Source: zfs-linux
Version: 2.0.3-9

Feb 10 00:20:41 kernel: [5032630.764473] VERIFY3(sa.sa_magic == SA_MAGIC) failed (2576474 == 3100762)
Feb 10 00:20:41 kernel: [5032630.764474] VERIFY3(sa.sa_magic == SA_MAGIC) failed (2576474 == 3100762)
Feb 10 00:20:41 kernel: [5032630.764476] PANIC at zfs_quota.c:89:zpl_get_file_info()
Feb 10 00:20:41 kernel: [5032630.764480] PANIC at zfs_quota.c:89:zpl_get_file_info()
Feb 10 00:20:41 kernel: [5032630.764482] Showing stack for process 2333
Feb 10 00:20:41 kernel: [5032630.764482] Showing stack for process 2327
Feb 10 00:20:41 kernel: [5032630.764484] CPU: 3 PID: 2327 Comm: dp_sync_taskq Tainted: P           OE     5.10.0-9-amd64 #1 Debian 5.10.70-1
Feb 10 00:20:41 kernel: [5032630.764485] Hardware name: ASUS System Product Name/ROG STRIX B460-I GAMING, BIOS 0211 03/13/2020
Feb 10 00:20:41 kernel: [5032630.764486] Call Trace:
Feb 10 00:20:41 kernel: [5032630.764496]  dump_stack+0x6b/0x83
Feb 10 00:20:41 kernel: [5032630.764508]  spl_panic+0xd4/0xfc [spl]
Feb 10 00:20:41 kernel: [5032630.764617]  ? dbuf_sync_leaf+0x44f/0x590 [zfs]
Feb 10 00:20:41 kernel: [5032630.764667]  zpl_get_file_info+0x9b/0x220 [zfs]
Feb 10 00:20:41 kernel: [5032630.764687]  dmu_objset_userquota_get_ids+0x11d/0x490 [zfs]
Feb 10 00:20:41 kernel: [5032630.764687] VERIFY3(sa.sa_magic == SA_MAGIC) failed (2576474 == 3100762)
Feb 10 00:20:41 kernel: [5032630.764688] PANIC at zfs_quota.c:89:zpl_get_file_info()
Feb 10 00:20:41 kernel: [5032630.764689] Showing stack for process 2328
Feb 10 00:20:41 kernel: [5032630.764710]  dnode_sync+0x11a/0xa30 [zfs]
Feb 10 00:20:41 kernel: [5032630.764715]  ? cpumask_next+0x17/0x20
Feb 10 00:20:41 kernel: [5032630.764717]  ? __update_idle_core+0x55/0xa0
Feb 10 00:20:41 kernel: [5032630.764721]  ? __switch_to_asm+0x42/0x70
Feb 10 00:20:41 kernel: [5032630.764724]  ? __switch_to+0x114/0x460
Feb 10 00:20:41 kernel: [5032630.764728]  ? _cond_resched+0x16/0x40
Feb 10 00:20:41 kernel: [5032630.764746]  sync_dnodes_task+0x58/0x130 [zfs]
Feb 10 00:20:41 kernel: [5032630.764754]  taskq_thread+0x2da/0x520 [spl]
Feb 10 00:20:41 kernel: [5032630.764757]  ? wake_up_q+0xa0/0xa0
Feb 10 00:20:41 kernel: [5032630.764761]  ? taskq_thread_spawn+0x50/0x50 [spl]
Feb 10 00:20:41 kernel: [5032630.764763]  kthread+0x11b/0x140
Feb 10 00:20:41 kernel: [5032630.764765]  ? __kthread_bind_mask+0x60/0x60
Feb 10 00:20:41 kernel: [5032630.764766]  ret_from_fork+0x1f/0x30
Feb 10 00:20:41 kernel: [5032630.764769] CPU: 4 PID: 2328 Comm: dp_sync_taskq Tainted: P           OE     5.10.0-9-amd64 #1 Debian 5.10.70-1
Feb 10 00:20:41 kernel: [5032630.764773] Hardware name: ASUS System Product Name/ROG STRIX B460-I GAMING, BIOS 0211 03/13/2020
Feb 10 00:20:41 kernel: [5032630.764774] Call Trace:
Feb 10 00:20:41 kernel: [5032630.764779]  dump_stack+0x6b/0x83
Feb 10 00:20:41 kernel: [5032630.764784]  spl_panic+0xd4/0xfc [spl]
Feb 10 00:20:41 kernel: [5032630.764818]  ? zio_free_sync+0xda/0xf0 [zfs]
Feb 10 00:20:41 kernel: [5032630.764821]  ? _cond_resched+0x16/0x40
Feb 10 00:20:41 kernel: [5032630.764825]  ? irq_exit_rcu+0x3e/0xc0
Feb 10 00:20:41 kernel: [5032630.764847]  ? dbuf_sync_leaf+0x44f/0x590 [zfs]
Feb 10 00:20:41 kernel: [5032630.764880]  zpl_get_file_info+0x9b/0x220 [zfs]
Feb 10 00:20:41 kernel: [5032630.764904]  dmu_objset_userquota_get_ids+0x11d/0x490 [zfs]
Feb 10 00:20:41 kernel: [5032630.764930]  dnode_sync+0x11a/0xa30 [zfs]
Feb 10 00:20:41 kernel: [5032630.764932]  ? __switch_to_asm+0x42/0x70
Feb 10 00:20:41 kernel: [5032630.764934]  ? __switch_to+0x114/0x460
Feb 10 00:20:41 kernel: [5032630.764935]  ? _cond_resched+0x16/0x40
Feb 10 00:20:41 kernel: [5032630.764959]  sync_dnodes_task+0x58/0x130 [zfs]
Feb 10 00:20:41 kernel: [5032630.764963]  taskq_thread+0x2da/0x520 [spl]
Feb 10 00:20:41 kernel: [5032630.764965]  ? wake_up_q+0xa0/0xa0
Feb 10 00:20:41 kernel: [5032630.764969]  ? taskq_thread_spawn+0x50/0x50 [spl]
Feb 10 00:20:41 kernel: [5032630.764971]  kthread+0x11b/0x140
Feb 10 00:20:41 kernel: [5032630.764972]  ? __kthread_bind_mask+0x60/0x60
Feb 10 00:20:41 kernel: [5032630.764974]  ret_from_fork+0x1f/0x30
Feb 10 00:20:41 kernel: [5032630.764977] CPU: 1 PID: 2333 Comm: dp_sync_taskq Tainted: P           OE     5.10.0-9-amd64 #1 Debian 5.10.70-1
Feb 10 00:20:41 kernel: [5032630.764979] Hardware name: ASUS System Product Name/ROG STRIX B460-I GAMING, BIOS 0211 03/13/2020
Feb 10 00:20:41 kernel: [5032630.764980] Call Trace:
Feb 10 00:20:41 kernel: [5032630.764982]  dump_stack+0x6b/0x83
Feb 10 00:20:41 kernel: [5032630.764986]  spl_panic+0xd4/0xfc [spl]
Feb 10 00:20:41 kernel: [5032630.764989]  ? __wake_up_common+0x80/0x180
Feb 10 00:20:41 kernel: [5032630.765010]  ? dbuf_sync_leaf+0x44f/0x590 [zfs]
Feb 10 00:20:41 kernel: [5032630.765039]  zpl_get_file_info+0x9b/0x220 [zfs]
Feb 10 00:20:41 kernel: [5032630.765057]  dmu_objset_userquota_get_ids+0x11d/0x490 [zfs]
Feb 10 00:20:41 kernel: [5032630.765077]  dnode_sync+0x11a/0xa30 [zfs]
Feb 10 00:20:41 kernel: [5032630.765079]  ? __switch_to_asm+0x42/0x70
Feb 10 00:20:41 kernel: [5032630.765081]  ? __switch_to+0x114/0x460
Feb 10 00:20:41 kernel: [5032630.765082]  ? _cond_resched+0x16/0x40
Feb 10 00:20:41 kernel: [5032630.765100]  sync_dnodes_task+0x58/0x130 [zfs]
Feb 10 00:20:41 kernel: [5032630.765104]  taskq_thread+0x2da/0x520 [spl]
Feb 10 00:20:41 kernel: [5032630.765105]  ? wake_up_q+0xa0/0xa0
Feb 10 00:20:41 kernel: [5032630.765109]  ? taskq_thread_spawn+0x50/0x50 [spl]
Feb 10 00:20:41 kernel: [5032630.765110]  kthread+0x11b/0x140
Feb 10 00:20:41 kernel: [5032630.765111]  ? __kthread_bind_mask+0x60/0x60
Feb 10 00:20:41 kernel: [5032630.765113]  ret_from_fork+0x1f/0x30

I found two issues with similar stack traces:

https://github.com/openzfs/zfs/issues/11433

https://github.com/openzfs/zfs/issues/12659

After the panic, sshd had lots of defunct processes with PPID 1 and I couldn't ssh into the backup server. The shutdown command didn't work either (thanks, systemd), so I had to press the power button. The shutdown command must always work.

root@backup ~ 02-10 08:40> shutdown
  Failed to set wall message, ignoring: Connection timed out
  Failed to call ScheduleShutdown in logind, no action will be taken: Connection timed out

root@backup ~ 02-10 08:41> echo $?
1

I think that in my case the panic happened because my Intel NUC, which runs `zfs send`, is sending garbage data. I fried its CPU and RAM, and now I get regular bit flips. This hardware is completely unreliable, and I see checksum errors in the ZFS datasets.
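The two numbers in the VERIFY3 message can be compared bit by bit to see whether they fit the bit-flip theory. The snippet below is only a rough sketch: the log line is copied from the trace above, while the `flipped_bits` helper and its regular expression are my own illustration, not anything that ships with ZFS.

```python
import re

# VERIFY3 line copied from the kernel log above.
LOG_LINE = "VERIFY3(sa.sa_magic == SA_MAGIC) failed (2576474 == 3100762)"


def flipped_bits(line):
    """Return (observed, expected, differing bit positions), or None if no match.

    Hypothetical helper for illustration only.
    """
    match = re.search(r"failed \((\d+) == (\d+)\)", line)
    if match is None:
        return None
    observed, expected = int(match.group(1)), int(match.group(2))
    # XOR leaves a 1 in every position where the two values disagree.
    diff = observed ^ expected
    bits = [i for i in range(diff.bit_length()) if (diff >> i) & 1]
    return observed, expected, bits


observed, expected, bits = flipped_bits(LOG_LINE)
print(f"observed=0x{observed:X} expected=0x{expected:X} differing bits={bits}")
# prints: observed=0x27505A expected=0x2F505A differing bits=[19]
```

The observed magic differs from the expected one in exactly one bit (bit 19). That is consistent with a single bit flip somewhere along the way, although it does not prove where the flip happened.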

hashtags: #zfs
