Linux gets double-quick double-update to fix kernel Oops!

Linux has never suffered from the infamous BSoD, short for blue screen of death, the name given to the dreaded “something went terribly wrong” message associated with a Windows system crash.

Microsoft has tried many things over the years to shake that nickname “BSoD”, including changing the background colour used when crash messages appear, adding a super-sized sad-face emoticon to make the message feel more compassionate, displaying QR codes that you can snap with your phone to help you diagnose the problem, and not filling the screen with a technobabble list of kernel code objects that just happened to be loaded at the time.

(Those crash dump lists often led to anti-virus and threat-prevention software being blamed for every system crash, not because those products had anything to do with the crash, but because they generally loaded early on, so their names tended to show up at or near the top of the list of loaded modules, making them a convenient scapegoat.)

Even better, “BSoD” is no longer the everyday, throwaway pejorative term that it used to be, because Windows crashes a lot less often than it used to.

We’re not suggesting that Windows never crashes, or implying that it is now magically bug-free; merely noting that you generally don’t need the word BSoD as often as you used to.

Linux crash notifications

Of course, Linux has never had BSoDs, not even back when Windows seemed to have them all the time, but that’s not because Linux never crashes, or is magically bug-free.

It’s simply that Linux doesn’t BSoD (yes, the term can be used as an intransitive verb, as in “my laptop BSoDded halfway through an email”), because – in a delightful understatement – it suffers an oops, or if the oops is severe enough that the system can’t reliably stay up even with degraded performance, it panics.

(It’s also possible to configure a Linux kernel so that an oops always gets “promoted” to a panic, for environments where security considerations make it better to have a system that shuts down abruptly, albeit with some data not getting saved in time, than a system that ends up in an uncertain state that could lead to data leakage or data corruption.)

An oops typically produces console output something like this (we’ve provided source code below if you want to explore oopses and panics for yourself):

[12710.153112] oops init (level = 1)
[12710.153115] triggering oops via BUG()
[12710.153127] ------------[ cut here ]------------
[12710.153128] kernel BUG at /home/duck/Articles/linuxoops/oops.c:17!
[12710.153132] invalid opcode: 0000 [#1] PREEMPT SMP PTI
[12710.153748] CPU: 0 PID: 5531 Comm: insmod . . .
[12710.154322] Hardware name: XXXX
[12710.154940] RIP: 0010:oopsinit+0x3a/0xfc0 [oops]
[12710.155548] Code: . . . . .
[12710.156191] RSP: . . . EFLAGS: . . .
[12710.156849] RAX: . . . RBX: . . . RCX: . . .
[12710.157513] RDX: . . . RSI: . . . RDI: . . .
[12710.158171] RBP: . . . R08: . . . R09: . . .
[12710.158826] R10: . . . R11: . . . R12: . . .
[12710.159483] R13: . . . R14: . . . R15: . . .
[12710.160143] FS: . . . GS: . . . knlGS: . . . . . . . .
[12710.163474] Call Trace:
[12710.164129] <TASK>
[12710.164779] do_one_initcall+0x56/0x230
[12710.165424] do_init_module+0x4a/0x210
[12710.166050] __do_sys_finit_module+0x9e/0xf0
[12710.166711] do_syscall_64+0x37/0x90
[12710.167320] entry_SYSCALL_64_after_hwframe+0x72/0xdc
[12710.167958] RIP: 0033:0x7f6c28b15e39
[12710.168578] Code: . . . . .
[. . . . .
[12710.173349] </TASK>
[12710.174032] Modules linked in: . . . . .
[12710.180294] ---[ end trace 0000000000000000 ]---

Unfortunately, when kernel version 6.2.3 came out at the end of last week, two tiny changes quickly proved to be problematic, with users reporting kernel oopses when managing disk storage.

Kernel 6.1.16 was apparently subject to the same changes, and thus prone to the same oopsiness.

For example, plugging in a removable drive and mounting it worked fine, but unmounting the drive when you’d finished with it could cause an oops.

Although an oops doesn’t immediately freeze the whole computer, kernel-level code crashes when unmounting disk storage are worrisome enough that a well-informed user would probably want to shut down as soon as possible, in case of ongoing trouble leading to data corruption…

…but some users reported that the oops prevented what’s known in the jargon as an orderly shutdown, forcing them to cycle the power by holding down the power button for several seconds, or by temporarily cutting the mains supply to a server.

The good news is that kernels 6.2.4 and 6.1.17 were released promptly over the weekend to roll back the problematic changes.

Given the velocity of Linux kernel releases, those updates have already been followed by 6.2.5 and 6.1.18, which were themselves updated (today, 2023-03-13) by 6.2.6 and 6.1.19.

What to do?

If you are using a 6.x-version Linux kernel and you aren’t already bang up-to-date, make sure you don’t install 6.2.3 or 6.1.16 along the way.

If you’ve already got one of those versions (we had 6.2.3 for a couple of days and were unable to provoke a driver crash, presumably because our kernel configuration shielded us inadvertently from triggering the bug), consider updating as soon as you can…

…because even if you haven’t suffered any disk-volume-based trouble so far, you may merely be immune by good fortune, whereas by upgrading your kernel again you will become immune by design.

EXPLORING OOPS AND PANIC EVENTS ON YOUR OWN

You will need a kernel that you built from source code and that’s already installed and running on your test computer.

Create a directory, let’s call it /test/oops, and save this source code as oops.c:

#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/moduleparam.h>
#include <linux/init.h>

MODULE_LICENSE("GPL");

static int level = 0;
module_param(level,int,0660);

static int oopsinit(void) {
   printk("oops init (level = %d)\n",level);
   // level: 0->just load; 1->oops; 2->panic
   switch (level) {
      case 1:
         printk("triggering oops via BUG()\n");
         BUG();
         break;
      case 2:
         printk("forcing a full-on panic()\n");
         panic("oops module");
         break;
   }
   return 0;
}

static void oopsexit(void) {
   printk("oops exit\n");
}

module_init(oopsinit);
module_exit(oopsexit);

Create a file in the same directory called Kbuild to control the build parameters, like this:

EXTRA_CFLAGS = -Wall -g
obj-m = oops.o

Then build the module as shown below.

The -C option tells make where to start looking for Makefiles, thus pointing the build process at the right kernel source code tree, and the M= setting tells make where to find the actual module code to build on this occasion.

You must provide the full, absolute path for M=, so don’t try to save typing by using ./ (the current directory moves around during the build process):

/test/oops$ make -C /where/you/built/the/kernel M=/test/oops
CC [M] /home/duck/Articles/linuxoops/oops.o
MODPOST /home/duck/Articles/linuxoops/Module.symvers
CC [M] /home/duck/Articles/linuxoops/oops.mod.o
LD [M] /home/duck/Articles/linuxoops/oops.ko

You can load and unload the new oops.ko kernel module with the parameter level=0 just to check that it works.

Look in dmesg for a log of the init and exit calls:

/test/oops# insmod oops.ko level=0
/test/oops# rmmod oops
/test/oops# dmesg
. . .
[12690.998373] oops: loading out-of-tree module taints kernel.
[12690.999113] oops init (level = 0)
[12704.198814] oops exit

To provoke an oops (recoverable) or a panic (will hang your computer), use level=1 or level=2 respectively.

Don’t forget to save all your work before triggering either condition (you will need to reboot afterwards), and don’t do this on someone else’s computer without formal permission.
