Tuesday, October 13, 2015

Debugging kernel modules: from an Oops to a line number in the C source code.

Debugging kernel modules is not fun especially when you manage to get the kernel to completely hang. Here is how I have my test environment set up to capture the kernel oops, figure out which instruction caused the oops and find the equivalent line in my module C code.

The setup:
I use VirtualBox to run the test kernel and module. I currently have Debian testing running as the guest.
 
Add a virtual serial port to the guest machine and forward it to a raw file somewhere on your host machine.
Modify your default grub configuration to enable the serial console. Do this by modifying the GRUB_CMDLINE_LINUX= line in /etc/default/grub to be: 
GRUB_CMDLINE_LINUX="console=tty0 console=ttyS0,115200n8"


Save the /etc/default/grub file and run update-grub to write a new grub.cfg file with the serial console enabled.

Now CRASH your kernel. 

Open the file you have the serial port forwarded to. In my case, it's vbox_serial.txt in Documents as specified in the image. Locate the kernel Oops. This is how the beginning of mine looks.

[  628.584464] BUG: unable to handle kernel paging request at 0000007d00000097
[  628.600354] IP: [<ffffffffa00a819e>] populate_probe_log+0x19e/0x250 [tcpoutput_probe]
[  628.615789] PGD 0
[  628.615990] Oops: 0000 [#1] SMP
[  628.628324] Modules linked in: tcpoutput_probe(OE) sch_netem veth vboxsf(OE) autofs4 nfsv3 openvswitch gre vxlan libcrc32c rpcsec_gss_krb5 nfsv4 nfsd auth_rpcgss nfs_acl nfs snd_intel8x0 lockd snd_ac97_codec sunrpc ac97_bus crct10dif_pclmul snd_pcm crc32_pclmul ghash_clmulni_intel ppdev aesni_intel vboxguest(OE) vboxvideo(OE) aes_x86_64 fscache snd_timer drm lrw gf128mul snd glue_helper ablk_helper cryptd joydev serio_raw video parport_pc soundcore i2c_piix4 mac_hid lp parport hid_generic usbhid hid psmouse ahci e1000 libahci pata_acpi 

The second line shows the function where the crash happened and the offset of the instruction that caused it. populate_probe_log+0x19e/0x250 means populate_probe_log crashed and the responsible instruction is at offset 0x19e from the beginning of the function. The whole size of the function is 0x250 bytes.

Finding the C code responsible for the crash:
Step 1: Find the offset of the function in the module binary.
Use objdump to dump the binary as follows:
objdump <module.ko> -d
In my case, I ran objdump tcpoutput_probe.ko -d
From the output I could see that the function with the problem is at offset 0:
Step 2: Find the line causing the crash in the module source code.
Use addr2line tool as follows:
addr2line -e <module.ko> <instruction offset>
I used it as follows: addr2line -e tcpoutput_probe.ko 0x19e. This offset is simply the function offset plus the offset of the instruction that caused the crash (0 + 0x19e).
And this is how the output looks like!

My crash is caused by line 177 in the tcpoutput_probe.c.


Next, actually fixing the bug :)


Acknowledgements:
Thanks to Ken Platz for pointing out the addr2line tool.