Dan Mick's Little Shop of Hints

Wednesday June 15, 2005

Diagnosing kernel hangs/panics with kmdb and moddebug

If you experience hangs or panics during Solaris boot, whether it's during installation or after you've already installed, using the kernel debugger can be a big help in collecting the first set of "what happened" information.

The kernel debugger is named "kmdb" in Solaris 10 and later, and is invoked by supplying the '-k' switch in the kernel boot arguments. So a common request from a kernel engineer starting to examine a problem is "try booting with kmdb".

Sometimes it's useful to set a breakpoint to pause the kernel startup and examine something, or just to set a kernel variable to enable or disable a feature or turn on debugging output. If you use -k to invoke kmdb, but also supply the '-d' switch, the debugger will be entered before the kernel really starts to do anything of consequence, so that you can set kernel variables or breakpoints.

So "booting with the -kd flags" is the key to "booting under the kernel debugger". Now, how do we do that?

Kernel debugging with GRUB-boot systems

On modern Solaris and OpenSolaris systems, GRUB is used to boot; to enable the kernel debugger, you add -kd arguments to the "kernel" (or "kernel$") line in the GRUB menu entry. When presented with the GRUB menu, hit 'e' to edit the entry, highlight the kernel line, and hit 'e' again to edit it; add the -kd arguments just after the /platform/i86pc/kernel/$ISADIR/unix argument, so that it says

kernel$ /platform/i86pc/kernel/$ISADIR/unix -kd
and then hit 'b' to boot that edited menu entry. '-k' means "start the debugger"; '-d' means "immediately enter the debugger after loading the kernel". After some booting status, you'll see the kernel debugger announce itself and leave you at its prompt:

[0]>

(The number in square brackets is the CPU that is running the kernel debugger; that number might change for later entries into the debugger.)

Now we're in the kernel debugger

There are two good reasons to run under the kernel debugger:
  1. If we panic, the panic can be examined before reboot; you can get stack backtraces and get some idea of which section of code might be at fault.
  2. Now we can set kernel variables, set breakpoints, etc. to affect the kernel run.
Obviously, there's a lot you can do in a kernel debugger, and I'm only touching on it here, but here are a few good things to try:
  1. For investigating hangs: try turning on module debugging output. You can set the value of a kernel variable by using the '/W' command ("write a 32-bit value"). Here's how you set moddebug to 0x80000000, and then continue execution of the kernel:
    [0]> moddebug/W 80000000
    [0]> :c
    That will give you debug output for each kernel module that loads. (See /usr/include/sys/modctl.h, near the bottom, for moddebug flag information. I find 0x80000000 is the only one I ever really use.)
  2. To collect information about panics: when the kernel panics, it will drop into the debugger, and print some interesting information; however, usually the most interesting thing, first, is the stack backtrace; this shows, in reverse order, all the functions that were active at the time of panic. To generate a stack backtrace, use
    [0]> $c

    A few other very useful information commands during a panic show the last things the kernel printed onscreen and a summary of the state of the machine in panic.
  3. If you're running the kernel with the kernel debugger loaded, and you experience a hang, you may be able to break into the debugger to examine the system state; you can do this by pressing the <F1> and <A> keys at the same time (a sort of "F1-shifted-A" keypress). (On SPARC systems, this key sequence is <Stop>-<A>.) This should give you the same debugger prompt as above, although on a multi-CPU system the CPU number in the prompt may be something other than 0. Once in the kernel debugger, you can get a stack backtrace as above; you can also use ::switch to change the CPU and get stack backtraces on a different CPU, which might shed more light on the hang. For instance, if you break into the debugger on CPU 1, you could switch to CPU 0 with
    [1]> 0::switch
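
Putting the tips above together, a session might look roughly like the sketch below. The [0]> and [1]> prompts, moddebug/W, :c, $c, and ::switch are taken from the text above; moddebug/X (print a 32-bit value in hex, to check the write), ::msgbuf (print the kernel message buffer), and ::status (print a summary of the target's state) are standard mdb commands supplied here as assumptions about which commands fit the descriptions. Lines without a prompt are annotations, not input, and no output is shown because it varies by system.

```
Before the kernel starts (booted with -kd): enable module-load messages.
[0]> moddebug/W 80000000
[0]> moddebug/X
[0]> :c

After a panic drops into the debugger: collect the first facts.
[0]> $c
[0]> ::msgbuf
[0]> ::status

After breaking into a hang on CPU 1: look at another CPU's stack.
[1]> 0::switch
[0]> $c
```

If you'd rather not enter the debugger at every boot, the same moddebug value can probably be made persistent with a line such as "set moddebug = 0x80000000" in /etc/system.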

There's obviously a lot more you can do with the kernel debugger, but these small tips will sometimes help get from "I have no idea what to do" to "I have a few ideas to try that might let me continue to boot or install", which can make all the difference.

Posted Jun 15 2005, 04:26:17 PM PDT


Hi, thanks for this insight into kmdb. I have recently encountered an issue where I tried kmdb to solve it. It did, but not in the way I envisioned.

The problem concerns patch 118844-20 and its required patch 118344-05. I am running Solaris 10 x86 3/05 release inside VMWare workstation build 5.0.0.-13124.

Here are the symptoms: I did all patch installation in single-user mode using patchadd. After 118344-05 is installed, the reboot comes up just fine. After 118844-20 is installed, the reboot comes up, but very quickly a stacktrace of some sort is seen before the system auto-reboots again (and the BIOS screen is shown). The trace is not shown long enough to see exactly what is printed. The system now keeps on trying to boot, then going back to BIOS init, etc.

Attempting to see the trace, I boot with flags: 'b kmdb -d -s' (this was before I read Dan's kmdb article) but to my astonishment, the system booted just fine into single-user mode. The patches show as installed. I rebooted again, this time trying with flag 'b mkdb' only, and again I see no boot problems and I got into the Xserver just fine. Very strange. I will try once more with option 'b -kd' and see what happens. And I'll try out the kmdb commands when I can. Thanks again!

Posted by Gert-Jan Bartelds on November 25, 2005 at 03:35 AM PST #

It would be great if the kernel took an argument, say, -R (verbose Reconfiguration), which essentially sets moddebug to (0x)80000000 upon initialization so that we can see the modules loaded during reconfiguration and can pick out which module is suspect for a hang.

Posted by on March 01, 2006 at 07:45 AM PST #

I suppose that could sometimes be useful, although not often enough that it would seem to warrant a kernel option. Given that it's easy to do with kmdb or /etc/system, those seem sufficient. What might be nice is an interactive way to set variables: in essence, either a command-line switch or a prompt that lets you dynamically add one-time settings to /etc/system. We have that for properties; making one for kernel or module variable settings shouldn't be that hard.

Posted by Dan Mick on March 01, 2006 at 05:44 PM PST #

Eric Lowe has a good tip about using prom_debug here.

Posted by Dan Mick on October 17, 2006 at 05:39 PM PDT #

my system booted just fine into single-user mode.

Posted by itwik on April 15, 2007 at 08:13 PM PDT #
