Kernel Panics

From TMUG - The Triangle Macintosh Users Group


Presented by Stefan Jeglinski as the Main Program for the TMUG meeting on Sep 10 2007.



The Kernel Panic - what the heck is it?

The first time a computer user hears the term "kernel panic," they usually snicker: is it called that because of some humor on the part of programmers? Some hear it and surmise that it is a "colonel panic," with a picture in their mind that usually involves either an Army officer or a bucket of chicken. On OS X it isn't that colorful, but the unmistakable sign that you have experienced one in either Panther or Tiger will at least remind you that the operating system handles many languages:

MAC OS X kernel panic screen

If you could actually run Windows on a Mac, and Windows had a kernel panic, you'd see the blue screen of death below:

Windows blue screen of death - not exactly


As you can plainly see, a kernel panic on Windows is bigger than the same on OSX :-)

As a point of clarification, in the Windows world it isn't called a kernel panic. The blue screen of death is known as a stop error or bug check [1], though it may not be long before it starts being referred to as a kernel panic.

Peeking at the source code for a Kernel Panic

Parts of the Darwin operating system are considered by some to be "open source." Opinions vary on what this means to Apple, but it is certainly true that parts of the core OS have source code that is published.

In particular, the xnu part of the source code, which can be considered the main kernel, contains a routine called panic(...) in debug.c:

void
panic(const char *str, ...)
{
	va_list	listp;
	spl_t	s;
	thread_t thread;
	wait_queue_t wq;

	s = splhigh();
	disable_preemption();

#ifdef	__ppc__
	lastTrace = LLTraceSet(0);		/* Disable low-level tracing */
#endif

	thread = current_thread();		/* Get failing thread */
	wq = thread->wait_queue;		/* Save the old value */
	thread->wait_queue = 0;			/* Clear the wait so we do not get double panics when we try locks */

	if( logPanicDataToScreen )
		disableDebugOuput = FALSE;
		
	debug_mode = TRUE;

	/* panic_caller is initialized to 0.  If set, don't change it */
	if ( ! panic_caller )
		panic_caller = (unsigned long) __builtin_return_address(0);
	
restart:
	PANIC_LOCK();
	if (panicstr) {
		if (cpu_number() != paniccpu) {
			PANIC_UNLOCK();
			/*
			 * Wait until message has been printed to identify correct
			 * cpu that made the first panic.
			 */
			while (panicwait)
				continue;
			goto restart;
	    } else {
			nestedpanic +=1;
			PANIC_UNLOCK();
			Debugger("double panic");
			printf("double panic:  We are hanging here...\n");
			while(1);
			/* NOTREACHED */
		}
	}
	panicstr = str;
	paniccpu = cpu_number();
	panicwait = 1;

	PANIC_UNLOCK();
	kdb_printf("panic(cpu %d caller 0x%08X): ", (unsigned) paniccpu, panic_caller);
	va_start(listp, str);
	_doprnt(str, &listp, consdebug_putc, 0);
	va_end(listp);
	kdb_printf("\n");

	/*
	 * Release panicwait indicator so that other cpus may call Debugger().
	 */
	panicwait = 0;
	Debugger("panic");
	/*
	 * Release panicstr so that we can handle normally other panics.
	 */
	PANIC_LOCK();
	panicstr = (char *)0;
	PANIC_UNLOCK();
	thread->wait_queue = wq; 	/* Restore the wait queue */
	if (return_on_panic) {
		enable_preemption();
		splx(s);
		return;
	}
	kdb_printf("panic: We are hanging here...\n");
	while(1);
	/* NOTREACHED */
}
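The heart of the routine is the decision made right after the restart label: is this the first panic, a second panic on the same cpu, or a panic on a different cpu that has to wait its turn? Here is a minimal user-space sketch of just that decision. It is a model, not kernel code: the struct, the enum and the function name classify_panic are made up for illustration, and the locking, printing and hanging are left out entirely.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of the globals that panic() consults. */
struct panic_state {
    const char *panicstr;   /* message of the panic in progress, NULL if none */
    int         paniccpu;   /* cpu that raised the panic in progress */
    int         nestedpanic;/* count of panics raised while already panicking */
};

enum panic_outcome {
    FIRST_PANIC,        /* record the message and go on to print it */
    DOUBLE_PANIC,       /* same cpu panicked again: the real code hangs */
    WAIT_FOR_OTHER_CPU  /* another cpu got there first: the real code spins */
};

/* Mirrors the if/else that follows PANIC_LOCK() in the listing above. */
enum panic_outcome classify_panic(struct panic_state *st, int cpu,
                                  const char *msg)
{
    assert(st != NULL);

    if (st->panicstr != NULL) {
        if (cpu != st->paniccpu)
            return WAIT_FOR_OTHER_CPU;
        st->nestedpanic += 1;
        return DOUBLE_PANIC;
    }

    /* No panic in progress: this caller owns it. */
    st->panicstr = msg;
    st->paniccpu = cpu;
    return FIRST_PANIC;
}
```

Starting from an empty state, the first call on any cpu returns FIRST_PANIC; a later call on the same cpu returns DOUBLE_PANIC, and a call on any other cpu returns WAIT_FOR_OTHER_CPU until the state is cleared.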

A search of the xnu source code turns up some 1800 calls to the panic routine. For example, in trap.c:

 void unresolved_kernel_trap(int trapno, struct savearea *ssp, unsigned int dsisr, addr64_t dar, const char *message)
 {
   char *trap_name;
   extern void print_backtrace(struct savearea *);
   extern unsigned int debug_mode, disableDebugOuput;
   extern unsigned long panic_caller;

   ml_set_interrupts_enabled(FALSE);					/* Turn off interruptions */
   lastTrace = LLTraceSet(0);							/* Disable low-level tracing */

   if( logPanicDataToScreen) disableDebugOuput = FALSE;

   debug_mode++;
   if ((unsigned)trapno <= T_MAX)
     trap_name = trap_type[trapno / T_VECTOR_SIZE];
   else
     trap_name = "???? unrecognized exception";
   if (message == NULL)
     message = trap_name;

   kdb_printf("\n\nUnresolved kernel trap(cpu %d): %s DAR=0x%016llX PC=0x%016llX\n",
     cpu_number(), trap_name, dar, ssp->save_srr0);

   print_backtrace(ssp);

   panic_caller = (0xFFFF0000 | (trapno / T_VECTOR_SIZE) );
   draw_panic_dialog();
		
   if( panicDebugging )
     (void *)Call_Debugger(trapno, ssp);
   panic(message);
 }

The above example is nice in that it contains all the parts we expect to see: a message is passed in, the trap is classified, the message defaults to the trap name if none was supplied, the cpu is identified, the error is logged, the backtrace is printed, the panic dialog is drawn on screen, and finally the actual panic is issued, which (unless return_on_panic is set) terminates in an endless loop.

No, really - why does it happen?

A kernel panic happens when the code decides that the operating system cannot continue running. For this to happen, it must encounter an error condition that it cannot extract itself from. Being told to jump to an illegal memory location during the execution of the code is one example of how this might come about. It may come as a surprise that in this regard, Unix is not fault-tolerant: under certain conditions, it will not try to recover from an error - it will just die.

possible hardware causes

  1. Memory errors caused by bad RAM (chip defects or timing)
  2. Logic board defects (chips and/or connections)
  3. Hard drive failures
  4. Processor timing errors (upgrades)
  5. Excessive heat
  6. Misbehaving peripheral devices (USB/Firewire devices, hubs, PCI cards, hard drives)

possible software causes

  1. Kernel bugs (possible but unlikely)
  2. Hardware driver bugs (much more likely)
  3. Missing, corrupted, or permissions-broken critical system components
  4. User-level software (unlikely, because of protected memory, but...)


How to rid yourself of kernel panics.

Ridding yourself of kernel panics is a trial-and-error process that can leave you wondering if you really have fixed things. A "panic.log" file is typically written to /Library/Logs/. It may be written before the lockup, or the panic data may be saved to NVRAM and copied from there to the text file on the next reboot. This file may contain important clues as to the cause of the panic. For example:

 Kernel version:
Darwin Kernel Version 8.8.0: Fri Sep  8 17:18:57 PDT 2006; root:xnu-792.12.6.obj~1/RELEASE_PPC
panic(cpu 1 caller 0xFFFF0003): 0x300 - Data access
Latest stack backtrace for cpu 1:
     Backtrace:
        0x00095138 0x00095650 0x00026898 0x000A7E04 0x000AB780 
Proceeding back via exception chain:
  Exception state (sv=0x2CA1B500)
     PC=0x0007DED8; MSR=0x00009030; DAR=0x80018CD4; DSISR=0x40000000; LR=0x0007F1D4; R1=0x1C9FBAD0; XCP=0x0000000C (0x300 - Data access)
     Backtrace:
0x00048F3C 0x0007F1D4 0x002D7220 0x007C7154 0x007D7064 0x007CB7AC 
        0x002EB640 0x0008C59C 0x0002921C 0x000233F8 0x000ABAAC 0xFF203C4E 
     Kernel loadable modules in backtrace (with dependencies):
        com.apple.ATIRage128(4.0.4)@0x7c3000
           dependency: com.apple.iokit.IOPCIFamily(1.7)@0x461000
           dependency: com.apple.iokit.IOGraphicsFamily(1.4.1)@0x704000
           dependency: com.apple.iokit.IONDRVSupport(1.4.1)@0x728000
  Exception state (sv=0x31BA9280)
     PC=0x9000AB48; MSR=0x0000D030; DAR=0xE00F0000; DSISR=0x42000000; LR=0x9000AA9C; R1=0xBFFFE2B0; XCP=0x00000030 (0xC00 - System call)

Kernel loadable modules appear in reverse order; in the above example, whatever happened appears to have happened in conjunction with the ATI card in the computer.
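The "caller 0xFFFF0003" in the first line of that log is worth a second look, because it is not a real return address. In the unresolved_kernel_trap source shown earlier, panic_caller is set to 0xFFFF0000 | (trapno / T_VECTOR_SIZE), so the low 16 bits carry the trap index. The sketch below decodes it. Two things are assumptions here: that T_VECTOR_SIZE is 4 (inferred from the log itself, where XCP=0x0000000C is reported as 0x300 and XCP=0x00000030 as 0xC00), and that the reported vector is simply the index times 0x100. The function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: T_VECTOR_SIZE is 4 on PowerPC, inferred from the log above
 * (XCP=0x0C is reported as vector 0x300, XCP=0x30 as vector 0xC00). */
#define T_VECTOR_SIZE 4u

/* Returns 1 if the caller field was synthesized by the trap handler
 * (upper 16 bits all set) rather than being a real return address. */
int caller_is_trap_marker(uint32_t caller)
{
    return (caller & 0xFFFF0000u) == 0xFFFF0000u;
}

/* Trap index stored in the low 16 bits of a synthesized caller value. */
uint32_t trap_index_from_caller(uint32_t caller)
{
    assert(caller_is_trap_marker(caller));
    return caller & 0x0000FFFFu;
}

/* Trap index from the XCP field of an "Exception state" line. */
uint32_t trap_index_from_xcp(uint32_t xcp)
{
    return xcp / T_VECTOR_SIZE;
}

/* Vector offset as printed in the log (0x300, 0xC00, ...) for an index.
 * Assumption: vectors are spaced 0x100 apart. */
uint32_t vector_from_trap_index(uint32_t index)
{
    return index * 0x100u;
}
```

For the log above, the caller value 0xFFFF0003 gives trap index 3, which maps to vector 0x300, the same "Data access" exception that the XCP=0x0000000C field reports.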

Unresolved kernel trap(cpu 1): 0x300 - Data access DAR=0x00000000FFFFFFFF PC=0x00000000000565E0
Latest crash info for cpu 1:
  Exception state (sv=0x31143280)
     PC=0x000565E0; MSR=0x00009030; DAR=0xFFFFFFFF; DSISR=0x00200000; LR=0x00053F10; R1=0x1EBF3C70; XCP=0x0000000C (0x300 - Data access)
     Backtrace:
        0x000685A0 0x00053FB0 0x00058B24 0x00059128 0x00090DF0 0x00093C8C 
        backtrace terminated - frame not mapped or invalid: 0xF0130EB0
Proceeding back via exception chain:
  Exception state (sv=0x31143280)
     previously dumped as "Latest" state. skipping...
  Exception state (sv=0x2FD14C80)
     PC=0x00A98DAC; MSR=0x0000F030; DAR=0x07234000; DSISR=0x42000000; LR=0x00AD272C; R1=0xF0130EB0; XCP=0x0000000C (0x300 - Data access)
Kernel version:
Darwin Kernel Version 7.3.0:
Fri Mar  5 14:22:55 PST 2004; root:xnu/xnu-517.3.15.obj~4/RELEASE_PPC

The unresolved kernel trap (see the source code above) is a catchall for any trap that the kernel has no specific code to handle.

hardware checks and tests

  1. Replace RAM (test RAM using AppleCare Hardware CD, TechTool Pro, memtest)
  2. Repair/replace logic board
  3. Replace hard drive
  4. Reinstall original processor (upgrade scenario)
  5. Resolve heating issues
  6. Remove external devices (USB/Firewire drives/devices, USB hubs, PCI cards, mouse/keyboard)

software checks and tests

  1. Look for clues in system log in System Profiler or terminal: tail -f /var/log/system.log
  2. Look for clues in console log in System Profiler or terminal: tail -f /Library/Logs/Console/<user>/console.log
  3. Look for clues in /Library/Logs/panic.log
  4. Remove unused kernel extensions ("kexts") in /System/Library/Extensions/
  5. Boot single user mode, remove /System/Library/Extensions.kextcache and /System/Library/Extensions.mkext and reboot.
  6. Boot in safe mode and remove login items
  7. fsck, repair permissions, remove cache (and other) files using AppleJack
  8. If possible, isolate to user level (create new user and see if panics continue)
  9. Reinstall OS
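
When looking for clues in panic.log (step 3 above), the lines under "Kernel loadable modules in backtrace" are usually the quickest lead. As a rough sketch, assuming those lines always have the name(version)@address shape seen in the example log, a small helper can pick them out. The struct and the function name parse_kext_line are hypothetical, and other OS versions may format these lines differently.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* One module line from the "Kernel loadable modules" part of panic.log. */
struct kext_entry {
    char          name[128];     /* e.g. com.apple.ATIRage128 */
    char          version[32];   /* e.g. 4.0.4 */
    unsigned long load_address;  /* e.g. 0x7c3000 */
    int           is_dependency; /* 1 if the line began with "dependency: " */
};

/* Returns 1 and fills *out if the line looks like name(version)@address,
 * 0 for any other line (backtraces, register dumps, headings). */
int parse_kext_line(const char *line, struct kext_entry *out)
{
    static const char prefix[] = "dependency: ";

    assert(line != NULL && out != NULL);

    /* Skip the indentation used in the log. */
    while (*line == ' ' || *line == '\t')
        line++;

    out->is_dependency = 0;
    if (strncmp(line, prefix, sizeof prefix - 1) == 0) {
        out->is_dependency = 1;
        line += sizeof prefix - 1;
    }

    /* Name runs up to '(', version up to ')', then '@' and a hex address. */
    if (sscanf(line, "%127[^( \t](%31[^)])@%lx",
               out->name, out->version, &out->load_address) != 3)
        return 0;
    return 1;
}
```

Feeding each line of the log through parse_kext_line (read with fgets, for example) and printing the entries where is_dependency is 0 gives a short list of the drivers most directly involved; in the example above that list would contain only com.apple.ATIRage128.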

References not included above

  1. Amit Singh's OSX Internals
  2. Wikipedia's Kernel Panic Article
  3. Apple's Definition
  4. X Lab Article