Less known Solaris features: About crashes and cores - Part 1: Introduction

No software is without errors. This is a basic law of computer science. And when there is no bug in the software (by some strange kind of luck), your hardware has bugs. And when there are no bugs in the hardware, cosmic rays flip bits. Thus an operating system needs mechanisms to stop a process, or the complete kernel, at once. It must do so without allowing the system to write anything back to disk, which would make the corrupted state permanent. This tutorial covers the most important concepts surrounding the last signs of life of a system or an application.

A plea for the panic

The panic isn't the bug; it's the system's reaction to a bug. Many people think of panics as the result of instability, as something bad like the bogeyman. But panics and crash dumps are your friends. Whenever the system detects an inconsistency in its structures, it does the best it can to protect your data. And the best way to do this is to give the system a fresh start: don't try to modify the data on the disk, and write some status information to a special device (the dump device) to enable analysis of the problem. The concepts of panic and crash dump were developed to give the admin exactly such tools.

A good example: imagine a problem in UFS where a bit has flipped. The operating environment detects an inconsistency in the data structures. You can't keep working with this error, because it would alter the data on your disk unpredictably. You can't shut down the system with a normal reboot either, because flushing the disks would alter the data on them. The only way out is to stop everything and restart the system, ergo to panic the system and write a crash dump.

Furthermore, some people regard the analysis of core and crash dumps as an arcane art and see the dumps as a waste of disk space. But it's really easy to get some basic data and hints out of these large heaps of data.

Difference between Crash Dumps and Core Dumps

Many people use these words synonymously (“The system panicked and wrote a core dump”). Every now and then I do this as well. But it isn't correct, because the scope of the two dumps is quite different: