Rescue

Revision / Modified: Jan. 28, 2002
Author: Tom Berger

Original documents:
http://www.mandrakeuser.org/docs/admin/arecov.html
http://www.mandrakeuser.org/docs/admin/arecov2.html
http://www.mandrakeuser.org/docs/admin/arecov3.html
http://www.mandrakeuser.org/docs/admin/arecov4.html

Rescue vs Reinstalling

Very often, less experienced Linux users counter system problems with the troubleshooting technique they've learned on other operating systems: they reinstall the whole operating system.
This not only costs a massive amount of time, it also prevents them from ever feeling really secure with the system, all the more since these users usually do not keep backups.

There are very few compelling reasons to ever reinstall Linux, such as massive file system corruption or a hard drive failure. You can repair everything else either from within the system or from outside (via a network connection or a boot disk/CD). In contrast to other operating systems, all configuration files are in plain text and can be edited with the simplest text editor. Furthermore, you can reinstall, upgrade or downgrade every part of the system, since Linux only needs a minimal set of files to provide the basic functions of an operating system.

Being able to repair a system is what makes you an administrator. It is arguably the most important step from being controlled by your system to taking control of it.

Basic Rescue Tools

There is nothing wrong with using graphical tools to configure and administer your system. These tools often make it easier to handle complex tasks and to get things working.
In a case of emergency, however, you will most likely not be able to access them. That's why Linux administration requires at least some rudimentary knowledge of using the 'traditional' command line Unix tools.

vi

The most basic rescue tool, like it or not, is the text editor. In most cases, and on the Mandrake Linux rescue system, that's the 'vi' editor, a minimal version of 'vim'.

You open a file with 'vi' thus:

vi file

If the file isn't in the current directory, you have to put the path to that file in front of the file name:

vi /some/path/file

'vi' now displays the file. To edit the displayed file, press the <i> key. You can then move around with the cursor keys and insert and delete text with the usual keys.
To save an edited file and leave the editor, press the <ESC> key and then press <Z> twice. Notice that 'vi' distinguishes between capital and small letters, so make sure you press a capital 'Z' twice here.
To quit editing a file without saving the changes you made to it, press these keys one after the other <ESC> <:> <q> <!> <ENTER>.

Of course there's much more to 'vi' and 'vim'. Even if you do not adopt it as your favorite editor, you should get comfortable with using its basic functions, since this training will be of use to you some day. You might want to read Vikram Vaswani's Vi 101 for an entertaining introduction to the Vi editor and print out Krissy J's Vi short command reference.

mount

Although the current Mandrake Linux rescue system mounts your Linux partition automatically when started, there might be circumstances when you have to mount or unmount partitions or external media by hand, e.g. when you are using an older release, when your machine doesn't have a bootable CD drive, or when you need to do a file system check or need access to external media.

Mounting is discussed in depth in another article in this section, but here are the basics.

In order to mount a medium, you need to know its device file name. If it is a partition on a hard disk, it's pretty easy to find out:

fdisk -l /dev/device

device stands for the device file name of the hard disk. In most cases, it will be 'hda', the first IDE hard disk on the first IDE channel:

fdisk -l /dev/hda

This will list all partitions on that disk along with their device names. The second hard disk ('slave' on the first IDE channel) would be 'hdb', the first disk on the second channel 'hdc' etc. Notice that if you've got your hard disk connected to a UDMA-100 controller (on-board or card), the name of the first hard disk is 'hde'.
For more device file names read the article linked to above.
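
The naming scheme for the two standard IDE channels can be written down as a small shell helper. This is only a sketch: the function name 'ide_disk' is made up for illustration, and disks on additional controllers (such as the 'hde' case mentioned above) are deliberately not covered.

```shell
# ide_disk CHANNEL POSITION
# Print the classic IDE device name for a disk, given its channel
# (1 or 2) and its position on that channel (master or slave).
# Hypothetical helper, not part of any rescue system.
ide_disk() {
    case "$1:$2" in
        1:master) echo hda ;;
        1:slave)  echo hdb ;;
        2:master) echo hdc ;;
        2:slave)  echo hdd ;;
        *) echo "usage: ide_disk 1|2 master|slave" >&2; return 1 ;;
    esac
}

ide_disk 2 master   # first disk on the second channel
```

The printed name can then be used as the device argument, e.g. with 'fdisk -l /dev/hdc'.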

To mount a medium, you type

mount /dev/device mount_directory

mount_directory can be any existing directory on the current medium, preferably an empty one. To unmount a medium, a simple

umount mount_directory

suffices.
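
Since the mount directory should exist and preferably be empty, it can be worth checking that before mounting. The following is a minimal sketch; 'is_empty_dir' is a made-up helper name, and '/mnt/disk' in the comment is just an example path.

```shell
# is_empty_dir DIR
# Succeed only if DIR exists, is a directory and contains no entries
# (hidden ones included). Hypothetical helper for illustration.
is_empty_dir() {
    [ -d "$1" ] || return 1
    [ -z "$(ls -A "$1")" ]
}

# Possible use (as 'root'), with an example device and directory:
#   is_empty_dir /mnt/disk && mount /dev/hda1 /mnt/disk
```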

fsck

'fsck' is a utility for performing file system checks and repairs. You start a file system check this way:

fsck -t file_system device_file

file_system has to be substituted by the file system on the partition you want to check. In contrast to 'mount', 'fsck' can't figure out the file system type on its own.
Instead of using the '-t file_system' option, you can also call the file system specific variants of 'fsck' directly: 'fsck.ext2', 'fsck.jfs', 'fsck.xfs' and 'reiserfsck'. This is actually the preferred way, since it eliminates the chance of getting options to 'fsck' and 'fsck.fs' mixed up.
Thus, to check the second primary partition on the first IDE hard drive which has the file system 'ext2':

fsck.ext2 /dev/hda2

Notice:

By default, a file system check will be run interactively, that is, every time the checker encounters an error, it asks you whether to fix it. If you find this annoying, you can turn off these questions with the '-a' option in all checkers (although 'fsck.ext2' prefers '-p').
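
To avoid mixing up the checker names, the mapping from file system type to checker can be kept in a small helper. This sketch only knows the four variants named above; the function name 'checker_for' is an invention for this example, and the last line merely prints a command line instead of running it.

```shell
# checker_for FSTYPE
# Print the name of the file system specific checker for FSTYPE,
# using only the variants named in the text above.
checker_for() {
    case "$1" in
        ext2)     echo fsck.ext2 ;;
        jfs)      echo fsck.jfs ;;
        xfs)      echo fsck.xfs ;;
        reiserfs) echo reiserfsck ;;
        *) echo "no checker known for '$1'" >&2; return 1 ;;
    esac
}

# Example: build the command line for an ext2 partition (printed, not run).
echo "$(checker_for ext2) /dev/hda2"
```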

Others

Make yourself familiar with these commands: 'mv', 'cp', 'rm', 'ls', 'cd', 'grep' and 'less'. This doesn't mean you should learn their man pages by heart, but you should know what they do and how to handle them. You will need them some day, believe me ;-).

The Mandrake Linux Rescue System

Mandrake Linux comes with a rescue system on the first CD (list of contents), introduced in release 7.1. In case your CD-ROM drive isn't bootable, boot from a boot floppy (images are in the '/images' directory of the first CD).
To start this rescue system, press the <F1> key and type rescue on the prompt at the bottom of the screen. Press the <ENTER> key. The rescue system will now boot from the CD, loading itself into system memory (at least 32 MB RAM needed).

Upon booting, the rescue system will automatically try to mount any available Linux hard disk partitions, which can then be accessed via the '/mnt' directory.
The rescue needs only system memory to work, which means you can remove the CD after boot (e.g. to use the drive to mount another CD).

The software contained in the rescue system allows you to

You see there's hardly an accident you can't fix with this system, provided you know how these tools work. And there's the catch: the rescue system does not contain any form of documentation, apart from the '--help' option which just displays an overview of the command's syntax, if at all.
You are not supposed to learn the options of these programs by heart. Instead, get a good and short command reference, like Hekman's 'Linux in a Nutshell' or Petron's 'Essential Reference'. If you can't spare the money, print out the man pages to the most important commands. There's also an online man page repository with a search interface.

Booting 'failsafe'

'failsafe' is a standard boot option in all Mandrake Linux systems.

Under normal circumstances, the system switches right into the preferred runlevel during boot ('3' for console, '5' for X). 'failsafe' on the other hand first boots into runlevel 1 (Single User Mode, see below), then tries to switch to runlevel 3 (console) and then, if 5 is the default runlevel, into runlevel 5.
If the 'Linuxconf' administration suite is installed, it will be started in console mode upon reaching runlevel 1. You will be presented with a runlevel menu or the possibility to use 'Linuxconf' to do system maintenance tasks.

Single User Mode

Linux also provides two built-in rescue systems. One of them is the 'single user mode', aka runlevel 1. This 'single user' is 'root'. There will only be a minimum of processes running.

There are several ways to get into this runlevel:

The root Shell

The single user mode still relies on a working 'init'. But what if 'init' is corrupted or even missing? If you boot your system with this boot loader option:

linux init=/bin/sh

only the kernel will be loaded into system memory and you will be dropped almost immediately to a shell.

On a Mandrake Linux 8.1 system, you should add another option to turn off devfsd, because otherwise you will run into trouble with hardware related utilities:

linux init=/bin/sh devfs=nomount

Things you do not have initially in this shell:

The first thing you should try is getting write access to your '/' partition:

mount -o remount,rw /dev/device

Run mount to find out the name of device. Another file system you want to mount is the virtual 'proc' file system, which provides you and the system utilities with information about what's going on in your system:

mount /proc /proc -t proc

From here on you should be able to do your repair tasks. Your main objective should be getting init to work again, so that you can do further repairs in single user mode.

Before leaving this shell, flush all buffers with

sync

unmount all mounts with

umount -a

and remount the '/' mount read-only again with

mount -o remount,ro /

Press <ALT> <CTRL> <DEL> simultaneously to leave the shell and reboot the machine.

Linux Systems On Other Media

There's quite a number of Linux distributions which run off a removable medium (floppy, CD, ZIP) or a Windows partition. CD based distributions often offer the added advantage of providing a graphical interface.

You'll find a fairly updated list of those Linux distributions on The LWN.net Linux Distribution List.

Things to keep in mind when using a third party rescue distribution:

The next two pages of this article will list some common (and some less common) emergency scenarios and describe how to handle them.

Scenario I: System Doesn't Boot

Usually this error is due to a simple boot loader misconfiguration. Your main priority is getting the system to boot again so that you can adapt your boot loader configuration.

If you have a current, working boot disk for your system, you are lucky ;-). If not, I'd suggest you create one right away. You can do that very easily via the Mandrake Control Center (Boot - Boot Disk).
If you prefer the command line:

mkbootdisk $(uname -r)

will do the same.
Boot with it to make sure it works.

If you are faced with a boot loader failure without having a boot floppy at hand, you have to start one of the external systems, preferably the Mandrake Linux rescue system, described on the previous page of this article and repair the boot loader configuration from outside (or at least create a working boot floppy).

When you change the configuration of the LiLo boot loader by editing '/etc/lilo.conf', you have to run 'lilo' afterward. But it has to be the 'lilo' on the hard drive, because you want to update the boot sector on that device. How to do that?
Simple. Enter the '/mnt' directory, where the 'root' directory of your disk system is mounted. Now change the 'root' directory with

chroot .

What does this do? When you are on the rescue system, your 'root' directory is that on the CD, with the system on the disk mounted to '/mnt'. With 'chroot' you basically switch your root directory to that on the disk. If you issue a command now, the disk version of this command will be executed, not the CD version. Execute

/sbin/lilo

and a new boot sector with the current configuration will be written to the master boot record. For GRUB, you'd likely execute something like

grub-install /dev/hda

although the device name might be different depending on your hardware setup. To switch your root directory back to the CD, type:

exit
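
'chroot' can also be given a command to run inside the new root directory, which saves the separate 'exit' step. The sketch below wraps that in a dry-run helper so that, by default, the command is only printed. The names 'run', 'rewrite_lilo' and 'DRY_RUN' are made up for this example; set DRY_RUN=0 (as 'root', on the rescue system) to actually execute it.

```shell
# Print commands instead of running them unless DRY_RUN=0 is set.
DRY_RUN=${DRY_RUN:-1}
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# rewrite_lilo ROOT
# Run the installed system's own 'lilo' with ROOT as its root directory.
rewrite_lilo() {
    run chroot "$1" /sbin/lilo
}

rewrite_lilo /mnt
```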

Partition Table Destroyed Or Corrupted

If you can't fix your booting problems and 'cfdisk' as well as 'fdisk' tell you that there just isn't any partition table to read on your hard disk, chances are that table has been corrupted.

If the botched up table is not on the hard disk which contains your Linux installation, install the gpart partition table rescue utility and run it on the disk with the defunct boot record:

gpart /dev/device

where device is a whole disk device (e.g. hda or sda). This is just a scan to find out if 'gpart' can find any partitions at all (it usually does). Notice that this test can take quite a while and use up a considerable amount of system resources. If the findings of gpart look reasonable to you, tell it to write them to the boot record:

gpart -W /dev/device

Do not turn the computer off or kill the program until it has finished writing the table. 'gpart' may sometimes look as if it were hanging, but it isn't. Just wait. When it has finished, reboot.

If the partition table of the system disk has become unreadable, start from the Mandrake Linux rescue system. It contains an (undocumented) utility called 'rescuept' which ... well, I guess you can tell that by its name ;-). The first step is just like with 'gpart':

rescuept /dev/device

This will print 'rescuept's findings to the console. If these findings look reasonable, pipe them to another disk utility, 'sfdisk', which will write them to the boot sector of the hard disk:

rescuept /dev/device | sfdisk /dev/device

You want to make absolutely sure here that you use the same device name in both parts of the pipe ... When finished, knock on wood and reboot.
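
One way to rule out a mismatch between the two device names is to type the device only once and let the shell build the pipe. This is merely a sketch: 'restore_ptable_cmd' is a hypothetical helper that prints the command line for review and does not execute anything.

```shell
# restore_ptable_cmd DEVICE
# Print the 'rescuept | sfdisk' pipe for DEVICE, so that the same
# device name is guaranteed to appear on both sides of the pipe.
restore_ptable_cmd() {
    if [ $# -ne 1 ]; then
        echo "usage: restore_ptable_cmd /dev/device" >&2
        return 1
    fi
    echo "rescuept $1 | sfdisk $1"
}

restore_ptable_cmd /dev/hda
```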

Super Block Damaged

Now that is really a rare emergency. Of all the scenarios listed in this article, this is probably the only one which hasn't happened to me so far in my six years with Linux ;-).

The super block is the first block of each ext2 partition. It contains important data about the file system like size, free space etc (it's similar to the File Allocation Table on FAT partitions). A partition with a damaged super block can't be mounted. Fortunately ext2 keeps several super block backup copies.

  1. Boot your preferred emergency system.

  2. The backup copies are located at the beginning of each block group. On a file system with 1 KB blocks a group spans 8192 blocks, so the first backup copy is in block No. 8193.

  3. To restore the super block from this copy, enter the command

    e2fsck -b 8193 /dev/device

    If that block is damaged, too, try the next copy (block No. 16385 or, if the file system uses sparse super blocks, block No. 24577) etc.

  4. Reboot.
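
The block numbers passed to 'e2fsck -b' follow from simple arithmetic. The sketch below assumes the usual ext2 layout (8 times the block size in blocks per group, and a first data block of 1 on file systems with 1 KB blocks, 0 otherwise); 'backup_sb_block' is a made-up helper name. Which groups actually hold a copy depends on how the file system was created, so treat the results as candidates to try.

```shell
# backup_sb_block BLOCK_SIZE GROUP
# Print the block number at which block group GROUP starts on an ext2
# file system with the given block size in bytes. Assumes the usual
# layout: 8 * BLOCK_SIZE blocks per group, and a first data block of
# 1 for 1 KB blocks, 0 otherwise.
backup_sb_block() {
    blocks_per_group=$(( 8 * $1 ))
    if [ "$1" -eq 1024 ]; then first=1; else first=0; fi
    echo $(( $2 * blocks_per_group + first ))
}

backup_sb_block 1024 1   # prints 8193
backup_sb_block 4096 1   # prints 32768
```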

Scenario II: System Stops During Boot

There are several critical steps where booting can fail.

Kernel Doesn't Load Properly

If this happens after a kernel upgrade, either a wrong boot loader configuration or misplaced symlinks in '/boot' are to blame. Boot another kernel or a rescue system and perform the steps outlined in the Kernel Upgrade article.

Boot Hangs On Rebuilding RPM database Or Finding Module Dependencies

If the system hangs during 'Rebuilding RPM database' or 'Finding module dependencies', just hit <CTRL> <c> simultaneously. This will skip this step and continue to boot.
Issue rpm --rebuilddb as 'root' if the hang was at 'Rebuilding RPM database'.
If your machine hangs at 'Finding module dependencies', you have most likely been through a kernel upgrade from source but haven't done it properly. Check if the files in '/boot' and the '/lib/modules' directory match the current kernel-version (i.e. have the current version number attached). Read the article on Upgrading The Kernel From Source for more.
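
A quick check of whether the files for a given kernel version are in place might look like the sketch below. The helper name 'kernel_files_ok' is invented, and the 'vmlinuz-VERSION' file name is an assumption about how the kernel image is named in '/boot'; adapt it to your setup.

```shell
# kernel_files_ok VERSION [ROOT]
# Succeed if a module directory and a kernel image for VERSION exist
# below ROOT (default: /). The 'vmlinuz-VERSION' file name is an
# assumption about how the kernel image is named in '/boot'.
kernel_files_ok() {
    root=${2:-}
    [ -d "$root/lib/modules/$1" ] && [ -f "$root/boot/vmlinuz-$1" ]
}

# Check the files of the currently running kernel.
if ! kernel_files_ok "$(uname -r)"; then
    echo "files for kernel $(uname -r) look incomplete" >&2
fi
```

The optional second argument allows the same check from a rescue system, e.g. with '/mnt' as the root of the installed system.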

Boot Hangs On RAMDISK: Compressed image found at block 0

The system tries to load a RAM disk for a different kernel. Your boot loader configuration file points to a wrong or non-existent RAM disk (option initrd=). Boot another entry from your boot loader and create a RAM disk for your new kernel with 'mkinitrd' or use the 'Boot Config' module from the Mandrake Control Center, which automatically generates 'initrd' images and corresponding entries in the configuration file of your boot loader.
If you don't have another working entry to boot, use an external rescue system. See scenario I.
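
For LiLo, dangling 'initrd=' entries can be spotted with a few lines of shell. This is a rough sketch that assumes unquoted, absolute paths in the configuration file; 'missing_initrds' is a hypothetical helper name.

```shell
# missing_initrds CONF [ROOT]
# Print every file named in an 'initrd=' line of the LiLo configuration
# file CONF that does not exist below ROOT (default: /).
missing_initrds() {
    root=${2:-}
    sed -n 's/^[[:space:]]*initrd[[:space:]]*=[[:space:]]*//p' "$1" |
    while read -r image; do
        [ -f "$root$image" ] || echo "$image"
    done
}

# Possible use from the rescue system, with the disk mounted on /mnt:
#   missing_initrds /mnt/etc/lilo.conf /mnt
```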

Boot Hangs On Kernel panic: VFS: Unable to mount root fs on xx:yy

The kernel tries to mount the 'root' partition but either doesn't find the necessary drivers or doesn't find the root partition.
If drivers necessary to access the root file system are built as kernel modules, these modules must be loaded via an 'init RAM disk' ('initrd'), referenced in your boot loader configuration file. Notice that access to non-ext2 filesystems like Reiserfs, XFS or JFS also requires modules and thus a RAM disk. See previous entry.

If the kernel can't find the root partition, check your boot loader configuration, especially the 'root' option.

File System Check Fails

If the system encounters a medium which hasn't been properly unmounted, it will run a routine file system check (fsck) or, if you use a journaling file system (default in Mandrake Linux 8), a journal recovery during the next mount of that medium.

If the file system does not feature a journal, 'fsck' will check it for consistency and delete or move empty or inconsistent data. You will find that data later in the 'lost+found' directory of the fsck'ed partition.

'fsck' will fix most errors by itself. If it comes to deleting data, however, 'fsck' will quit and you will be dropped to a root shell. Run 'fsck' again by hand on the device where the automatic 'fsck' failed:

e2fsck /dev/device

This will start 'fsck' in interactive mode and you will be prompted for each action 'fsck' wants to make. If you are not a file system guru, you might be better off letting 'fsck' do what it thinks is best:

e2fsck -y /dev/device

The '-y' option assumes the answer 'yes' to all questions. The '-p' option instead tells 'e2fsck' to do all repairs it considers safe without asking. Use one or the other, since 'e2fsck' does not accept both at once.
When the check and repair is over, hit CTRL-D to leave the emergency console. The system will reboot.
The first thing you should do when the system has rebooted is to back up all important data to an external medium. Have a look at the 'lost+found' directories on your system. These might contain '#' files, which have been moved there to improve the consistency of the file system. These files may well be important system configuration files.
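
To see what a check has left behind, you can list the '#' entries of a 'lost+found' directory. A minimal sketch follows; 'list_recovered' is a made-up name, and reading 'lost+found' usually requires 'root' privileges.

```shell
# list_recovered MOUNT_POINT
# Print the '#' entries that a file system check left in the
# 'lost+found' directory of the file system mounted at MOUNT_POINT.
list_recovered() {
    [ -d "$1/lost+found" ] || return 0
    find "$1/lost+found" -name '#*' -print
}

# Possible use: list_recovered /home
```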

Scenario III: Login Fails

Before panicking, make sure that you have not just fallen victim to a typing error: check if 'capslock' is on, try different capitalization, try to login on another account or terminal (switch with ALT-F2) etc.

A failed login might be due to a wrong / corrupt entry in either '/etc/passwd', '/etc/shadow' or '/etc/securetty', wrong file permissions or a forgotten 'root' password.

  1. In order to get into the system, reboot and type

    linux init 1

    at the LiLo boot prompt. If you are using GNU GRUB, hit the 'e' key twice and add

    init 1

    to the boot command and then ENTER and 'b' to boot.
    This will boot the system into single user mode.

  2. By default Linux keeps backups of '/etc/passwd' and '/etc/shadow', called '/etc/passwd-' and '/etc/shadow-'. Your first line of rescue is using these backup files.

    1. Backup the current 'shadow' and 'passwd' files:

      cp /etc/shadow /etc/shadow.old
      cp /etc/passwd /etc/passwd.old

    2. Now overwrite them with their system backups:

      cp /etc/passwd- /etc/passwd
      cp /etc/shadow- /etc/shadow

    3. Try to switch to runlevel three to see if you can log into the system now:

      init 3

  3. If this approach doesn't work, reboot into runlevel 1 again (press <ALT> <CTRL> <DEL> simultaneously to reboot).
  4. Once the system is up, type

    vi /etc/passwd

    Have a look at this file. It mustn't contain any blank lines, comments or non-ASCII characters. If you find them, delete them. The entry for 'root' must look exactly like this:

    root:x:0:0:root:/root:/bin/bash

    If it does not, change it and save the file.
    Run

    chmod 644 /etc/passwd

  5. Next, run vi /etc/shadow.

    The format of entries in '/etc/shadow' is

    account_name:password:other stuff

    e.g.

    root:$1$KODLGetc:10979:0:99999:7:::

    The password entry is encrypted, of course.
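    Delete the password entry for 'root' by moving the cursor to the first character of the password (usually the first '$') and typing dt: (that is 'd', 't' and a colon, which deletes everything up to, but not including, the next colon). Now type :wq to save the file.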

  6. Also have a look at '/etc/securetty' (more /etc/securetty), which should contain these entries:

    tty1
    tty2
    tty3
    tty4
    tty5
    tty6

  7. Other things to check include having a look at '/var/log/messages', which might reveal something about the nature of your problems with logging in, and checking the ownership and permissions (ls -al) of '/root/.bash_profile', '/root/.bashrc' and '/etc/gettydefs'. All these files must belong to 'root' and must be readable and writable by 'root'.

  8. Reboot with init 6

  9. On the next login, type root for the account name and just hit <ENTER> at the 'password' prompt.

  10. Once you are logged in, type passwd to give 'root' a new password.

If you still can't get into your system, there's something deeply mysterious going on. This might be one of the few cases where reinstalling might solve the problem.
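
The visual check of '/etc/passwd' described in the steps above can be supported by a small filter. The following sketch prints every line that is blank, starts with '#' or contains a character outside printable ASCII; 'passwd_suspects' is a hypothetical helper name, not a standard tool.

```shell
# passwd_suspects FILE
# Print (with line numbers) every line of FILE that is blank, starts
# with '#', or contains a character outside printable ASCII.
passwd_suspects() {
    LC_ALL=C awk '/^[ \t]*$/ || /^#/ || /[^ -~]/ { print NR ": " $0 }' "$1"
}

# Possible use: passwd_suspects /etc/passwd
```

No output means that none of these three problems was found. It does not prove that the individual entries are correct.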

Scenario IV: System Hangs On Loading X

If you have configured your machine to boot directly into graphics mode, configuration problems with your X server can prevent you from logging in.
Press <CTRL> <ALT> <F2> to log into another console, kill the process which tries to start the X server and perform the troubleshooting steps outlined in the article on X Setup Troubles.
Alternatively, reboot and use the boot loader option

linux init 3

to boot to the console while X isn't working.

Scenario V: System Freeze

Silent interruptions, commonly called 'hang' or 'freeze', are usually caused by some problem of the operating system with the hardware it is running on (bad memory chips, driver bugs, IRQ conflicts etc). These interruptions usually do not leave a trace in the system's log files in '/var/log' and require either a software update or a hardware change.

Your main task in such a situation is to prevent further damage, e.g. the file system corruption that can result from just turning the computer off. The 'Magic SysRq Key' feature comes in handy here.

The Magic SysRq Key

This feature allows you to do some basic maintenance tasks even if the rest of the system isn't responding. It is enabled by default on Mandrake Linux. In particular, it allows you to shutdown your system properly, thus avoiding the risk of file system corruption when simply turning the machine off with media still being mounted.

The 'SysRq' sequence involves pressing three keys at once, the left ALT key, the 'SysRq' key (also labeled 'PrtSc' or 'F13') and a letter key:

  1. <ALT> <SysRq> <r> puts the keyboard in 'raw' mode.
    This might be helpful in cases where the graphical interface does not respond to keyboard or mouse commands any more. Having pressed that sequence, press <ALT> <CTRL> <BACKSPACE> simultaneously. This will try to kill the X server and drop you onto the console (i.e. it's the emergency key combination to switch from runlevel 5 to runlevel 3).

  2. <ALT> <SysRq> <s> attempts to write all unsaved data to disk ('sync' the disk) to prevent file corruption.

  3. <ALT> <SysRq> <e> sends a termination signal to all processes, except for 'init'.

  4. <ALT> <SysRq> <i> sends a kill signal to all processes, except for init, thus terminating all processes which ignored the termination signal.

  5. <ALT> <SysRq> <u> remounts all mounted file systems read-only. This prevents file system corruption.

  6. <ALT> <SysRq> <b> reboots the system. Alternatively, replace the 'b' with an 'o' to turn the machine off.

If you look at this sequence, you see that you are - apart from the first step - actually emulating the 'init' shutdown process. Therefore it is important that you press these sequences in the correct order (e.g. that you 'sync' the drives before remounting them): Raw - Sync - tErm - kIll - Umount - reBoot. A possible mnemonic phrase: 'Raising Skinny Elephants Is Utterly Boring'. Mandrake Linux user Louis suggested this phrase, which is a bit more on topic: 'Remembering the Sequence Entirely Is Useful Buddy'.
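
It is worth checking that the feature is actually switched on before you need it. On kernels built with SysRq support, the switch is exposed in '/proc/sys/kernel/sysrq'; the sketch below treats any non-zero value as enabled. The function name 'sysrq_enabled' is made up for this example.

```shell
# sysrq_enabled [FILE]
# Succeed if the kernel's SysRq switch, read from FILE (default:
# /proc/sys/kernel/sysrq), holds a non-zero value.
sysrq_enabled() {
    value=$(cat "${1:-/proc/sys/kernel/sysrq}" 2>/dev/null) || return 1
    [ "${value:-0}" -ne 0 ]
}

if sysrq_enabled; then
    echo "Magic SysRq is enabled"
else
    echo "Magic SysRq is disabled or not available"
fi
```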

Via A Network

If your machine runs a telnet or SSH server, you should try to log into the frozen system from another machine. There are cases when just the graphical interface is frozen but the basic system and network services are still working.

Scenario VI: Important Files Deleted

Do yourself a favour and get the Recover undeletion utility ( PPC Mandrake RPM, x86 Mandrake RPM), which makes file recovery a lot easier (it acts as a front end to the 'debugfs' tool). All you have to do is point it to the partition where that file was (as 'root'):

recover /dev/device

'recover' will ask you a series of questions to narrow down the most likely deletion date, thus minimizing the number of files you'll have to look through later. Notice that 'recover' does recover the content of a file, but not its name. It is therefore in your own best interest to provide as much data to it as possible.


Legal: This text is covered by the GNU Free Documentation License. Standard disclaimers of warranty apply. Copyright LSTB (Tom Berger) and Mandrakesoft 1999-2002.