Revision / Modified: Jan. 28, 2002
Author: Tom Berger
Original documents:
http://www.mandrakeuser.org/docs/admin/arecov.html
http://www.mandrakeuser.org/docs/admin/arecov2.html
http://www.mandrakeuser.org/docs/admin/arecov3.html
http://www.mandrakeuser.org/docs/admin/arecov4.html
Very often less experienced Linux users counter system problems with the
troubleshooting technique they've learned on other operating systems: they
reinstall the whole operating system.
This of course not only costs a massive amount of time, it also
prevents them from feeling really secure with the system, all the more since these
users usually do not keep backups.
There are very few compelling reasons to ever reinstall Linux, like massive file system corruption or a hard drive failure. You can repair everything else either from within the system or from outside (via a network connection or a boot disk/CD). In contrast to other operating systems all configuration files are in plain text and can be edited with the most simple text editor. Furthermore you can reinstall, upgrade or downgrade every part of the system, since Linux only needs a minimal set of files to provide the basic functions of an operating system.
Being able to repair a system is what makes you an administrator. It is arguably the most important step from being controlled by your system to taking control of it.
There is nothing wrong with using graphical tools to configure and
administer your system. These tools often make it easier to handle complex
tasks and to get things working.
In a case of emergency, however, you will most likely not be able to access
them. That's why Linux administration requires at least some rudimentary
knowledge of using the 'traditional' command line Unix tools.
The most basic rescue tool, like it or not, is the text editor. In most cases, and on the Mandrake Linux rescue system, that's the 'vi' editor, a minimal version of 'vim'.
You open a file with 'vi' thus:
vi file
If the file isn't in the current directory, you have to put the path to that file in front of the file name:
vi /some/path/file
'vi' now displays the file. To edit the displayed file, press the
<i> key. You can then move around with the cursor keys and insert and
delete text with the usual keys.
To save an edited file, press the <ESC> key and then press
<Z> twice. Notice that 'vi' makes a difference between
capital and small letters, so make sure you press a capital 'Z' twice
here.
To quit editing a file without saving the changes you made to it, press these
keys one after the other <ESC> <:> <q> <!>
<ENTER>.
Of course there's much more to 'vi' and 'vim'. Even if you do not adopt it as your favorite editor, you should get comfortable with using its basic functions, since this training will be of use to you some day. You might want to read Vikram Vaswani's Vi 101 for an entertaining introduction into the Vi editor and print out Krissy J's Vi short command reference.
Notice:
The 'vi' command does not work on Mandrake Linux when the 'vim-enhanced'
package is installed and you are working on a partition which only contains
the root directory with no other partitions mounted.
Use vim-minimal instead or run
# update-alternatives --config vi
to set the name of the executed binary back to '/bin/vim-minimal'.
Although the current Mandrake Linux rescue system mounts your Linux partitions automatically when started, there might be circumstances when you have to mount or unmount partitions or external media by hand, e.g. when you are using an older release, when your machine doesn't have a bootable CD drive, or when you need to do a file system check or need access to external media.
Mounting is discussed in depth in another article in this section, but here are the basics.
In order to mount a medium, you need to know its device file name. If it is a partition on a hard disk, it's pretty easy to find out:
fdisk -l /dev/device
device stands for the device file name of the hard disk. In most cases, it will be 'hda', the first IDE hard disk on the first IDE channel:
fdisk -l /dev/hda
This will list all partitions on that disk along with their device names.
The second hard disk ('slave' on the first IDE channel) would be 'hdb', the
first disk on the second channel 'hdc' etc. Notice that if you've got your
hard disk connected to a UDMA-100 controller (on-board or card), the name of
the first hard disk is 'hde'.
For more device file names read the article linked to above.
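The IDE naming scheme described above can be written down as a small lookup. This is only a sketch: 'ide_device_name' and 'ide_partition' are hypothetical helper names, not standard tools, and the scheme covers the two on-board IDE channels only (add-on controllers such as the UDMA-100 case start at 'hde').

```shell
#!/bin/sh
# Hypothetical helpers illustrating the traditional IDE device names.

# ide_device_name CHANNEL POSITION
# CHANNEL is 1 or 2, POSITION is "master" or "slave".
ide_device_name() {
    case "$1:$2" in
        1:master) echo hda ;;
        1:slave)  echo hdb ;;
        2:master) echo hdc ;;
        2:slave)  echo hdd ;;
        *) echo "usage: ide_device_name 1|2 master|slave" >&2; return 1 ;;
    esac
}

# ide_partition CHANNEL POSITION NUMBER
# Partition numbers are appended to the disk name, so the second
# primary partition on the first disk is /dev/hda2.
ide_partition() {
    disk=$(ide_device_name "$1" "$2") || return 1
    echo "/dev/${disk}$3"
}
```

Calling 'ide_partition 1 master 2' prints the device file you would hand to 'mount' or 'fsck' for the second partition on the first disk.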
To mount a medium, you type
mount /dev/device mount_directory
mount_directory can be any existing directory on the current medium, preferably an empty one. To unmount a medium, a simple
umount mount_directory
suffices.
'fsck' is a utility for performing file system checks and repairs. You start a file system check this way:
fsck -t file_system device_file
file_system has to be substituted by the file system on the
partition you want to check. In contrast to 'mount', 'fsck' can't figure out
the file system type on its own.
Instead of using the '-t file_system' option, you can also call the
file system specific variants of 'fsck' directly: 'fsck.ext2', 'fsck.jfs',
'fsck.xfs' and 'reiserfsck'. This is actually the preferred way, since it
eliminates the chance of getting options to 'fsck' and 'fsck.fs'
mixed up.
Thus, to check the second primary partition on the first IDE hard drive which
has the file system 'ext2':
fsck.ext2 /dev/hda2
Notice:
By default, a file system check will be run interactively, that is, every time the checker encounters an error, it asks you whether to fix it. If you find this annoying, you can turn off these questions with the '-a' option in all checkers (although 'fsck.ext2' prefers '-p').
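Since calling the file system specific checker directly is the preferred way, it can help to keep the mapping from file system type to checker name in one place. The following is a minimal sketch covering only the four checkers named above; 'fsck_command_for' and 'print_fsck_command' are hypothetical names, and the second function only prints the command so you can read it before running it.

```shell
#!/bin/sh
# Hypothetical helpers: pick the file system specific checker by type.

# fsck_command_for FILE_SYSTEM
fsck_command_for() {
    case "$1" in
        ext2)     echo fsck.ext2 ;;
        jfs)      echo fsck.jfs ;;
        xfs)      echo fsck.xfs ;;
        reiserfs) echo reiserfsck ;;
        *) echo "no checker known for file system: $1" >&2; return 1 ;;
    esac
}

# print_fsck_command FILE_SYSTEM DEVICE_FILE
# Prints the command line instead of executing it.
print_fsck_command() {
    checker=$(fsck_command_for "$1") || return 1
    echo "$checker $2"
}
```

For example, 'print_fsck_command ext2 /dev/hda2' prints the check command for an 'ext2' file system on the second partition of the first IDE disk. Remember that a partition should not be mounted read-write while it is being checked.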
Make yourself familiar with these commands: 'mv', 'cp', 'rm', 'ls', 'cd', 'grep' and 'less'. This doesn't mean you should learn their man pages by heart, but you should know what they do and how to handle them. You will need them some day, believe me ;-).
Mandrake Linux comes with a rescue system on the first CD (list of contents), introduced in release 7.1. In
case your CD-ROM drive isn't bootable, boot from a boot floppy (images are in
the '/images' directory of the first CD).
To start this rescue system, press the <F1> key and type
rescue on the prompt at the bottom of the screen. Press the
<ENTER> key. The rescue system will now boot from the CD,
loading itself into system memory (at least 32 MB RAM needed).
Upon booting, the rescue system will automatically try to mount any
available Linux hard disk partitions, which can then be accessed via the
'/mnt' directory.
The rescue system needs only system memory to work, which means you can remove the CD
after boot (e.g. to use the drive to mount another CD).
The software contained in the rescue system allows you to
You see there's hardly an accident you can't fix with this system, provided
you know how these tools work. And there's the catch: the rescue system does
not contain any form of documentation, apart from the '--help' option which
just displays an overview of the command's syntax, if at all.
You are not supposed to learn the options of these programs by heart, instead
get a good and short command reference, like Hekman's 'Linux in a Nutshell' or
Petron's 'Essential Reference'. If you can't spare the money, print out the
man pages of the most important commands. There's also an online man page repository with a search
interface.
Notice:
'failsafe' is a standard boot option in all Mandrake Linux systems.
Under normal circumstances, the system switches right into the preferred runlevel during boot ('3' for
console, '5' for X). 'failsafe' on the other hand first boots into runlevel 1
(Single User Mode, see below), then tries to switch to runlevel 3 (console)
and then, if 5 is the default runlevel, into runlevel 5.
If the 'Linuxconf' administration suite is installed, it will be started in
console mode upon reaching runlevel 1. You will be presented with a runlevel
menu or the possibility to use 'Linuxconf' to do system maintenance tasks.
Linux also provides two built-in rescue systems. One of them is the 'single user mode', aka runlevel 1. This 'single user' is 'root'. There will only be a minimum of processes running.
There are several ways to get into this runlevel:
From within a running system (as 'root'): init 1. Notice that this command will shut down almost everything on your machine. It's also a popular way to simulate a reboot.
From the prompt of a boot loader: linux single or linux init 1.
You might also be dropped off here when using the 'failsafe'
boot option if the system can't go to runlevel 3.
There's no login required.
The single user mode still relies on a working 'init'. But what if 'init' is corrupted or even missing? If you boot your system with this boot loader option:
linux init=/bin/sh
only the kernel will be loaded into system memory and you will be dropped almost immediately to a shell.
On a Mandrake Linux 8.1 system, you should add another option to turn off devfsd, because otherwise you will run into trouble with hardware related utilities:
linux init=/bin/sh devfs=nomount
Things you do not have initially in this shell:
The first thing you should try is getting write access to your '/' partition:
mount -o remount,rw /dev/device
Run mount to find out the name of device. Another file system you want to mount is the virtual 'proc' file system, which provides you and the system utilities with information about what's going on in your system:
mount /proc /proc -t proc
From here on you should be able to do your repair tasks. Your main objective should be getting init to work again, so that you can do further repairs in single user mode.
Before leaving this shell, flush all buffers with
sync
unmount all mounts with
umount -a
and remount the '/' mount read-only again with
mount -o remount,ro /
Press <ALT> <CTRL> <DEL> simultaneously to leave the shell and reboot the machine.
Notice:
There's quite a number of Linux distributions which run off a removable medium (floppy, CD, ZIP) or a Windows partition. CD based distributions often offer the added advantage of providing a graphical interface.
You'll find a fairly up-to-date list of those Linux distributions on The LWN.net Linux Distribution List.
Things to keep in mind when using a third party rescue distribution:
Make sure it works. If you've downloaded a CD image, run
md5sum name.iso
and compare the resulting number with the one provided on the server you
downloaded the image from.
Floppy images are less prone to transmission CRC errors but media failures are
much more likely. Having put the image onto a floppy, run
cmp /dev/fd0 name.img
to make sure the image on the floppy and the image you've downloaded are
identical.
Boot your new rescue system to check if everything's OK.
Check if the distribution is actively maintained. Linux is a fast moving target. If you can't mount your hard disk partition because of an out-of-date file system driver on your rescue system, you're back to square one.
Have a look at the included software. If your partitions are formatted with a less common file system like XFS or JFS, it might happen that the distribution does not contain the necessary utilities. Use the content list of the Mandrake Linux rescue system as a template.
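Comparing checksums by eye is error-prone, so the 'md5sum' check mentioned above can be left to the shell. This is a small sketch; 'verify_md5' is a hypothetical helper name, and the expected checksum is whatever the download server publishes for your image.

```shell
#!/bin/sh
# Hypothetical helper: compare a file's MD5 checksum with a published one.

# verify_md5 FILE EXPECTED_CHECKSUM
verify_md5() {
    file=$1
    expected=$2
    if [ ! -r "$file" ]; then
        echo "cannot read $file" >&2
        return 2
    fi
    # md5sum prints "<checksum>  <file name>"; keep only the checksum.
    actual=$(md5sum "$file" | cut -d ' ' -f 1)
    if [ "$actual" = "$expected" ]; then
        echo "OK: $file"
    else
        echo "MISMATCH: $file has checksum $actual" >&2
        return 1
    fi
}
```

You would call it as 'verify_md5 name.iso checksum', pasting the checksum from the server. A mismatch means the download is damaged and should be repeated before you burn the image.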
The next two pages of this article will list some common (and some less common) emergency scenarios and describe how to handle them.
Usually this error is due to a simple boot loader misconfiguration. Your main priority is getting the system to boot again so that you can adapt your boot loader configuration.
If you have a current, working boot disk for your system, you are lucky
;-). If not, I'd suggest you create one right away. You can do that very
easily via the Mandrake Control Center (Boot - Boot Disk).
If you prefer the command line:
mkbootdisk $(uname -r)
will do the same.
Boot with it to make sure it works.
If you are faced with a boot loader failure without having a boot floppy at hand, you have to start one of the external systems, preferably the Mandrake Linux rescue system, described on the previous page of this article and repair the boot loader configuration from outside (or at least create a working boot floppy).
When you are changing the configuration of the LiLo boot loader by editing
'/etc/lilo.conf', you have to run 'lilo' afterward. But it has to
be the 'lilo' on the hard drive, because you want to update the boot
sector on that device. How to do that?
Simple. Enter the '/mnt' directory, where the 'root' directory of your disk
system is mounted. Now change the 'root' directory with
chroot .
What does this do? When you are on the rescue system, your 'root' directory is that on the CD, with the system on the disk mounted to '/mnt'. With 'chroot' you basically switch your root directory to that on the disk. If you issue a command now, the disk version of this command will be executed, not the CD version. Execute
/sbin/lilo
and a new boot sector with the current configuration will be written to the master boot record. For GRUB, you'd likely execute something like
grub-install /dev/hda
although the device name might be different depending on your hardware setup. To switch your root directory back to the CD, type:
exit
If you can't fix your booting problems and 'cfdisk' as well as 'fdisk' tell you that there just isn't any partition table to read on your hard disk, chances are that table has been corrupted.
If the botched up table is not on the hard disk which contains your Linux installation, install the gpart partition table rescue utility and run it on the disk with the defunct boot record:
gpart /dev/device
where device is a whole disk device (e.g. hda or sda). This is just a scan to find out if 'gpart' can find any partitions at all (it usually does). Notice that this test can take quite a while and use up a considerable amount of system resources. If the findings of gpart look reasonable to you, tell it to write them to the boot record:
gpart -W /dev/device
Do not turn the computer off or kill the program until it is finished writing the table. 'gpart' may sometimes look as if it were hanging, but it isn't. Just wait. When finished, reboot.
If the partition table of the system disk has become unreadable, start from the Mandrake Linux rescue system. It contains an (undocumented) utility called 'rescuept' which ... well, I guess you can tell that by its name ;-). The first step is just like with 'gpart':
rescuept /dev/device
This will print 'rescuept's findings to the console. If these findings look reasonable, pipe them to another disk utility, 'sfdisk', which will write them to the boot sector of the hard disk:
rescuept /dev/device | sfdisk /dev/device
You want to make absolutely sure here that you use the same device name in both parts of the pipe ... When finished, knock on wood and reboot.
Now that is really a rare emergency. Of all the scenarios listed in this article, this is probably the only one which hasn't happened to me so far in my six years with Linux ;-).
The super block is the first block of each ext2 partition. It contains important data about the file system like size, free space etc (it's similar to the File Allocation Table on FAT partitions). A partition with a damaged super block can't be mounted. Fortunately ext2 keeps several super block backup copies.
Boot your preferred emergency system.
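The backup copies are located at the beginning of the block groups of the file system. With a block size of 1 KB, a block group usually spans 8192 blocks, so the first backup copy is in block No. 8193 (with 2 KB blocks it is block No. 16384, with 4 KB blocks block No. 32768).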
To restore the super block from this copy, enter the command
e2fsck -b 8193 /dev/device
If that copy is damaged, too, try a later one, e.g. block No. 24577 on a file system with 1 KB blocks.
Reboot.
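The number passed to 'e2fsck -b' is a block number, and it depends on the block size of the file system. The arithmetic can be sketched as follows, assuming the default layout in which a block group holds eight times as many blocks as a block has bytes; 'first_backup_superblock' is a hypothetical helper name. File systems created with non-default settings will have their copies elsewhere.

```shell
#!/bin/sh
# Hypothetical helper: where does ext2 keep its first super block backup?
# Assumes the default number of blocks per group (8 * block size).

# first_backup_superblock BLOCK_SIZE_IN_BYTES
first_backup_superblock() {
    block_size=$1
    # One bitmap block has block_size * 8 bits, one bit per block,
    # which gives the default size of a block group.
    blocks_per_group=$((block_size * 8))
    # With 1 KB blocks, block 0 is reserved for the boot sector and the
    # file system starts at block 1. With larger blocks it starts at 0.
    if [ "$block_size" -eq 1024 ]; then
        first_data_block=1
    else
        first_data_block=0
    fi
    # The first backup sits at the start of the second block group.
    echo $((first_data_block + blocks_per_group))
}
```

If you do not know the block size, simply try the results for 1024, 2048 and 4096 one after the other with 'e2fsck -b'.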
There are several critical steps where booting can fail.
If this happens after a kernel upgrade, either a wrong boot loader configuration or misplaced symlinks in '/boot' are to blame. Boot another kernel or a rescue system and perform the steps outlined in the Kernel Upgrade article.
If the system hangs during 'Rebuilding RPM database' or 'Finding module
dependencies', just hit <CTRL> <c> simultaneously. This
will skip this step and continue to boot.
Issue rpm --rebuilddb as 'root' if the hang was at 'Rebuilding RPM
database'.
If your machine hangs at 'Finding module dependencies', you have most likely
been through a kernel upgrade from source but haven't done it properly. Check
if the files in '/boot' and the '/lib/modules' directory match the current
kernel-version (i.e. have the current version number attached). Read the
article on Upgrading The Kernel From Source for more.
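Checking that '/boot' and '/lib/modules' match a kernel version is easy to script. The sketch below assumes the kernel image is called 'vmlinuz-VERSION', which is a common convention but not guaranteed for a kernel you built yourself; 'check_kernel_files' is a hypothetical helper name. The optional second argument is a mount point such as '/mnt' for use from a rescue system.

```shell
#!/bin/sh
# Hypothetical helper: do kernel image and module directory exist
# for a given kernel version?

# check_kernel_files VERSION [ROOT]
check_kernel_files() {
    version=$1
    root=${2:-}
    status=0
    # Assumption: the kernel image is installed as /boot/vmlinuz-VERSION.
    for path in "$root/boot/vmlinuz-$version" "$root/lib/modules/$version"; do
        if [ -e "$path" ]; then
            echo "found:   $path"
        else
            echo "missing: $path"
            status=1
        fi
    done
    return $status
}
```

On the running system, 'check_kernel_files "$(uname -r)"' checks the kernel currently in use. From the rescue system you have to type the version number yourself and add '/mnt' as second argument.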
The system tries to load a RAM disk for a different kernel. Your boot
loader configuration file points to a wrong or non-existent RAM disk (option
'initrd='). Boot another entry from your boot loader and create a
RAM disk for your new kernel with 'mkinitrd' or use the 'Boot Config' module
from the Mandrake Control Center, which automatically generates 'initrd'
images and corresponding entries in the configuration file of your boot
loader.
If you don't have another working entry to boot, use an external rescue
system. See scenario I.
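To find a boot loader entry which points to a non-existent RAM disk, you can let the shell compare the 'initrd=' lines with the files on disk. This is a rough sketch for LiLo style configuration files only (GRUB uses a different syntax); 'check_initrd_entries' is a hypothetical helper name, and the optional second argument is the mount point of the disk system when working from a rescue system.

```shell
#!/bin/sh
# Hypothetical helper: list the initrd= entries of a LiLo style
# configuration file and report whether the referenced files exist.

# check_initrd_entries CONFIG_FILE [ROOT]
check_initrd_entries() {
    config=$1
    root=${2:-}
    status=0
    # Extract the value after "initrd=", ignoring blanks and quotes.
    for image in $(sed -n 's/^[[:space:]]*initrd[[:space:]]*=[[:space:]]*"\{0,1\}\([^"[:space:]]*\).*/\1/p' "$config"); do
        if [ -f "$root$image" ]; then
            echo "ok:      $image"
        else
            echo "missing: $image"
            status=1
        fi
    done
    return $status
}
```

From the rescue system, 'check_initrd_entries /mnt/etc/lilo.conf /mnt' would be a typical call. Every line marked 'missing' names a RAM disk you have to create or an entry you have to correct.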
The kernel tries to mount the 'root' partition but either doesn't find the
necessary drivers or doesn't find the root partition.
If drivers necessary to access the root file system are built as kernel
modules, these modules must be loaded via an 'init RAM disk'
('initrd'), referenced in your boot loader configuration file. Notice that
access to non-ext2 filesystems like Reiserfs, XFS or JFS also requires modules
and thus a RAM disk. See previous entry.
If the kernel can't find the root partition, check your boot loader configuration, especially the 'root' option.
If the system encounters a medium which hasn't been properly unmounted, it will run a routine file system check (fsck) or, if you use a journaling file system (default in Mandrake Linux 8), a journal recovery during the next mount of that medium.
If the file system does not feature a journal, 'fsck' will check it for consistency and delete or move empty or inconsistent data. You will find that data later in the 'lost+found' directory of the fsck'ed partition.
'fsck' will fix most errors by itself. If it comes to deleting data, however, 'fsck' will quit and you will be dropped to a root shell. Run 'fsck' again by hand on the device, where the automatic 'fsck' failed
e2fsck /dev/device
This will start 'fsck' in interactive mode and you will be prompted for each action 'fsck' wants to make. If you are not a file system guru, you might be better off to let 'fsck' do what it thinks is best:
e2fsck -y /dev/device
The '-y' option assumes the answer 'yes' to all questions. The '-p' option, in
contrast, only does those repairs which are safe to do without asking and quits
when it encounters anything more serious.
When the check and repair is over, hit CTRL-D to leave the emergency console.
The system will reboot.
The first thing you should do when the system has rebooted is to back up all
important data to an external medium immediately. Have a look at the
'lost+found' directories on your system. These might contain '#' files, which
have been moved there to improve the consistency of the file system. This
means that these files can be important system configuration files.
Before panicking, make sure that you have not just fallen victim to a typing error: check if 'capslock' is on, try different capitalization, try to login on another account or terminal (switch with ALT-F2) etc.
A failed login might be due to a wrong / corrupt entry in either '/etc/passwd', '/etc/shadow' or '/etc/securetty', wrong file permissions or a forgotten 'root' password.
In order to get into the system, reboot and type
linux init 1
at the LiLo boot prompt. If you are using GNU GRUB, hit the 'e' key twice and add
init 1
to the boot command and then ENTER and 'b' to boot.
This will boot the system into single user mode.
By default Linux keeps backups of '/etc/passwd' and '/etc/shadow', called '/etc/passwd-' and '/etc/shadow-'. Your first line of rescue is using these backup files.
Back up the current 'shadow' and 'passwd' files:
cp /etc/shadow /etc/shadow.old
cp /etc/passwd /etc/passwd.old
Now overwrite them with their system backups:
cp /etc/passwd- /etc/passwd
cp /etc/shadow- /etc/shadow
Try to switch to runlevel three to see if you can log into the system now:
init 3
Once the system is up, type
vi /etc/passwd
Have a look at this file. It mustn't contain any blank lines, comments or non-ASCII characters. If you find them, delete them. The entry for 'root' must look exactly like this:
root:x:0:0:root:/root:/bin/bash
If it does not, change it and save the file.
Run
chmod 644 /etc/passwd
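If '/etc/passwd' is long, the checks described above are tedious to do by eye. The following sketch automates them; 'check_passwd' is a hypothetical helper name, and it only reports problems, it does not change the file.

```shell
#!/bin/sh
# Hypothetical helper: look for the problems in a passwd file which
# the article warns about. Reports only, changes nothing.

# check_passwd [FILE]
check_passwd() {
    file=${1:-/etc/passwd}
    status=0
    # Blank lines and comment lines, printed with their line numbers.
    if grep -n -e '^[[:space:]]*$' -e '^[[:space:]]*#' "$file"; then
        echo "blank or comment lines found (listed above)"
        status=1
    fi
    # In the C locale every byte outside printable ASCII is "not printable".
    if LC_ALL=C grep -n '[^[:print:]]' "$file"; then
        echo "non-ASCII or control characters found (listed above)"
        status=1
    fi
    # The root entry must match this line exactly.
    if ! grep -q -x 'root:x:0:0:root:/root:/bin/bash' "$file"; then
        echo "root entry is not root:x:0:0:root:/root:/bin/bash"
        status=1
    fi
    return $status
}
```

Run 'check_passwd' without arguments to check '/etc/passwd', then fix the reported lines with 'vi'.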
Next, run vi /etc/shadow.
The format of entries in '/etc/shadow' is
account_name:password:other stuff
e.g.
root:$1$KODLGetc:10979:0:99999:7:::
The password entry is encrypted, of course.
Delete the password entry for 'root' by moving the cursor to the first character
of the password (usually the first '$') and typing dt: which deletes everything
up to the next colon. Now type :wq to save the file (:wq! if 'vi' reports the
file as read-only).
Also have a look at '/etc/securetty' (more /etc/securetty), which should contain these entries:
tty1
tty2
tty3
tty4
tty5
tty6
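A quick way to verify these entries without reading the whole file is sketched below; 'check_securetty' is a hypothetical helper name. It only tests for the six console entries listed above and ignores any additional lines.

```shell
#!/bin/sh
# Hypothetical helper: are tty1 to tty6 listed in the securetty file?

# check_securetty [FILE]
check_securetty() {
    file=${1:-/etc/securetty}
    status=0
    for n in 1 2 3 4 5 6; do
        # -x matches whole lines only, so "tty1" does not match "tty10".
        if ! grep -q -x "tty$n" "$file"; then
            echo "missing: tty$n"
            status=1
        fi
    done
    return $status
}
```

If the function prints nothing, all six entries are present.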
Other things to check include having a look at '/var/log/messages', which might reveal something about the nature of your problems with logging in, and checking the ownership and permissions (ls -al) of '/root/.bash_profile', '/root/.bashrc' and '/etc/gettydefs'. All these files must belong to 'root' and must be readable and writable by 'root'.
Reboot with init 6
On the next login, type root for the account name and just hit <ENTER> at the 'password' prompt.
Once you are logged in, type passwd to give 'root' a new password.
If you still can't get into your system, there's something deeply mysterious going on. This might be one of the few cases where reinstalling might solve the problem.
If you have configured your machine to boot directly into graphics mode,
configuration problems with your X server can prevent you from logging
in.
Press <CTRL> <ALT> <F2> to log into another
console, kill the process which
tries to start the X server and perform the troubleshooting steps outlined in
the article on X Setup Troubles.
Alternatively, reboot and use the boot loader option
linux init 3
to boot to the console while X isn't working.
Silent interruptions, commonly called 'hang' or 'freeze', are usually caused by some problem of the operating system with the hardware it is running on (bad memory chips, driver bugs, IRQ conflicts etc). These interruptions usually do not leave a trace in the system's log files in '/var/log' and require either a software update or a hardware change.
Your main task in such a situation is to prevent further damage, e.g. the file system corruption caused by just turning the computer off. The 'Magic SysRq Key' feature comes in handy here.
This feature allows you to do some basic maintenance tasks even if the rest of the system isn't responding. It is enabled by default on Mandrake Linux. In particular, it allows you to shut down your system properly, thus avoiding the risk of file system corruption when simply turning the machine off with media still being mounted.
The 'SysRq' sequence involves pressing three keys at once, the left ALT key, the 'SysRq' key (also labeled 'PrtSc' or 'F13') and a letter key:
<ALT> <SysRq> <r> puts the keyboard in 'raw'
mode.
This might be helpful in cases where the graphical interface does not respond
to keyboard or mouse commands any more. Having pressed that sequence, press
<ALT> <CTRL> <BACKSPACE> simultaneously. This
will try to kill the X server and drop you onto the console (i.e. it's the
emergency key combination to switch from runlevel 5 to runlevel 3).
<ALT> <SysRq> <s> attempts to write all unsaved data to disk ('sync' the disk) to prevent file corruption.
<ALT> <SysRq> <e> sends a termination signal to all processes, except for 'init'.
<ALT> <SysRq> <i> sends a kill signal to all processes, except for init, thus terminating all processes which ignored the termination signal.
<ALT> <SysRq> <u> remounts all mounted file systems read-only. This prevents file system corruption.
<ALT> <SysRq> <b> reboots the system. Alternatively, replace the 'b' with an 'o' to turn the machine off.
If you look at this sequence, you see that you are - apart from the first
step - actually emulating the 'init' shutdown process. Therefore it is
important that you press these sequences in the correct order (e.g. that you
'sync' the drives before remounting them): Raw -
Sync - tErm - kIll -
Umount - reBoot. A possible mnemonic phrase:
'Raising Skinny Elephants Is Utterly Boring'. Mandrake Linux user Louis
suggested this phrase, which is a bit more on topic: 'Remembering the Sequence
Entirely Is Useful Buddy'.
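The key sequences only work if the feature is switched on in the running kernel, so it is worth checking before you need it. The kernel exposes the setting in '/proc/sys/kernel/sysrq'. The sketch below reads it; 'sysrq_status' is a hypothetical helper name, and the interpretation of values above 1 as a bitmask applies to newer kernels.

```shell
#!/bin/sh
# Hypothetical helper: report whether the Magic SysRq key is enabled.

# sysrq_status [FILE]
sysrq_status() {
    file=${1:-/proc/sys/kernel/sysrq}
    if [ ! -r "$file" ]; then
        echo "unknown (cannot read $file)"
        return 2
    fi
    value=$(cat "$file")
    case "$value" in
        0) echo "disabled" ;;
        1) echo "enabled" ;;
        # Newer kernels use larger values as a bitmask of allowed functions.
        *) echo "partly enabled (bitmask $value)" ;;
    esac
}
```

If the answer is 'disabled', 'root' can switch the feature on for the running system with 'echo 1 > /proc/sys/kernel/sysrq'.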
If your machine runs a telnet or SSH server, you should try to log into the frozen system from another machine. There are cases when just the graphical interface is frozen but the basic system and network services are still working.
Do yourself a favour and get the Recover undeletion utility (PPC Mandrake RPM, x86 Mandrake RPM), which makes file recovery a lot easier (it acts as a front end to the 'debugfs' tool). All you have to do is point it to the partition where that file was (as 'root'):
recover /dev/device
'recover' will ask you a series of questions to narrow down the deletion date as far as possible, thus minimizing the number of files you'll have to look through later. Notice that 'recover' does recover the content of a file, but not its name; therefore it is in your own best interest to provide as much data to it as possible.