Muq Administration - Crash Recovery Mechanics

Crash Recovery

Once you have attempted diagnosis and convinced yourself that you've either resolved the problem or else have no clue what else to do, it is time to try getting your Muq server back on the air.

The server crash probably left a bunch of '*-RUNNING-*.db' files. These are useless: they are undoubtedly in a corrupt state. Delete them. (If you're feeling cautious, you might move them to an archive subdirectory somewhere, if you haven't already archived the entire directory. They might come in useful in later detective work.)
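
If you take the cautious route, the move can be sketched as a small POSIX sh function. This is only a sketch, not part of Muq: the helper name archive_running_files and both of its directory arguments are made up for illustration, and the only thing taken from the text above is the '*-RUNNING-*.db' pattern.

```shell
#!/bin/sh
# Sketch only: archive_running_files is a hypothetical helper, not a Muq tool.
# Usage: archive_running_files DB_DIR ARCHIVE_DIR
# Moves every '*-RUNNING-*.db' file in DB_DIR into ARCHIVE_DIR.
archive_running_files() {
    db_dir=$1
    archive_dir=$2
    mkdir -p "$archive_dir" || return 1
    for f in "$db_dir"/*-RUNNING-*.db; do
        # An unmatched glob stays literal, so skip names that do not exist.
        [ -e "$f" ] || continue
        mv -- "$f" "$archive_dir"/ || return 1
    done
}
```

Stop the server (if any part of it is still running) before calling the function, and pick an archive directory on a disk with enough free space.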

Nine times out of ten, you will now be able to restart the server without further incident. (Until the same problem repeats, whatever it was.)

Occasionally, the '*CURRENT*.db.*' files will also be corrupt (meaning the dying server managed to complete a backup before it went down for good). In that case, you'll want to delete (or archive) those files as well, and rename the most recent numeric backup set to be the '*CURRENT*.db.*' fileset.
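
The archive-and-rename step might look something like the following POSIX sh sketch. It rests on an assumption you should check against your own database directory: that each file of a numeric backup set is named exactly like its CURRENT counterpart, with a numeric generation tag where the word CURRENT would be. The helper name promote_backup and its arguments are invented for this example.

```shell
#!/bin/sh
# Sketch only: promote_backup is a hypothetical helper, not a Muq tool.
# ASSUMPTION: backup files differ from the CURRENT files only in carrying
# a numeric generation tag where "CURRENT" appears in the file name.
# Usage: promote_backup DB_DIR TAG ARCHIVE_DIR
promote_backup() {
    db_dir=$1
    tag=$2
    archive_dir=$3
    mkdir -p "$archive_dir" || return 1
    # Step 1: set the suspect CURRENT fileset aside.
    for f in "$db_dir"/*CURRENT*.db.*; do
        [ -e "$f" ] || continue
        mv -- "$f" "$archive_dir"/ || return 1
    done
    # Step 2: rename each file of the chosen generation to its CURRENT name.
    for f in "$db_dir"/*"$tag"*.db.*; do
        [ -e "$f" ] || continue
        base=${f##*/}              # file name without the directory part
        prefix=${base%%"$tag"*}    # text before the tag
        suffix=${base#*"$tag"}     # text after the tag
        mv -- "$f" "$db_dir/${prefix}CURRENT${suffix}" || return 1
    done
}
```

If you would rather keep the numeric backup set intact, replace the second mv with cp -p so the backup is copied instead of renamed.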

If you're really unlucky, you may have to go back more than one generation before finding a good backup set, but that is unusual. If it happens, you should quickly start suspecting that your hardware has gone bad, or that your Muq executable image has somehow gotten corrupted: you may wish to recompile or redownload (or checksum) it, and verify that other large programs are working correctly on your machine.

A good check is to compile Muq from source and then run "make check". If you suspect you have an intermittent RAM failure that shows up only every hour or two, you may wish to run something like the following in tcsh

    while (1)
        muq-distclean
        make
        make check
        date
    end

and leave it running overnight: If that crashes on clean Muq source, you almost certainly have a hardware problem of some sort. If it runs a full day without problems, your hardware is probably pretty healthy.
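
The loop above keeps going whether or not a build fails, so you have to read back through the output to find the first bad run. A variant that stops at the first failure can be sketched in POSIX sh as follows. The helper name soak_test and its run-count argument are made up for this example, and it only notices failures that show up as a nonzero exit status.

```shell
#!/bin/sh
# Sketch only: soak_test is a hypothetical helper, not a Muq tool.
# Usage: soak_test MAX_RUNS COMMAND [ARGS...]
# Runs COMMAND up to MAX_RUNS times and stops at the first nonzero exit.
soak_test() {
    max_runs=$1
    shift
    run=0
    while [ "$run" -lt "$max_runs" ]; do
        run=$((run + 1))
        if ! "$@"; then
            echo "run $run FAILED at $(date)"
            return 1
        fi
        echo "run $run ok at $(date)"
    done
    return 0
}

# Example invocation (not executed here), using the commands from the loop above:
#   soak_test 1000 sh -c 'muq-distclean && make && make check'
```

Redirect the output to a file if you leave it running overnight, so the failing run number and timestamp are still there in the morning.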
