Programming hints and tips

How Does Windows NT Handle a Disk Crash?

Everybody who uses a computer knows that you should periodically back up all your information "just in case the disk crashes". Well, it finally happened to me. But this column isn't another lecture on maintaining proper backups. This column is about what happened during the disk crash, and how my operating system (Windows NT) handled the situation. I haven't been running NT very long, only about six months. It is heralded by many as the best operating system currently available for PCs.

The disk that crashed was not my operating system disk, but a data disk (D:). The problem started a couple of days before the actual crash, when the system seemed to hang for a moment when accessing the disk. The day of the crash, the hangs started happening more frequently. It got to the point where saving a file was taking a long time, and the normal swapping activity caused frequent noticeable delays in other activities.

I became worried that something wasn't right and I started copying critical files off the disk. After I did the backups, I power-cycled the computer, on the theory that maybe the drive controller needed to be reset.

That's when I found out how bad things really were! During the boot process, the NT operating system starts by checking the disks (in case the system was not properly shut down). This time, during the checks, the computer started finding bad blocks all over the disk. It started trying to repair the damage by replacing the bad blocks and by deleting unrecoverable files. That boot took over half an hour! As soon as the computer started deleting files, I knew I was in trouble, but what could I do? The process was not interruptible.

By the time it was finished, the system had automatically deleted 633,995,264 bytes in 8940 files (according to its summary). I had tried to jot down which directories were affected as it went through deleting the files, and it turned out it was a good thing I did.

After going through my disk, deleting or corrupting (sorry, "correcting") thousands of files, NT continued the boot as though nothing had happened! Thousands of my files had been destroyed, but the boot continued and the system came up. If I had not been there watching it, there would have been no indication that all this damage had occurred. Surely, it should have at least popped up an error message saying something like "Drive D: had errors during last boot, see Drive D Errors.Log for details."

It would also have been nice if, having seen all that damage on the D: drive, it hadn't mounted the drive at all. But, no, not only did it mount the drive, it still used it for some mysterious caching, which, of course, led to even more disk errors. I did find the event log, which told me that there were troubles with the drive. It told me that there were disk errors of type 26 (or some other number), but it gave me no further information.

The next day, I powered up the computer again to try to recover some of my hobby files. This time the damage detected during startup was more severe, and the check process went through a seemingly endless stream of files which were corrupted and needed to be "corrected". The entire boot process took over 15 hours! (About 2 hours of that was spent correcting one file!) Still, when it finally came up, it just blanked the screen and carried on, with still no warning to the user that one of the disks was basically dead. And it still tried to access that disk every few minutes, even though nothing was running.

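Overall, I'm left with a slightly queasy feeling about an operating system which can:

- Delete thousands of files during boot, with no way to interrupt the process.
- Finish booting with no warning to the user that anything was lost.
- Mount, and keep using, a drive it has just found to be failing.
- Take over 15 hours to boot after a disk failure.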

Given these characteristics, I would have to have severe reservations about recommending NT as the basis for any business-critical applications, especially if they involve long term, unattended, operation. It seems there's too much chance that it can fail silently, and that users or client machines can take what they believe are intact files, and discover too late that some of the files have been corrupted. And the fact that it can take 15 hours to boot after failure means that it is not reliable enough to be used for process control.

I'm mystified by how professional designers could make a high-quality operating system that works this way. Has anyone out there had a similar experience, or a vastly different one? Would any NT experts like to contribute some insights? Would anyone coming from another operating system like to comment on how their system would handle it? Write to Rob with your experiences.



Whenever you find yourself at the bottom of a deep hole, the first thing you should do is stop digging.
Page maintained by Rob.