I fail at rebooting
Aug. 16th, 2008 04:41 pm
My FreeBSD server (the one in the co-lo that I use for most email, IRC, and other things) is down right now. It was messed up and wouldn’t respond to the “reboot” command. I had 5-6 of them queued up but they did nothing. I tried everything, including killing individual processes. “kill -1 1” and “kill 1” didn’t do anything. Old Unix had dangerous commands like “crash” and “panic” which would have gotten me out of this situation, but alas those have been removed (and for good reason).
The problem started at exactly midnight when a backup process tried to do a UFS snapshot before backups started. I can see the snapshot command is stuck. I’m sure this is a file system bug and now everything else that tries to touch too much of the file system also gets stuck.
Here’s a Unix tip: when you run a command that might not give you your prompt back, put it in the background by typing “&” at the end of the line before you press return.
I’d been putting nearly every command I typed in the background. Even “ps” and “ls”.
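As a minimal sketch of that tip (with `sleep 2` as a stand-in for a command that might hang, and `bg_pid` and `status` as names made up for this example), backgrounding a command and checking on it without waiting looks roughly like this:

```shell
# Start the command in the background; the trailing "&" returns the prompt at once.
# "sleep 2" is only a placeholder for something like "ls" on a wedged file system.
sleep 2 &
bg_pid=$!    # $! holds the PID of the most recent background job

# The shell is still usable, so check on the job without blocking on it.
# "kill -0" sends no signal; it only tests whether the process still exists.
if kill -0 "$bg_pid" 2>/dev/null; then
    status="running"
else
    status="finished"
fi
echo "job $bg_pid is $status"

# Only wait when you actually want to block until the job ends.
wait "$bg_pid"
```

Backgrounding only keeps the prompt usable. A process stuck in an uninterruptible disk wait will still ignore signals until the kernel lets it go.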
I have remote access to the serial console, so I’m in pretty good shape as things go wrong on the machine. I can still be “root”. However, issuing the “reboot” command does nothing.
After a half hour or more of not being able to reboot this machine, I remembered that the “init” command might do something. Sadly, the man page wouldn’t display right, so instead of finding another FreeBSD machine to be sure, I typed “init q”, thinking it was the “quick reboot”. No, that resets all the gettys, which will never complete because of the state this machine is in. (The correct command is “init 6”, which would have rebooted it, I think.)
Of course this is the one command I didn’t put in the background. It is also one of the few commands that can’t be CTRL-Z’ed or CTRL-C’ed or anything. So my machine is stuck saying that it is trying to kill process 718 over and over again. This is on the serial console. No hope for fixing it.
I need someone to visit the colo and physically power the machine off and back on.
I fail at rebooting.
no subject
Date: 2008-08-16 07:24 am (UTC)
no subject
Date: 2008-08-16 11:54 am (UTC)
Your best bet would have been "kill -9 1". Since "kill 1" didn't work, I very much doubt that would have caused a reboot either.
This is where real rackable kit with LOM shows its worth; remote power cycle :-)
no subject
Date: 2008-08-16 03:47 pm (UTC)
For an amateur, hobby-run colo, serial console is a big advance. Full power control would be nice, but we all have keys to his house.
no subject
Date: 2008-08-16 03:53 pm (UTC)
If run as a user process as shown in the second synopsis line, init will emulate AT&T System V UNIX behavior, i.e., super-user can specify the desired run-level on a command line, and init will signal the original (PID 1) init as follows:
Run-level  Signal   Action
0          SIGUSR2  Halt and turn the power off
1          SIGTERM  Go to single-user mode
6          SIGINT   Reboot the machine
c          SIGTSTP  Block further logins
q          SIGHUP   Rescan the ttys(5) file
So "init 6" is the same as "kill -2 1" on FreeBSD. If it ignored a "kill -9" (which is unblockable) then "kill -2" won't stand much chance :-)
One reason I like virtual colos (eg linode or Panix v-colo) is that you get a (pseudo) console and (pseudo) power management ability. For the low resource needs I have, such virtual machines are perfect.
no subject
Date: 2008-08-16 04:37 pm (UTC)
no subject
Date: 2008-08-16 05:24 pm (UTC)
no subject
Date: 2008-08-16 06:42 pm (UTC)
That said, when my FreeBSD 6.1+ did that, the only recovery was a power cycle. Any VFS activity went into uninterruptible wait.
no subject
Date: 2008-08-16 12:23 pm (UTC)
no subject
Date: 2008-08-16 12:54 pm (UTC)
"When the mechanical-technological things in our life break down, it is not a personal attack on us. It is just the nature of the mechanical-material world."
*hugs*
no subject
Date: 2008-08-16 02:21 pm (UTC)
I've come to realize that having a remote serial console isn't as useful as a remotely accessible power outlet. If I had to choose one or the other, I'd take the remote reboot capability over remote console any day. The days of Sun hardware that can be dropped into its PROM and an ok> prompt for a quick reboot are long gone ...
no subject
Date: 2008-08-16 03:56 pm (UTC)
It's a lie to say that Windows admins have to keep rebooting their boxes. Windows is perfectly capable of rebooting itself without any admin intervention... and does so frequently.
no subject
Date: 2008-08-16 02:31 pm (UTC)
no subject
Date: 2008-08-16 02:49 pm (UTC)
no subject
Date: 2008-08-16 06:10 pm (UTC)
I was really not a fan of Sun's LOM (not sure which sort it was) at a previous job. Then I got to this job and realized the problem there was that the co-worker who was the Sun guy didn't understand what they were or how to use them, but didn't let that stop him from deciding how we would use them.
no subject
Date: 2008-08-16 05:15 pm (UTC)
Otherwise, perhaps you need a weasel, or just a random generic watchdog card...
no subject
Date: 2008-08-16 08:24 pm (UTC)
Ah the weasel... yes, that's one fine card. I always coveted systems that had it.
~fans self~
Date: 2008-08-17 06:34 pm (UTC)
no subject
Date: 2008-08-17 07:05 pm (UTC)
That said, there have been UFS2 and snapshot bugs, too, though I've not actually encountered any of them personally.
If you have a serial console, and if you have DDB compiled into the kernel, a break on the serial line should drop you into the debugger, from which you can reboot. I think. (I've not used a serial console in a long while.)