[personal profile] yesthattom
My FreeBSD server (the one in the co-lo that I use for most email, IRC, and other things) is down right now. It was messed up and wouldn’t respond to the “reboot” command. I had 5-6 of them queued up but they did nothing. I tried everything, including killing individual processes. “kill -1 1” and “kill 1” didn’t do anything. Old Unix had dangerous commands like “crash” and “panic” which would have gotten me out of this situation, but alas those have been removed (and for good reason).

The problem started at exactly midnight when a backup process tried to do a UFS snapshot before backups started. I can see the snapshot command is stuck. I’m sure this is a file system bug and now everything else that tries to touch too much of the file system also gets stuck.

Here’s a Unix tip: when you run a command that might not give you your prompt back, put it in the background by typing “&” at the end of the line before you press return.

I’d been putting nearly every command I typed in the background. Even “ps” and “ls”.
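
As a minimal sketch of that tip (the `ls /` command and the variable names `bg_pid` and `bg_status` are only illustrative, not what was actually typed on the server), backgrounding a command in a Bourne-style shell looks roughly like this:

```shell
# Start a command in the background; the trailing "&" returns the prompt at once.
ls / > /dev/null 2>&1 &
bg_pid=$!            # PID of the most recent background job

# The shell stays usable while the job runs (or hangs).
echo "started job $bg_pid"

# Collect the exit status once the job finishes. On a wedged file system this
# wait could itself block, so it would be skipped there.
wait "$bg_pid"
bg_status=$?
echo "job exited with status $bg_status"
```

Backgrounding only protects the shell. A process stuck inside the kernel waiting on the file system will most likely stay stuck whether or not it was started with “&”.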

I have remote access to the serial console, so I’m in pretty good shape as things go wrong on the machine. I can still be “root”. However, issuing the “reboot” command does nothing.

After half an hour or more of not being able to reboot this machine, I remembered that the “init” command might do something. Sadly, the man page wouldn’t display properly, so instead of finding another FreeBSD machine to check, I typed “init q”, thinking it was the “quick reboot”. No, that resets all the gettys, which will never complete because of the state this machine is in. (The correct command is “init 6”, which would have rebooted it, I think.)

Of course this is the one command I didn’t put in the background. It is also one of the few commands that can’t be CTRL-Z’ed or CTRL-C’ed or anything. So my machine is stuck saying that it is trying to kill process 718 over and over again. This is on the serial console. No hope for fixing it.

I need someone to visit the colo and physically power the machine off and back on.

I fail at rebooting.

Date: 2008-08-16 07:24 am (UTC)
From: [identity profile] bikergeek.livejournal.com
Good luck, and hopefully your filesystems come back up OK.

Date: 2008-08-16 11:54 am (UTC)
From: [identity profile] sweh.livejournal.com
Unlikely that "init 6" would work there, either. That merely sends a signal to the existing init process to do work.

Your best bet would have been "kill -9 1". Since "kill 1" didn't work, I very much doubt that would have caused a reboot either.

This is where real rackable kit with LOM shows its worth; remote power cycle :-)

Date: 2008-08-16 03:47 pm (UTC)
From: [identity profile] yesthattom.livejournal.com
I had done "kill 1", "kill -1 1", and "kill -9 1". If "init 6" just does those, then you are right.

For an amateur, hobby-run colo, serial console is a big advance. Full power control would be nice, but we all have keys to his house.

Date: 2008-08-16 03:53 pm (UTC)
From: [identity profile] sweh.livejournal.com
From the manpage:
     If run as a user process as shown in the second synopsis line, init will
     emulate AT&T System V UNIX behavior, i.e., super-user can specify the
     desired run-level on a command line, and init will signal the original
     (PID 1) init as follows:

     Run-level    Signal     Action
     0            SIGUSR2    Halt and turn the power off
     1            SIGTERM    Go to single-user mode
     6            SIGINT     Reboot the machine
     c            SIGTSTP    Block further logins
     q            SIGHUP     Rescan the ttys(5) file


So "init 6" is the same as "kill -2 1" on FreeBSD. If it ignored a "kill -9" (which is unblockable) then "kill -2" won't stand much chance :-)
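
To make the mapping in the quoted table concrete, here is a small sketch that prints the `kill` command equivalent to each `init` run-level. The function name `runlevel_to_signal` is made up for illustration, and the loop only prints the commands; it never signals PID 1.

```shell
# Map a run-level argument to the signal that a user-run init sends to PID 1,
# following the FreeBSD man page table quoted above.
runlevel_to_signal() {
    case "$1" in
        0) echo USR2 ;;   # halt and turn the power off
        1) echo TERM ;;   # go to single-user mode
        6) echo INT ;;    # reboot the machine
        c) echo TSTP ;;   # block further logins
        q) echo HUP ;;    # rescan the ttys(5) file
        *) echo "unknown run-level: $1" >&2; return 1 ;;
    esac
}

# Print the equivalent kill command for each run-level without running it.
for level in 0 1 6 c q; do
    printf 'init %s  ->  kill -%s 1\n' "$level" "$(runlevel_to_signal "$level")"
done
```

Seen this way, "init q" is just a SIGHUP to PID 1 and "init 6" is a SIGINT.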

One reason I like virtual colos (eg linode or Panix v-colo) is that you get a (pseudo) console and (pseudo) power management ability. For the low resource needs I have, such virtual machines are perfect.

Date: 2008-08-16 04:37 pm (UTC)
From: [identity profile] gerardp.livejournal.com
BTW, APC power control systems (including UPSes) with IP access are not that expensive. I did a project where I had systems in other countries that were not under direct sysadmin control. We set up APC "front-ends" with IP addresses where we could power cycle the machines. It was very easy to set up and useful. Just fyi ...

Date: 2008-08-16 05:24 pm (UTC)
From: [identity profile] mrfantasy.livejournal.com
Probably doesn't help when you're over 10,000 miles away, though.
Edited Date: 2008-08-16 05:24 pm (UTC)

Date: 2008-08-16 06:42 pm (UTC)
From: [personal profile] geekosaur
init 6 is just kill 1 with some attempt at cleanup beforehand. The cleanup would have hung.

That said, when my FreeBSD 6.1+ did that, the only recovery was a power cycle. Any VFS activity went into uninterruptible wait.

Date: 2008-08-16 12:23 pm (UTC)
From: [identity profile] cpj.livejournal.com
Init 6 is what you would have wanted, but I suspect that your OS was in a pretty messed up state. Haven't touched FreeBSD in years, but I did just upgrade my Solaris 10 host to release 05/08.

Date: 2008-08-16 12:54 pm (UTC)
From: [identity profile] awfief.livejournal.com
*hugs* I have this quote on my wall, it's from a daily calendar, I think from "Meditations For Women who do too much" (from Sunday, Feb. 23 2003 if that helps):

"When the mechanical-technological things in our life break down, it is not a personal attack on us. It is just the nature of the mechanical-material world."

*hugs*

Date: 2008-08-16 02:21 pm (UTC)
From: [identity profile] dossy.livejournal.com
See, if you were only running a Windows operating system, the machine would have blue screened and possibly after writing a mini-dump crash image, rebooted itself.

I've come to realize that having a remote serial console isn't as useful as a remotely accessible power outlet--if I had to choose one or the other, I'd take the remote reboot capability over remote console any day. The days of Sun hardware that can be dropped into its PROM and an ok> prompt for a quick reboot are long gone ...

Date: 2008-08-16 03:56 pm (UTC)
From: [identity profile] sweh.livejournal.com
An old Windows joke...

It's a lie to say that Windows admins have to keep rebooting their boxes. Windows is perfectly capable of rebooting itself without any admin intervention... and does so frequently.

Date: 2008-08-16 02:31 pm (UTC)
From: [identity profile] n5red.livejournal.com
IPMI can be your friend. I can easily control the power remotely on a number of systems.

Date: 2008-08-16 02:49 pm (UTC)
From: [identity profile] kazmat.livejournal.com
For the record, this is one of the reasons I love Suns and hate PC hardware (though many have gotten better recently by including separate service processors). There's nothing like just dropping to the LOM/ALOM prompt and power cycling the box when it misbehaves this badly.

Date: 2008-08-16 06:10 pm (UTC)
From: [personal profile] agent_dani
Compaq/HP has had a LOM available as an expansion card for many years and it began to be integrated (iLOM) several years ago. Early on there were some issues about remote power control (it could generate a system reset but couldn't control power in some models) but the iLOMs have that capability. I understand Dell has something similar but I've never used it.

I was really not a fan of Sun's LOM (not sure which sort it was) at a previous job; then I got to this job and realized the problem there was the co-worker who was the Sun guy didn't understand what they were and how to use them but didn't let that stop him from deciding how we would use them.

Date: 2008-08-16 05:15 pm (UTC)
From: [identity profile] dr-memory.livejournal.com
If it's a reasonably modern PC and the serial port is built in, it's not out of the range of possibility that some magic control-key combination could drop you into the BIOS and give you the ability to restart from there.

Otherwise, perhaps you need a weasel, or just a random generic watchdog card...

Date: 2008-08-16 08:24 pm (UTC)
From: [identity profile] yesthattom.livejournal.com
Sadly it is a very old PC and I've never seen such a control sequence. Though, I should research it for the future.

Ah the weasel... yes, that's one fine card. I always coveted systems that had it.

~fans self~

Date: 2008-08-17 06:34 pm (UTC)
From: [identity profile] geeksdoitbetter.livejournal.com
i love it when you speak geek

Date: 2008-08-17 07:05 pm (UTC)
From: [identity profile] http://users.livejournal.com/cgull_/
If it's a big disk, and newfs'ed with lots of inodes and the default 16k/4k block/frag sizes, snapshots can take a really, really long time (I've seen 30 minutes on a 2TB disk). And all other activity on that filesystem hangs during that time too. 4 hours is a little far fetched for that to be the problem, though.

That said, there have been UFS2 and snapshot bugs, too, though I've not actually encountered any of them personally.

If you have a serial console, and if you have DDB compiled into the kernel, a break on the serial line should drop you into the debugger, from which you can reboot. I think. (I've not used a serial console in a long while.)
