... you stand there waiting for Heather to look up from her keyboard...
Oh! Hi everybody! It's certainly been an active month here with The Answer Gang. We had almost 700 slices of Gazette related mail come past my inbox. The longest thread (not pubbed this month, look forward to it next time) was over 50 messages long. Fewer than 20 people got no answer whatsoever (not counting the occasional spammer) and the top reason for not getting a post answered appeared to be simply a lack of interest in that message. Crazy attachments are down a LOT since our sysadmin improved the filters. Ben did a bit more cleanup on the TAG FAQ and Knowledgebase and we have a new posting guidelines page which I hope you find easy to read.
In the land of Linux I'm pleased to note that the 2.4 series kernel is resembling stable since 2.4.17 is over a month old now. A lot of work is being done in 2.5.
Flu struck my area and melted my mind back down to a mere single CPU when I'm used to being an SMP system. Bleh! And before you ask ... yes, I'm feeling better. Lots of liquids, chicken soup, all that.
It appears as though Ghostscript is my evil nemesis of the month. I haven't had time to finish compiling support for that new color printer of mine. In a moment of foolishness I upgraded my Dad-in-law's box and the next few days were completely nuts since kword and gs refused to agree on what fonts to print, or even to get the metrics right so margins would work. They're happy again since I forced ghostscript to uninstall completely and then reinstall. And we still wonder what the heck happened to gnucash in Debian/Woody, though I admit, I haven't looked very hard.
Cheerfully for my mortgage I've had a lot of consulting work this month. Between 600 plus messages and all that, though, there wasn't time for me to fit the usual ten pack (this blurb and nine of the juiciest TAG threads) in under a tighter than usual deadline. Mike will be enjoying a Python conference much of this next month. I hope it counts for a well deserved vacation on his part.
I've not left you completely wanting, though. Here's a few days in the life of The Answer Gang, troubleshooting one of those day to day things that drives everybody nuts once in a while -- segfaults.
Core files are a mess. Good thing we have a dustbin around here.
From Faber Fedor
Answered By Jim Dennis, Dan Wilder, John Karns,
with side comments from Ben Okopnik and Heather Stern
I've got a problem with a RH7.1 machine and no error messages to look at, so I'm wondering how does one debug a problem like this?
Moved a machine from NY to NJ yesterday. When I left it last night, everything was running, esp. Apache. This morning, normal maintenance occurred at 4:02 AM, and when the system (syslog?) went to restart httpd, the restart failed. It's been failing ever since too!
The only http related message in /var/log/messages is
Dec 22 12:27:13 www httpd: httpd startup failed
Access and error logs for httpd are empty.
Running /usr/sbin/httpd (with and without command line parms) generates the message
Segmentation fault (core dumped)
and the requisite core file:
core: ELF 32-bit LSB core file of 'httpd' (signal 11), Intel 80386, version 1, from 'httpd'
File size and date of /usr/sbin/httpd matches my local copy.
Any ideas where to look next?
-- Regards, Faber
Jim Dennis pontificates about troubleshooting apache's startup... -- Heather
[JimD] First, I would run /etc/init.d/httpd or /etc/init.d/apache, or whatever it is on your system. Run it with the "start" option.
(Actually I'd read the /etc/init.d/ start script for that service, and probably I'd manually go through it to figure out what I needed to do in order to run this particular installation of Apache correctly).
Did that. That's what I meant by "it crashed at the command line with and without parameters."
[JimD] To dig further I might replace the httpd with a short "strace wrapper" script:
#!/bin/bash
exec strace -f -o /tmp/apache/strace.out /usr/sbin/httpd.real "$@"
This definitely goes into my bag of tricks (once I decode it)
[JimD] (be sure to mkdir /tmp/apache, and make it writable to the appropriate UID/GID --- whatever the webserver runs as).
I'd look through the strace.out file for clues. Don't leave this running in this fashion for too long. The strace.out files will get huge very quickly; and your performance should suffer a bit.
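One way to start looking through a large strace.out is to jump straight to the crash. The sketch below is only an illustration: the function name calls_before_segv is made up for this example, and it assumes the trace records the signal with the text "SIGSEGV", as strace normally does when a traced process dies of signal 11.

```shell
#!/bin/bash
# Print the first SIGSEGV line in an strace output file, together with
# the N lines (default 10) that lead up to it.
# calls_before_segv is a hypothetical helper name, not part of strace.
calls_before_segv() {
    local file="$1" count="${2:-10}"
    # -m 1 stops at the first match; -B prints the preceding context.
    grep -m 1 -B "$count" 'SIGSEGV' "$file"
}

# Example use against the wrapper's output file:
#   calls_before_segv /tmp/apache/strace.out 20
```

The system calls just before the signal (files opened, sockets connected, lookups attempted) are usually the best hint about what the program was trying to do when it died.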
Considering that it used to work, you did a shutdown, moved the system, brought it back up, and then, presumably, CONFIGURED IT FOR A NEW NETWORK, I'd look very carefully at network masks, routes and related settings.
Very close! The problem turned out to be that the name server the box was using is no longer accessible (the box is there, but dig returns "no name servers were found") and there were no backup name servers in /etc/resolv.conf (mea culpa).
I wouldn't have expected apache to segfault under those conditions, but it did.
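For anyone who wants to check for the same trap, here is a rough sketch that lists the name servers in resolv.conf and asks each one a question. It assumes dig is installed; the function names are invented for this example, and example.com is only a placeholder query.

```shell
#!/bin/bash
# List the nameserver addresses found in a resolv.conf-style file.
# (list_nameservers and check_nameservers are hypothetical names.)
list_nameservers() {
    awk '$1 == "nameserver" { print $2 }' "${1:-/etc/resolv.conf}"
}

# Query each listed server once, with a short timeout, and report
# whether it answered. dig exits non-zero when no reply arrives.
check_nameservers() {
    local ns
    for ns in $(list_nameservers "$1"); do
        if dig +time=2 +tries=1 "@$ns" example.com > /dev/null 2>&1; then
            echo "$ns answers"
        else
            echo "$ns does NOT answer"
        fi
    done
}

# Example: check_nameservers /etc/resolv.conf
```

If only one server is listed and it does not answer, that is the situation described above.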
[JimD] Also, consider upgrading to RH7.2 if you can.
[Faber] I just got my hands on it earlier this week so I'm still evaluating it.
Red Hat's distribution has been very consistent in its release history: avoid the .0, skip the .1, and wait for the .2; that's been the rule since 4.2!
[Faber] Normally, that's what I do, but we needed to upgrade to PHP4 ASAP and it was a lot easier to upgrade the whole system to 7.1 (from 6.2).
thanks again!
Regards,
Faber
[JimD] You're welcome.
... while Dan took a different approach, considering the core file itself. -- Heather
[Dan] 0) Start by making sure there's no error in your httpd.conf by running
apachectl configtest
No doubt there's nothing there. But if there is, you are not apt to find it by examining core files, etc.
If you're an expert C developer
[Faber] At one point in my life, I might have said that, but then only to impress women like Heather.
[Dan] I don't expect Heather's that easily impressed. Especially by guys like me that mistype "developer".
That's ok, I fixed it. That's what editors are for, at least sometimes. I'm more impressed by how people solve problems than by whether they're an expert in everything around them. It's nice if they can solve my problems, though. -- Heather
[Dan] and have the source tree to your apache handy, examining the core file might yield you something.
[Faber] IOW, no, I don't want to do that.
[Dan] Naah, me neither. Last resort.
Mostly it's pretty indirect. Segfaults are typically caused by out-of-bounds pointers or array references, references to allocated memory since freed, confusion about number or type of parameters passed to a function, and the like. The error happens earlier, when the bad pointer is parked someplace, memory is erroneously freed, etc. The fault happens later, when something is dereferenced.
I've spent many a happy and well-paid hour trying, sometimes without success, to track backwards from fault to error. And when you find the error, you may still face a long and winding road back to the defect which caused the error.
Defect ---------> Error -------------> Fault
(Improper         (Something bad       (Result becomes
code construct)   happens)             observable as
                                       unexpected result)
Unless you're an expert C developer, and patient and lucky as well, it's more likely you'll find the problem by a process of elimination.
1) What's changed recently? New application? Change in httpd.conf? New module installed? Try backing out any recent changes, one by one. Restart apache after each thing you back out.
2) Is it possible there's filesystem corruption? Corrupted binaries often fail to run well. Take the machine down and run
fsck -f
on all filesystems. If you find anything amiss, determine what files were affected.
3) Reinstall apache just in case, anyway.
4) Could the machine have other hardware problems? If you have the kernel development packages installed, build the kernel eight or ten times. If you get "died with signal 11" or other abnormal termination, proceed with hardware troubleshooting procedures.
5) Figure out what area of apache is affected. Save your httpd.conf and start with a default one. Will apache start? If so, re-introduce features from the running copy of httpd.conf a few at a time until apache begins dying at startup.
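Step 4 above, repeating a heavy job and watching for an abnormal exit, can be sketched as a small loop. This is only an illustration: repeat_until_fail is a made-up name, and the build command in the example comment is a placeholder for whatever you would normally run in your own kernel tree.

```shell
#!/bin/bash
# Run a command up to N times, stopping at the first failure.
# repeat_until_fail is a hypothetical helper name.
repeat_until_fail() {
    local runs="$1" i=1 status
    shift
    while [ "$i" -le "$runs" ]; do
        "$@" > /dev/null 2>&1 || {
            status=$?
            echo "run $i failed with status $status"
            return "$status"
        }
        i=$((i + 1))
    done
    echo "all $runs runs succeeded"
}

# Example (placeholder build command):
#   repeat_until_fail 10 make bzImage
```

A failure that shows up on one run but not the next, with nothing changed in between, points toward hardware rather than software.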
Let us know how you do. Depending on where you find trouble, the gang can offer further advice. -- Dan Wilder
Jim has quite a bit to say about using strace -- Heather
#!/bin/bash
exec strace -f -o /tmp/apache/strace.out /usr/sbin/httpd.real "$@"
[JimD] It runs a shell (bash) which then exec()s (becomes) a copy of the strace command. That strace command is told to "follow forks" (so we can trace the system calls of child processes) and writes its output to a file in our /tmp/apache directory. strace then runs (fork()s then exec()s) a copy of the "real" httpd with a set of arguments that matches those that were passed to our script.
The distinction between exec()'ing a command and invoking it in the normal way is pretty important. Normal command invocation from a UNIX shell involves a fork() (creating a clone process which is a subshell) and then an exec*() by that shell to transform that subprocess into one which is running the target command.
Meanwhile the parent shell process normally does a wait*() on the child. In other words, it sits there, blocked until the child exits, or until a signal is received.
When we use the shell exec command, it prevents the fork() (there's no creation of a subprocess). The "text" (executable binary code) of the process that was running a copy of your shell (/bin/bash in our case) is overwritten by the "text" of the new program; all of the heap and stack segments (memory blocks) of the old process are freed and/or cleared, and the only traces of the old memory image that remain available are the contents of the process' environment. In other words, the exec command is a wrapper around one of the exec*() calls (there are several different versions of the exec*() call which differ in the format of their arguments, and the preservation/inheritance versus creation of environments).
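A quick way to see the difference is to compare process IDs. In this sketch (the function names are invented for the example) each function starts a shell that prints its own PID and then starts a second shell that does the same, once with exec and once without.

```shell
#!/bin/bash
# With exec: the second shell replaces the first, so both lines show
# the same PID.
with_exec() {
    bash -c 'echo $$; exec bash -c "echo \$\$"'
}

# Without exec: the second shell is a forked child, so the PIDs differ.
# The trailing ":" ensures the inner bash is not the last command, since
# bash may quietly exec the final command of a -c string.
without_exec() {
    bash -c 'echo $$; bash -c "echo \$\$"; :'
}

# Try it:
#   with_exec
#   without_exec
```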
Actually, on Linux execve() is the only member of this family that is a real system call; libc/glibc implements all of the other variations as wrappers around it. The three "variables" on these exec variations are:
- format of the command argument list:
- (which is either done through C varargs --- like printf() and friends, or is a pointer to an array of NUL terminated strings), (execl* vs. execv*)
- environment handling:
- whether the process keeps its current environment or overwrites it. The execle() and execve() versions have an extra parameter pointing at a NULL terminated array of NUL terminated strings.
- path searching:
- The first argument of the execvp() and execlp() functions can be a simple command basename --- while all other variations require a qualified path. The "p" versions will search the PATH as a shell would.
It appears that you can either search the PATH or create a new environment, but not both. Of course you can use a simple execl() or execv() to do neither. Of course you can read the man exec(3) manual pages in the library functions section of your online docs to read even more details about this.
When I'm teaching shell scripting I spend a considerable amount of time clarifying this worm's eye view of how UNIX and the shell handle fork()s and exec*()s. I draw diagrams representing the memory space and environment of a process, and another of a child process (connected by dotted lines labeled "fork()"). Then I crosshatch most of the memory space --- leaving the environment section, and label that exec*().
When I do this, people understand how the environment really works. The "export" shell command moves a shell variable and its value from the local heap "out" to the environment region of memory. Once they really understand that, then they won't get too confused when a child process sets a shell variable, exports it, and then their original process can't see the new value. ("export" is more of a memory management operator than an inter-process communications mechanism; at best it is a one-way IPC, copying from parent to children).
After that I generally have to explain about some implicit forms of sub-process creation (forking) that most people miss. In particular I remind them that pipes are an *inter-process* communications channel. So, any time you see or use a | operator in the shell, you are implicitly creating a subprocess. That's why a command like:
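The one-way nature of export can be shown in a few lines. The variable and function names below are made up for this sketch.

```shell
#!/bin/bash
unset local_only exported
local_only="shell variable"      # stays in this shell's own memory
export exported="environment"    # copied into the environment

# A child process sees only what was exported.
child_view() {
    bash -c 'echo "local_only=[$local_only] exported=[$exported]"'
}

# A child can change and re-export its own copy, but the change
# never travels back up to the parent.
child_cannot_change_parent() {
    bash -c 'exported="changed by child"; export exported'
    echo "$exported"
}

# Try it:
#   child_view
#   child_cannot_change_parent
```

The first function shows an empty local_only in the child; the second shows the parent's value untouched after the child has exited.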
unset bar; echo foo | read bar; echo $bar
[Ben] Oh, that's cute. I go through pretty much the same spiel - some of it admittedly cribbed from your description of this, because I liked it the first time I heard it - but the way I've been demonstrating it is with a
while read bar; do echo $bar; done < file
loop. This nails down the other end. Very cool. (Scribbling notes in newly acquired Palm Pilot)
[JimD] ... will return an empty value in most shells. The read command is executed in a subprocess which promptly exits, freeing the memory that held its copy of the bar variable/value pair. (I say most shells because ksh '93 and zsh create their subprocesses on the left hand side of their pipe operators. That's one of those subtle differences among shells. Personally I think bash and others do it wrong, the ksh/zsh semantics are superior and I hope bash 2.x or 3.x will adopt them, or offer a shopt, shell option, to select the desired semantics).
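Here is a small side-by-side sketch of the pipe case and a redirection that avoids it, assuming bash with its default settings. The function names are invented for the example.

```shell
#!/bin/bash
# In bash, read on the right of a pipe runs in a subprocess, so the
# value it reads is gone by the time the pipeline finishes.
pipe_read() {
    bar=""
    echo foo | read bar
    echo "[$bar]"
}

# Redirecting input instead of piping keeps read in the current shell,
# so the variable survives.
redirect_read() {
    bar=""
    read bar <<EOF
foo
EOF
    echo "[$bar]"
}

# Try it:
#   pipe_read       # prints [] in bash
#   redirect_read   # prints [foo]
```

For what it's worth, later bash releases (4.2 onward) added a lastpipe shell option that runs the last stage of a pipeline in the current shell when job control is off.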
The "$@" ensures that the arguments that were passed to us will be preserved in count and contents. If we used "$*" we'd be passing a single argument to our command. That single argument would contain the text of all of the original arguments, concatenated as one string, separated by spaces (or by the first character from IFS if you believe the docs). If we used $* (no soft quotes) we'd be having the current shell resplit the number of arguments --- they'd have the same contents, but any arguments that had previously had embedded spaces (or other IFS characters) would be separated accordingly.
The "$@" handling is the most subtle part of this script. An unquoted $@ would be the same as an unquoted $* (as far as I can tell). It is just the "$@" that gets the special handling. ($* and "$*" aren't special cases, they are expanded and split in the normal way; "$@" is expanded and sort of "internally requoted" to preserve the $# --- argument count).
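Counting arguments makes the three cases easy to compare. In this sketch count_args and demo are made-up names; demo is meant to be called with at least one argument containing a space.

```shell
#!/bin/bash
# Report how many arguments were received.
count_args() { echo $#; }

demo() {
    count_args "$@"   # each original argument preserved as-is
    count_args "$*"   # everything joined into a single argument
    count_args $*     # joined text re-split on IFS characters
}

# Try it with two arguments, one containing a space:
#   demo "a b" c
```

Called as shown, the three lines report 2, 1 and 3 arguments respectively: the original pair, one joined string, and three words after resplitting.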
If you were going to need to do this frequently we might write a "strace.wrapper.sh" shell script which would work a bit like this:
#!/bin/bash
OLDMASK=$(umask)
umask 077
TMPDIR=/tmp/$(basename "$1")$$
mkdir "$TMPDIR" || exit 1   ## make a temporary directory or die
umask "$OLDMASK"
TARGETCMD="$1"
shift
exec strace -f -o "$TMPDIR/strace.out" "$TARGETCMD" "$@"
In this example we call strace.wrapper.sh with an extra argument, the name of the command to be "wrapped." We then fuss a little with umask (to ensure that our process' output will have some privacy from prying eyes) and do an atomic "make a private dir or die trying". (This is the safest temp file handling that can be managed from sh, as far as I know.)
Then we restore our umask, (so we don't create a Heisenbug by challenging one of our target command's hidden assumptions about the permissions of files it creates). We then grab our target command, shift it off our argument list (which does NOT disturb the quoting of the remaining arguments) and call our strace command as before --- with variables interpolated as necessary.
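On systems that ship the mktemp utility, the "private dir or die" step can also be handed to it. This is an alternative sketch, not what the script above does: make_trace_dir is a hypothetical name, and it assumes a mktemp that supports the -d option.

```shell
#!/bin/bash
# Create a private scratch directory named after the wrapped command
# and print its path. mktemp -d creates the directory atomically, with
# access for the owner only, and fails if it cannot.
make_trace_dir() {
    mktemp -d "${TMPDIR:-/tmp}/strace.$(basename "$1").XXXXXX"
}

# Example:
#   dir=$(make_trace_dir /usr/sbin/httpd) || exit 1
#   exec strace -f -o "$dir/strace.out" /usr/sbin/httpd "$@"
```

Because mktemp picks the random suffix itself, there is no need to touch the umask or to rely on the PID for uniqueness.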
Mind you I don't use this script. I don't bother since I can do it about as easily by hand. Also this script wouldn't be the best choice for CGI, inetd launched, or similar cases. In those cases we're better renaming the original binary.
Of course we were all happy when Faber found what it was! We encouraged him to send in his bug report -- Heather
I wouldn't have expected apache to segfault under those conditions, but it did.
[JimD] Report it as a bug (after upgrading to the latest stable release). Try to isolate the .conf directive(s) that are involved, if possible.
[Dan] ... The error happens earlier, when the bad pointer is parked someplace, memory is erroneously freed, etc. The fault happens later, when something is dereferenced.
Well, as I told Jim, the fact that it couldn't find a name server caused it to segfault. Weird; you would have thought it would have exited with a message at least.
[John K] It sounds like there's a bug or some abnormality with apache's handling of a situation which it doesn't expect in normal operation. IOW, a problem with error handling. If the apache version is not the latest stable version, you might want to consider upgrading. If it is the latest, then you may want to consider reporting it to the apache developers.
...and of course we congratulated him on his success, with some extra thoughts on general troubleshooting. -- Heather
[Dan] Congratulations on solving the problem.
That's what I call the "natural history approach". Examine carefully the behavior and habitat of the creature in question, and think carefully about what you've observed.
I've probably fixed a lot more bugs in my life by the natural history method, than I have by the method of examining core files, or for that matter running under a debugger or emulator.
Strace, mentioned separately in this thread, is a little harder to classify. A program that attaches itself to a running process and dumps out information about system calls, it affords a level of information about a program that may sometimes come close to what you'd see using a debugger.
Mostly it doesn't, but sometimes it provides that key observation not available by other means which allows us to finally come to grips with a bug. I'd group it with natural history tools, perhaps as an analog to a radio collar. You know where the animal's been, but maybe not why, or what it did there. -- Dan Wilder
[JimD] I like to use the classic "OSI reference model" as a rough troubleshooting sequence. Keep going down the stack (from application, down through network and to the physical layers until you isolate the problem, then proceed back upwards correcting each problem until the application works).