The ptrace system call is crucial to the working of debugger programs like gdb, yet its behaviour is not very well documented, unless you believe that the best documentation is the kernel source itself! I shall attempt to demonstrate how ptrace can be used to implement some of the functionality available in tools like gdb.
ptrace() is a system call that enables one process (the tracer) to control the execution of another (the tracee). It also enables the tracer to change the core image of the traced process, that is, its memory and registers. The traced process behaves normally until a signal is delivered to it. When that happens, the process enters the stopped state and the tracer is notified, which it learns through a wait() call. The tracer then decides how the traced process should respond. The only exception is SIGKILL, which always kills the process.
The traced process may also enter the stopped state in response to some specific events during its execution. This happens only if the tracing process has set the corresponding event flags in the context of the traced process. The tracing process can even kill the traced one by setting its exit code (the signal to be delivered to it when it resumes) to SIGKILL. When it has finished tracing, the tracer may either kill the traced process or let it continue with its execution.
Note: ptrace() is highly dependent on the architecture of the underlying hardware. Applications using ptrace are not easily portable across different architectures and implementations.
The prototype of ptrace() is as follows.
#include <sys/ptrace.h>
long int ptrace(enum __ptrace_request request, pid_t pid,
                void *addr, void *data);
Of the four arguments, the value of request decides what is to be done. pid is the ID of the process to be traced. addr is the address in the traced process (or, for the USER requests, the offset in its user area) to which a word is written when instructed to do so, or from which a word is read and returned as the result of the call. Depending on the request, data holds the word to be written, the signal to be delivered when the traced process is restarted, or a pointer to a buffer in the tracer.
A parent can fork a child process that calls ptrace with request set to PTRACE_TRACEME, after which the parent traces it. A process can also trace an already running process using PTRACE_ATTACH. The different values of request are discussed below.
In the kernel code described here, the first thing ptrace does when called is lock the kernel (lock_kernel()), and it unlocks the kernel just before returning. Let's see what it does in between for the different values of request.
PTRACE_TRACEME: This request is made by the child when it is to be traced by its parent. As said above, any signal (except SIGKILL), whether delivered from outside or generated by an exec call made by the process, causes it to stop and lets the parent decide how to proceed. Inside ptrace(), the only check made is whether the ptrace flag of the current process is already set. If it is not, permission is granted and the flag is set. All parameters other than request are ignored.
PTRACE_ATTACH: Here a process wants to control another process that is already running. One thing to remember is that nobody was allowed to trace/control the init process (later kernels relaxed this), and a process is not allowed to control itself. The current process (the caller) becomes the parent of the process with process ID pid. However, a getppid() by the child (the one being traced) still returns the process ID of its real parent.
What goes on behind the scenes is as follows. When the call is made, the usual permission checks are performed, along with checks on whether the target is init, is the caller itself, or is already being traced. If there is no problem, permission is given and the flag is set. Then the links of the child process are rearranged: the child is removed from the task list and its parent field is changed (the field recording the original parent stays the same). The child is put back into the list at such a position that init_task comes next to it. Finally a SIGSTOP signal is delivered to it. Here addr and data are ignored.
PTRACE_DETACH: Stop tracing a process. The tracer decides whether the child should continue to live. This undoes all the effects of PTRACE_ATTACH/PTRACE_TRACEME. The parent passes the exit code for the child (the signal, if any, to be delivered to it) in data. The ptrace flag of the child is reset. Then the child is moved back to its original position in the task list, and its real parent is written back to the parent field. The single-step bit, which might have been set, is reset. Finally the child is woken up as if nothing had happened to it; addr is ignored.
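PTRACE_PEEKTEXT, PTRACE_PEEKDATA, PTRACE_PEEKUSER: These requests read data from the child's memory and user area. PTRACE_PEEKTEXT and PTRACE_PEEKDATA read from memory and have the same effect (Linux does not have separate text and data address spaces). PTRACE_PEEKUSER reads from the child's user area (struct user), which holds its registers and other per-process information. A word is read and placed into a temporary variable. With the help of put_user() (which copies a single value from kernel memory to the process' memory), it is then written to the location pointed to by data, and the call returns 0 on success. The glibc wrapper hides this detail: it supplies the buffer itself and returns the word as the result of ptrace().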
In the case of PTRACE_PEEKTEXT/PTRACE_PEEKDATA, addr is the address of the location to be read in the child's memory. In PTRACE_PEEKUSER, addr is the offset of the word in the child's user area. When the glibc wrapper is used, data is ignored.
PTRACE_POKETEXT, PTRACE_POKEDATA, PTRACE_POKEUSER: These requests are analogous to the three explained above. The difference is that they write data to the memory/user area of the process being traced. In PTRACE_POKETEXT and PTRACE_POKEDATA, the word passed in data is copied to the child's memory location addr.
In PTRACE_POKEUSER we are trying to modify some locations in the task_struct of the process. As the integrity of the kernel has to be maintained, we need to be very careful. After a number of security checks made by ptrace, only certain portions of the task_struct are allowed to change. Here addr is the offset in the child's user area and data is the word to be written.
PTRACE_SYSCALL, PTRACE_CONT: Both of these wake up the stopped process. PTRACE_SYSCALL makes the child stop at the next entry to or exit from a system call, while PTRACE_CONT just lets the child continue. In both cases, ptrace() sets the exit code of the child to the value contained in data: the signal to be delivered to the child when it resumes, or 0 for none. All this happens only if the signal/exit code is a valid one. ptrace() resets the single-step bit of the child, sets or resets the syscall trace bit, and wakes up the process; addr is ignored.
PTRACE_SINGLESTEP: Restarts the child as PTRACE_CONT does, except that the child is stopped after every instruction. The single-step bit of the child is set. As above, data contains the exit code for the child; addr is ignored.
PTRACE_KILL: When the child is to be terminated, PTRACE_KILL may be used. The murder happens as follows. ptrace() checks whether the child is already dead. If it is alive, the exit code of the child is set to SIGKILL and its single-step bit is reset. Then the child is woken up, and as soon as it starts to run it is killed according to the exit code.
The values of request discussed above are largely independent of the architecture and implementation of the system. The values discussed below allow the tracing process to get/set (i.e., to read/write) the registers of the child process. These register fetching/setting options depend more directly on the architecture. The set of registers (general purpose, floating point and extended floating point registers) and its layout differ from one machine to another, and the kernel has to access the child's saved register state directly.
PTRACE_GETREGS, PTRACE_GETFPREGS, PTRACE_GETFPXREGS: These requests fetch the general purpose, floating point and extended floating point registers of the child process, respectively. The registers are read into the buffer at location data in the parent. The usual access checks are made on the registers. Then the register values are copied to the location specified by data with the help of the getreg() and __put_user() functions; addr is ignored.
PTRACE_SETREGS, PTRACE_SETFPREGS, PTRACE_SETFPXREGS: These requests allow the tracing process to set the general purpose, floating point and extended floating point registers of the child, respectively. There are some restrictions on setting registers: some of them are not allowed to be changed. The values to be copied into the registers are taken from the location data in the parent. Here also addr is ignored.
On success, ptrace() returns zero, except for the PEEK requests, which return the word read. On error it returns -1 and sets errno. Since a successful PEEKDATA/PEEKTEXT may legitimately return -1, it is better to set errno to 0 before the call and check it afterwards. The errors are:
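EPERM : The requested process couldn't be traced. Permission denied (for example, the caller lacks privileges, or the process is init or is already being traced).
ESRCH : The requested process doesn't exist, is not being traced by the caller, or is not stopped (for requests that need a stopped process).
EIO : The request was invalid, or a read/write was made from/to an invalid area of memory. It is also returned for a word-alignment violation or an invalid signal given when restarting the process.
EFAULT: A read/write was made from/to memory that was not really mapped.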
It is really hard to distinguish between the reasons for EIO and EFAULT. They are returned for almost identical errors.
If you found the parameter description a bit dry, don't despair. I shall not attempt anything of that sort again. Instead, I will write simple programs that illustrate many of the points discussed above.
Here is the first one. The parent process counts the number of instructions executed by the test program run by the child.
Here the test program lists the entries of the current directory.
#include <stdio.h>
#include <stdlib.h>
#include <signal.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    long long counter = 0;  /* machine instruction counter */
    int wait_val;           /* child's status, as filled in by waitpid() */
    pid_t pid;              /* child's process id */

    puts("Please wait");

    switch (pid = fork()) {
    case -1:
        perror("fork");
        return 1;
    case 0: /* child process starts */
        /*
         * must be called in order to allow the
         * control over the child process
         */
        if (ptrace(PTRACE_TRACEME, 0, NULL, NULL) == -1) {
            perror("ptrace");
            _exit(1);
        }
        /*
         * executes the program and causes
         * the child to stop and send a signal
         * to the parent; the parent can now
         * switch to PTRACE_SINGLESTEP
         */
        execl("/bin/ls", "ls", (char *)NULL);
        perror("execl");    /* reached only if execl() failed */
        _exit(1);
        /* child process ends */
    default: /* parent process starts */
        /*
         * parent waits for the child to stop at the
         * first instruction of the new program (after execl())
         */
        waitpid(pid, &wait_val, 0);
        /*
         * continue to stop, wait and release until the child is
         * finished. The original test, wait_val == 1407 (0x57f),
         * means the same: low byte 0177 (stopped), high byte 05 (SIGTRAP).
         */
        while (WIFSTOPPED(wait_val) && WSTOPSIG(wait_val) == SIGTRAP) {
            counter++;
            /*
             * switch to single-step tracing and
             * release the child; report an error if unable
             */
            if (ptrace(PTRACE_SINGLESTEP, pid, NULL, NULL) == -1) {
                perror("ptrace");
                break;
            }
            /* wait for the next instruction to complete */
            waitpid(pid, &wait_val, 0);
        }
    }
    printf("Number of machine instructions : %lld\n", counter);
    return 0;
}
Open your favourite editor and write the program. Then compile and run it by typing
cc file.c
./a.out
You can see the number of instructions needed to list your current directory. cd to some other directory, run the program from there and see whether there is any difference. (Note that it may take some time for the output to appear if you are using a slow machine.)
ptrace() is heavily used for debugging. It is also used for system call tracing. The debugger forks, and the child process created is traced by the parent. The program to be debugged is exec'd by the child (in the above program it was "ls"), and after each instruction the parent can examine the register values of the program being run. I shall demonstrate programs which exploit ptrace's versatility in the next part of this series. Goodbye till then.
Sandeep S
I am a final year student of Government Engineering College in Thrissur, Kerala, India. My areas of interest include FreeBSD, Networking and Theoretical Computer Science.
Copyright © 2002, Sandeep S.
Copying license http://www.linuxgazette.com/copying.html
Published in Issue 81 of Linux Gazette, August 2002