Strace (and lsof)

Intro
Commands
What does the output (of strace) mean
What does the output (of lsof) mean
More details on the 4th column of lsof(FD -file descriptor)
More details on the 9th column of lsof(NAME)
example 1-If a process is stuck
example 2-A fully worked example
example 3-strace appache
example 4-find more details on a process
example 5-troubleshooting a slow connection
More reading

Intro

Strace traces system calls and signals
Lsof shows open files (remember everything in Linux is a file)

Strace combined together with lsof. Helps you can delve into what a process is doing, what connections it has open, whats it trying to access, is it stuck? As a combination they let you troubleshoot a variety of problems.

Caution a traced process can run slower (i.e there is a performance overhead) . Other sites talk about the dangers of using strace on a production enviroment. i.e http://www.brendangregg.com/blog/2014-05-11/strace-wow-much-syscall.html

Commands

A lot of the following commands when used together can really delve into a system and troubleshoot problems.

the most basic way to run is strace ls

more commonley you would run strace -p -pid

lsof +D /etc will show you which files are using the etc folder

strace -f -p pid will trace child processes as well (ps -L will show all possible threads)

strace -ttt -p PID # will print the time since epoch i.e sudo strace -ttT -f -p `pidof mysqld` will show timing and calls for mysql (the -f willshow a lot more output)

strace -T -p PID #print syscall durations with (distorted) microsecond resolution:

strace -c -p 1234 will show profiling data on a process

lsof -u testuser check which files test user has open

lsof -i will show network connections

lsof -p pid | grep [0-9]w # will show files being written to
lsof -p pid | grep [0-9]r # will show files being read from

strace -e open lsshows you open calls
strace -e trace=open,read ls /home shows you open & read calls
strace -o /tmp/ouput to also copy out to /tmp/ouput.txt

strace -s 1024 ls prints more than the default 32 chars for each string

ss -l # listening ports
ss -p # Show process using socket.
ss -e # shows exteneded information

netstat -p Show the PID and name of the program to which each socket belongs.

What does the output (of strace) mean

A normal running process should show you a lot of continous output.

If the output begins with select it means the Programs is monitoring the file descriptors on the system for activity

If the process is stuck (or for example has a timeout/sleep call coded) it will not be showing continous output.

from the man page of strace

"Errors (typically a return value of -1) have the errno symbol and error string appended.

i.e

WNOHANG) = -1 ECHILD (No child processes)

So When running strace if you see:

the output of EAGAIN it is often raised when performing non-blocking I/O. It means "there is no data available right now, try again later". More descriptions of outputs at http://www-numi.fnal.gov/offline_software/srt_public_context/WebDocs/Errors/unix_system_errors.html

If you see FUTEX_WAIT it means you are tracing the original parent thread, and it’s doing nothing but waiting for some other threads to finish.
To find more details try strace -f -p which will also strace any forked processes & ps -efL | grep pid to show any related threads.
Along with lsof -p pid.

What does the output (of lsof) mean

After running lsof or more specifcally lsof -p pid you will get a selection of output.

The columns are

COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME

For more details see the below examples & "the More details on the 4th column of lsof (FD -file descriptor) section

More details on the 4th column of lsof(FD -file descriptor)

In particular the following can be useful, to get an idea of what a process is doing

CWD = current working directory
rtd = root directory
txt = current running command

taken from the man page of lsof, but worth including...

FD is the File Descriptor number of the file or:

cwd current working directory;

Lnn library references (AIX);

err FD information error (see NAME column);

jld jail directory (FreeBSD);

ltx shared library text (code and data);

Mxx hex memory-mapped type number xx.

m86 DOS Merge mapped file;

mem memory-mapped file;

mmap memory-mapped device;

pd parent directory;

rtd root directory;

tr kernel trace file (OpenBSD);

txt program text (code and data);

v86 VP/ix mapped file;

FD is followed by one of these characters, describing the mode under which the file is open:

r for read access;

w for write access;

u for read and write access;

space if mode unknown and no lock

character follows;

‘-’ if mode unknown and lock

character follows.

The mode character is followed by one of these lock characters, describing the type of lock applied to

the file:

N for a Solaris NFS lock of unknown type;

r for read lock on part of the file;

R for a read lock on the entire file;

w for a write lock on part of the file;

W for a write lock on the entire file;

u for a read and write lock of any length;

U for a lock of unknown type;

x for an SCO OpenServer Xenix lock on part of the file;

X for an SCO OpenServer Xenix lock on the entire file;

space if there is no lock.

More details on the 9th column of lsof(NAME)

If you see the word pipes as taken from http://www.linfo.org/pipes.html

A pipe is a form of redirection that is used in Linux and other Unix-like operating systems to send the output of one program to another program for further processing.

Redirection is the transferring of standard output to some other destination, such as another program, a file or a printer, instead of the display monitor (which is its default destination). Standard output, sometimes abbreviated stdout, is the destination of the output from command line (i.e., all-text mode) programs in Unix-like operating systems.

Pipes are used to create what can be visualized as a pipeline of commands, which is a temporary direct connection between two or more simple programs. This connection makes possible the performance of some highly specialized task that none of the constituent programs could perform by themselves. A command is merely an instruction provided by a user telling a computer to do something, such as launch a program. The command line programs that do the further processing are referred to as filters.

This direct connection between programs allows them to operate simultaneously and permits data to be transferred between them continuously rather than having to pass it through temporary text files or through the display screen and having to wait for one program to be completed before the next program begins.

A very simple example of the benefits of piping is provided by the dmesg command, which repeats the startup messages that scroll through the console (i.e., the all-text, full-screen display) while Linux is booting (i.e., starting up). dmesg by itself produces far too many lines of output to fit into a single screen; thus, its output scrolls down the screen at high speed and only the final screenful of messages is easily readable. However, by piping the output of dmesg to the filter less, the startup messages can conveniently be viewed one screenful at a time, i.e. dmesg | less

Example 1- What to do if a process is stuck

1) if for example process 1234 run lsof -p 1234 is stuck (run a strace to confirm it i.e strace -f -p 1234) and you run an strace and you get the output of "read(3,"

2) To find what the 3 is related to....

run lsof -np 1234 and find where the 4th column =3 (see the section "More details on the 4th column")

If you cannot find a suitable match, the socket/file has disappeared and the process should have exited (you may need to kill the process)

If you find a match lets trace it and see what is going on

lets presume the output is

mysql 1234 root 3u unix 0xffff88001c96bcc0 0t0 568922 socket

3) run a netstat on the node (second to last column)

netstat -nxp | grep 568922 and you will get output similar to

unix 3 [ ] STREAM CONNECTED 568922 13120 /mysql

so now we can tell it is related to mysql and we have a process id to track

4) the last step, get more details on the process

ps -ef | grep 13120

root 13120 13119 0 08:52 ? 00:00:00 mysql -e show databases ; select sleep (5000);show databases;

Example 2- A fully worked example

1) run mysql -e "show databases ; select sleep (5000);show databases;"

2) ps -ef | grep mysql shows pid of 6122

3) strace -p 6122

Process 6122 attached - interrupt to quit

read(3,

#and then it hangs (because of the sleep)

4) lsof -p 6122

COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME

mysql 6122 root cwd DIR 252,1 4096 393218 /root

mysql 6122 root rtd DIR 252,1 4096 2 /

mysql 6122 root txt REG 252,1 338792 534138 /usr/bin/mysql

mysql 6122 root mem REG 252,1 65928 917954 /lib64/libnss_files-2.12.so

mysql 6122 root mem REG 252,1 122040 917830 /lib64/libselinux.so.1

mysql 6122 root mem REG 252,1 110960 917956 /lib64/libresolv-2.12.so

mysql 6122 root mem REG 252,1 10192 917652 /lib64/libkeyutils.so.1.3

mysql 6122 root mem REG 252,1 43728 917643 /lib64/libkrb5support.so.0.1

mysql 6122 root mem REG 252,1 174840 917756 /lib64/libk5crypto.so.3.1

mysql 6122 root mem REG 252,1 14664 917791 /lib64/libcom_err.so.2.1

mysql 6122 root mem REG 252,1 941920 917820 /lib64/libkrb5.so.3.3

mysql 6122 root mem REG 252,1 277704 917860 /lib64/libgssapi_krb5.so.2.2

mysql 6122 root mem REG 252,1 469528 917601 /lib64/libfreebl3.so

mysql 6122 root mem REG 252,1 19536 917950 /lib64/libdl-2.12.so

mysql 6122 root mem REG 252,1 135896 917586 /lib64/libtinfo.so.5.7

mysql 6122 root mem REG 252,1 1921216 917757 /lib64/libc-2.12.so

mysql 6122 root mem REG 252,1 90880 917610 /lib64/libgcc_s-4.4.7-20120601.so.1

mysql 6122 root mem REG 252,1 596264 917951 /lib64/libm-2.12.so

mysql 6122 root mem REG 252,1 987096 532633 /usr/lib64/libstdc++.so.6.0.13

mysql 6122 root mem REG 252,1 88600 917802 /lib64/libz.so.1.2.3

mysql 6122 root mem REG 252,1 1950976 532613 /usr/lib64/libcrypto.so.1.0.1e

mysql 6122 root mem REG 252,1 441112 532706 /usr/lib64/libssl.so.1.0.1e

mysql 6122 root mem REG 252,1 113432 917952 /lib64/libnsl-2.12.so

mysql 6122 root mem REG 252,1 40400 917550 /lib64/libcrypt-2.12.so

mysql 6122 root mem REG 252,1 1584680 787479 /usr/lib64/mysql/libmysqlclient.so.16.0.0

mysql 6122 root mem REG 252,1 142640 917555 /lib64/libpthread-2.12.so

mysql 6122 root mem REG 252,1 193168 917841 /lib64/libncursesw.so.5.7

mysql 6122 root mem REG 252,1 154520 917608 /lib64/ld-2.12.so

mysql 6122 root 0r FIFO 0,8 0t0 535651 pipe

mysql 6122 root 1w FIFO 0,8 0t0 535652 pipe

mysql 6122 root 2w FIFO 0,8 0t0 535652 pipe

mysql 6122 root 3u unix 0xffff88001f9f59c0 0t0 537283 socket

5) exporting last line

netstat -nxp | grep 537283

unix 3 [ ] STREAM CONNECTED 537283 6314/mysql

6) ps -ef | grep 6314

root 6314 6313 0 08:20 pts/0 00:00:00 mysql -e show databases ; select sleep (5000);show databases;

root 6407 5233 0 08:39 pts/0 00:00:00 grep 6314

or another way

1) ps -ef | grep mysql

root 6314 6313 0 08:20 pts/0 00:00:00 mysql -e show databases ; select sleep (5000);show databases;

root 6317 6316 0 08:21 pts/0 00:00:00 mysql -e show databases ; select sleep (5000);show databases;

root 6416 5233 0 08:45 pts/0 00:00:00 grep mysql

root 12111 1 0 Aug05 ? 00:00:00 /bin/sh /usr/bin/mysqld_safe --datadir=/var/lib/mysql --socket=/var/lib/mysql/mysql.sock --pid-file=/var/run/mysqld/mysqld.pid --basedir=/usr --user=mysql

mysql 12213 12111 0 Aug05 ? 00:06:09 /usr/libexec/mysqld --basedir=/usr --datadir=/var/lib/mysql --user=mysql --log-error=/var/log/mysqld.log --pid-file=/var/run/mysqld/mysqld.pid --socket=/var/lib/mysql/mysql.sock

2) lsof -p 6314

COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME

mysql 6314 root cwd DIR 252,1 4096 398320 /sbin/custom

mysql 6314 root rtd DIR 252,1 4096 2 /

mysql 6314 root txt REG 252,1 338792 534138 /usr/bin/mysql

mysql 6314 root mem REG 252,1 65928 917954 /lib64/libnss_files-2.12.so

mysql 6314 root mem REG 252,1 122040 917830 /lib64/libselinux.so.1

mysql 6314 root mem REG 252,1 110960 917956 /lib64/libresolv-2.12.so

mysql 6314 root mem REG 252,1 10192 917652 /lib64/libkeyutils.so.1.3

mysql 6314 root mem REG 252,1 43728 917643 /lib64/libkrb5support.so.0.1

mysql 6314 root mem REG 252,1 174840 917756 /lib64/libk5crypto.so.3.1

mysql 6314 root mem REG 252,1 14664 917791 /lib64/libcom_err.so.2.1

mysql 6314 root mem REG 252,1 941920 917820 /lib64/libkrb5.so.3.3

mysql 6314 root mem REG 252,1 277704 917860 /lib64/libgssapi_krb5.so.2.2

mysql 6314 root mem REG 252,1 469528 917601 /lib64/libfreebl3.so

mysql 6314 root mem REG 252,1 19536 917950 /lib64/libdl-2.12.so

mysql 6314 root mem REG 252,1 135896 917586 /lib64/libtinfo.so.5.7

mysql 6314 root mem REG 252,1 1921216 917757 /lib64/libc-2.12.so

mysql 6314 root mem REG 252,1 90880 917610 /lib64/libgcc_s-4.4.7-20120601.so.1

mysql 6314 root mem REG 252,1 596264 917951 /lib64/libm-2.12.so

mysql 6314 root mem REG 252,1 987096 532633 /usr/lib64/libstdc++.so.6.0.13

mysql 6314 root mem REG 252,1 88600 917802 /lib64/libz.so.1.2.3

mysql 6314 root mem REG 252,1 1950976 532613 /usr/lib64/libcrypto.so.1.0.1e

mysql 6314 root mem REG 252,1 441112 532706 /usr/lib64/libssl.so.1.0.1e

mysql 6314 root mem REG 252,1 113432 917952 /lib64/libnsl-2.12.so

mysql 6314 root mem REG 252,1 40400 917550 /lib64/libcrypt-2.12.so

mysql 6314 root mem REG 252,1 1584680 787479 /usr/lib64/mysql/libmysqlclient.so.16.0.0

mysql 6314 root mem REG 252,1 142640 917555 /lib64/libpthread-2.12.so

mysql 6314 root mem REG 252,1 193168 917841 /lib64/libncursesw.so.5.7

mysql 6314 root mem REG 252,1 154520 917608 /lib64/ld-2.12.so

mysql 6314 root 0u CHR 136,0 0t0 3 /dev/pts/0

mysql 6314 root 1u CHR 136,0 0t0 3 /dev/pts/0

mysql 6314 root 2u CHR 136,0 0t0 3 /dev/pts/0

mysql 6314 root 3u unix 0xffff88001d9859c0 0t0 537283 socket

3) ls -l /proc/6314/fd/3

lrwx------ 1 root root 64 Aug 20 08:33 /proc/6314/fd/3 -> socket:[537283]

4) netstat -nxp | grep 537283

unix 3 [ ] STREAM CONNECTED 537283 6314/mysql

example 3-strace appache

shows which files are being read from

Another useful command to trace appache process is

ps auxw | grep sbin/apache | awk '{print"-p " $2}' | xargs strace

or

to run a strace on all appache proceeses run

ps auxw | grep sbin/httpd | awk '{print"-p " $2}' | xargs strace

example 4-find more details on a process

after running the following batch file (date_sleep)

which contained "date; sleep 120 ;date" & which had a pid of 20018

strace -p 20018
Process 20018 attached - interrupt to quit
restart_syscall(<... resuming interrupted call ...>) = 0
close(1) = 0
close(2) = 0
exit_group(0) = ?
Process 20018 detached

lsof -p 20018
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
sleep 20018 root cwd DIR 252,0 4096 264168 /tmp/test_strace
sleep 20018 root rtd DIR 252,0 4096 2 /
sleep 20018 root txt REG 252,0 24264 393281 /bin/sleep
sleep 20018 root mem REG 252,0 99158576 532373 /usr/lib/locale/locale-archive
sleep 20018 root mem REG 252,0 1921176 917600 /lib64/libc-2.12.so
sleep 20018 root mem REG 252,0 154528 917650 /lib64/ld-2.12.so
sleep 20018 root 0u CHR 136,0 0t0 3 /dev/pts/0
sleep 20018 root 1u CHR 136,0 0t0 3 /dev/pts/0
sleep 20018 root 2u CHR 136,0 0t0 3 /dev/pts/0

notice under the FD column:
CWD = current working directory
rtd = root directory
txt = current running command

example 5-troubleshooting a slow connection

see Troubleshooting a Specific Url from on/off the network (using firebug/strace)

More reading

There is a lot of good info out there when you search Google some of the better pages...

http://www.brendangregg.com/blog/2014-05-11/strace-wow-much-syscall.html

http://mharrytemp.blogspot.co.uk/2011/10/using-strace-and-lsof-to-track-down.html

http://www.myhowto.org/solving-problems/7-exploring-system-internals-with-lsof-and-strace/

http://www.hokstad.com/5-simple-ways-to-troubleshoot-using-strace

http://chadfowler.com/blog/2014/01/26/the-magic-of-strace/

http://www.ibm.com/developerworks/aix/library/au-lsof.html

http://www.dedoimedo.com/computers/lsof.html

http://ph7spot.com/musings/leveraging-lsof

https://blogs.oracle.com/ksplice/entry/strace_the_sysadmin_s_microscope

http://endlessgeek.com/2014/03/strace-php-process-via-apache/

https://www.youtube.com/watch?v=eF-p--AH37E

strace appache

http://sysadmin.circularvale.com/debugging/using-strace-to-analyze-apache/