Troubleshooting approach

Overview
Example
Things to keep an eye on

Overview

Troubleshooting is a vital part of being a sysadmin.

As you gain experience and knowledge, your troubleshooting methodology evolves.

If you have been working on a problem for more than an hour, take a break and rethink what you do and don't know.

Example

Example: User A cannot get to a certain website. Here are some possible troubleshooting steps:

  • ask the user to try another site
  • check whether other servers/hosts can reach the website (if so, the router/switch is likely OK, narrowing the problem down to one host)
  • run ifconfig and check that eth0 is active (it should have a local IP, not a 169.254.x.x link-local address)
  • ping another internal IP, e.g. 192.168.4.4
  • ping an external DNS server, e.g. 8.8.8.8
  • dig google.com (if it returns an IP, DNS is working)
  • traceroute google.com
  • run the route command (check that a destination=default entry exists; it may come from DHCP or be statically entered)
  • check the local network (firewall/DHCP)
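
The steps above can be strung together in a small script. The following is only a sketch: the function names (is_link_local, check, diagnose) are made up for illustration, eth0, 192.168.4.4, 8.8.8.8 and google.com are the example values from the list, and it assumes a Linux host with ip, ping and dig installed.

```shell
#!/bin/sh
# Sketch only: function names are hypothetical, and the default values are
# the example ones from the list above.

# Succeed if the address is in 169.254.0.0/16 (link-local), which usually
# means the host did not get a DHCP lease.
is_link_local() {
    case "$1" in
        169.254.*) return 0 ;;
        *)         return 1 ;;
    esac
}

# Run one command quietly and print PASS or FAIL with a label.
check() {
    label="$1"
    shift
    if "$@" >/dev/null 2>&1; then
        echo "PASS: $label"
    else
        echo "FAIL: $label"
    fi
}

# Walk the checks from the local interface outwards.
diagnose() {
    iface="${1:-eth0}"
    internal_ip="${2:-192.168.4.4}"
    site="${3:-google.com}"

    # First IPv4 address on the interface, without the prefix length.
    addr=$(ip -4 -o addr show dev "$iface" 2>/dev/null | awk '{print $4}' | cut -d/ -f1 | head -n 1)
    if [ -z "$addr" ]; then
        echo "FAIL: $iface has no IPv4 address"
    elif is_link_local "$addr"; then
        echo "FAIL: $iface has link-local address $addr (no DHCP lease?)"
    else
        echo "PASS: $iface has address $addr"
    fi

    check "internal host $internal_ip answers ping" ping -c 1 -W 2 "$internal_ip"
    check "external IP 8.8.8.8 answers ping" ping -c 1 -W 2 8.8.8.8
    check "DNS resolves $site" sh -c "dig +short '$site' | grep -q ."
    check "default route exists" sh -c "ip route show default | grep -q ."
}
```

After sourcing the file, run diagnose with no arguments to use the example values, or pass an interface, an internal IP and a site name. The first FAIL in the output is usually the layer to look at first.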

Things to keep an eye on

  • DNS and networking issues can be intermittent and very hard to pin down. Typical symptoms:

logging in via ssh is slower than usual
mysql is slower than you would expect
Apache has issues loading pages when there are no other obvious problems
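
Because these problems come and go, a single lookup rarely catches them. One way to pin them down is to repeat the lookup and log only the slow or failed attempts. This is a rough sketch: the function names and the 200 ms threshold are assumptions chosen for illustration, and it relies on the ";; Query time: N msec" line that dig prints.

```shell
#!/bin/sh
# Sketch only: query_time_ms and watch_dns are hypothetical helper names.

# Read dig output on stdin and print the query time in milliseconds.
query_time_ms() {
    awk '/Query time:/ {print $4}' | head -n 1
}

# Look up a name once a second and report slow or failed lookups.
# Arguments: name, number of attempts (default 10), threshold in ms (default 200).
watch_dns() {
    name="$1"
    count="${2:-10}"
    threshold_ms="${3:-200}"
    i=0
    while [ "$i" -lt "$count" ]; do
        ms=$(dig "$name" 2>/dev/null | query_time_ms)
        if [ -z "$ms" ]; then
            echo "$(date '+%H:%M:%S') lookup of $name failed"
        elif [ "$ms" -gt "$threshold_ms" ]; then
            echo "$(date '+%H:%M:%S') slow lookup of $name: $ms ms"
        fi
        i=$((i + 1))
        sleep 1
    done
}
```

For example, watch_dns google.com 600 checks once a second for ten minutes. Comparing the logged timestamps with the times users reported slow ssh or mysql can show whether DNS is involved.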

  • weird IO issues on a vm

check there are no snapshots
check there is no memory ballooning
check no cpu ready issues

  • common things to check

check services are running
check for config errors
check firewall
check /var/log/secure for PAM auth issues
check SELinux
check for reboots
check dmesg
check for recent updates
check sar output for historical system activity
use last to check if someone has rebooted the box (and to see who was logged in)
check /var/log/secure for failed logins and anyone using sudo
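
Several of these checks are quick to script. The sketch below makes some assumptions not stated above: a systemd host (systemctl), sshd as the example service, and the usual sshd and sudo log line formats in /var/log/secure. The function names are hypothetical.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical; log path defaults to /var/log/secure.

# Count failed SSH password attempts in an auth log.
count_failed_logins() {
    grep -c 'Failed password' "${1:-/var/log/secure}"
}

# List the sudo commands recorded in the same log.
sudo_commands() {
    grep 'sudo:.*COMMAND=' "${1:-/var/log/secure}"
}

# Run a handful of the common checks for one service (default: sshd).
quick_checks() {
    svc="${1:-sshd}"
    echo "== service $svc =="
    systemctl is-active "$svc"
    echo "== SELinux mode =="
    getenforce
    echo "== recent reboots =="
    last reboot | head -n 5
    echo "== last kernel messages =="
    dmesg | tail -n 20
    echo "== failed logins =="
    count_failed_logins
}
```

Reading /var/log/secure and dmesg normally needs root, so run quick_checks with sudo or as root.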

  • issue: high load at odd times

check the pattern and timings
run iostat -x 1 to see which disk is being affected and find which disk is the cause
check what is using the disk with ps -ef (then lsof any processes you find), lsof +D on a directory (e.g. lsof +D /etc), or iotop if you have it
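
Picking the busiest disk out of iostat output by eye gets tedious, so the following sketch does it with awk. It assumes that %util is the last column of the device table, which is the case for the sysstat versions I am aware of but is worth checking on your system. The function name busiest_disk is made up for illustration.

```shell
#!/bin/sh
# Sketch only: busiest_disk is a hypothetical helper. It reads `iostat -x`
# output on stdin and assumes %util is the last column of each device row.
busiest_disk() {
    awk '
        BEGIN     { max = -1 }
        /^Device/ { in_devices = 1; next }   # header row starts a device table
        NF == 0   { in_devices = 0; next }   # blank line ends the table
        in_devices && $NF + 0 > max { max = $NF + 0; dev = $1 }
        END       { if (dev != "") print dev, max }
    '
}
```

A possible invocation is LC_ALL=C iostat -x 1 5 | busiest_disk, which prints the device name and its highest %util across the samples. Note that the first iostat report covers the time since boot, so a longer run gives a fairer picture.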