Troubleshooting the SP2


This document explains how to deal with some of the more common SP2 problems.

Restarting the switch

The switch needs to be restarted if one or more nodes are not attached to it, for example after a node has been rebooted. The switch status of all nodes can be checked from the control workstation (sp2-cw) with:

/p/sp2/bin/switch_stat

"switch_responds" should be 1 for all nodes. If it is not, and all nodes are up, then the switch needs to be restarted with:

/usr/lpp/ssp/bin/Estart

Note: You'll need to log in directly as root to restart the switch. "su" won't work, since it does not give you the Kerberos 4 tickets required for the other nodes.
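The status check above can be sketched as a small filter that lists the nodes whose "switch_responds" value is not 1. The output format of switch_stat is not shown in this document, so the sketch assumes a hypothetical layout of one line per node with the node identifier in the first column and the switch_responds value in the second. The function name nodes_off_switch is also made up for illustration.

```shell
#!/bin/sh
# Sketch only. ASSUMPTION: switch_stat prints one line per node as
#   <node> <switch_responds>
# The real output format is not documented here; adjust the column
# numbers to match what the command actually prints.

nodes_off_switch() {
    # Read status lines on stdin and print the first column of every
    # line whose second column is a number other than 1.
    # Non-numeric lines (for example a header) are skipped.
    awk 'NF >= 2 && $2 ~ /^[0-9]+$/ && $2 != 1 { print $1 }'
}
```

Under that assumption, it would be used as: /p/sp2/bin/switch_stat | nodes_off_switch. If nothing is printed, every node is attached to the switch.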

If the Estart command complains that the primary node does not have a fault service worm running, you will need to do this first:

/usr/lpp/ssp/bin/pexec 1 /usr/lpp/ssp/css/rc.switch

Then retry the Estart command, as above.
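The restart sequence described in this section could be scripted roughly as follows. This is a sketch under assumptions: it assumes Estart returns a nonzero exit status on failure and that its complaint about the fault service worm contains the word "worm". Neither is confirmed by this document. The function name restart_switch is hypothetical; the command paths are the ones given above.

```shell
#!/bin/sh
# Sketch only. Paths are taken from this document. ASSUMPTIONS: Estart
# exits nonzero on failure, and its fault-service-worm complaint
# contains the word "worm". Verify both before relying on this.
ESTART=${ESTART:-/usr/lpp/ssp/bin/Estart}
PEXEC=${PEXEC:-/usr/lpp/ssp/bin/pexec}
RC_SWITCH=${RC_SWITCH:-/usr/lpp/ssp/css/rc.switch}

restart_switch() {
    # First attempt: capture stdout and stderr so the message can be inspected.
    if out=$($ESTART 2>&1); then
        echo "$out"
        return 0
    fi
    echo "$out"
    case $out in
        *[Ww]orm*)
            # Run rc.switch via pexec exactly as shown above, then retry Estart.
            $PEXEC 1 "$RC_SWITCH" || return 1
            $ESTART
            ;;
        *)
            # Any other failure is left for a human to diagnose.
            return 1
            ;;
    esac
}
```

As with the manual procedure, this has to be run from a real root login so that the Kerberos tickets are available.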

Rebooting nodes

If a node does not respond, or is otherwise behaving irrationally, it might need to be rebooted. To reboot one or more nodes, run one of the following commands as root on the control workstation (sp2-cw):

/usr/lpp/ssp/bin/cstartup sp2-04 ...

Reboots the given nodes; this only works for nodes that are down. A node should be back up within about 5 minutes. If it does not come up, try the cshutdown command (see below).

/usr/lpp/ssp/bin/cstartup -Z sp2-04 ...

Shuts down, then reboots, nodes that are currently running.

/usr/lpp/ssp/bin/cshutdown -r sp2-04 ... 

Shuts down and reboots the given nodes. A node typically takes about 5 to 10 minutes to come back up. This command powers off the node after shutdown, so it should work even in cases where the node won't respond to cstartup.

Notes:
- Several node names can be specified to reboot several nodes concurrently.
- The cstartup and cshutdown commands wait until all the nodes are completely up before returning control to the user.
- After the nodes are back up, the switch will need to be restarted for those nodes to rejoin the switch.
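Since the reboot commands return only once the nodes are up, a reboot followed by a switch restart can be chained in one step. The sketch below uses the cshutdown -r form and the Estart path from this document. The function name reboot_and_rejoin is hypothetical, and it assumes that cshutdown returns a nonzero exit status on failure.

```shell
#!/bin/sh
# Sketch only. Paths and the -r flag are taken from this document.
# ASSUMPTION: cshutdown exits nonzero when the reboot fails.
CSHUTDOWN=${CSHUTDOWN:-/usr/lpp/ssp/bin/cshutdown}
ESTART=${ESTART:-/usr/lpp/ssp/bin/Estart}

reboot_and_rejoin() {
    # Require at least one node name.
    if [ $# -eq 0 ]; then
        echo "usage: reboot_and_rejoin node ..." >&2
        return 2
    fi
    # Shut down and reboot all the given nodes; per the notes above this
    # returns only once the nodes are completely up.
    $CSHUTDOWN -r "$@" || return 1
    # Restart the switch so the rebooted nodes rejoin it.
    $ESTART
}
```

For example: reboot_and_rejoin sp2-04 sp2-05, run as root on sp2-cw.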

Checking if a node is up

Standard methods (ping, etc.) can be used to check the status of a node. You might also want to try logging in to the suspect node, since a node can be hung and still respond to ping.

/p/sp2/bin/node_stat

This command will show you the SP2's view of which nodes are up or down.
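The two manual checks (ping, then a login attempt) can be combined in a short function. This is a sketch under assumptions: the function name node_state and its three result words are made up, and ssh is used as the remote login command only as a stand-in for whatever remote shell the site actually uses. BatchMode and ConnectTimeout are standard OpenSSH options, but the timeout covers only the connection phase, so a node that hangs after accepting the connection could still block the check.

```shell
#!/bin/sh
# Sketch only. ASSUMPTION: ssh stands in for the site's remote login
# command; override REMOTE to use something else.
PING=${PING:-ping}
REMOTE=${REMOTE:-ssh -o BatchMode=yes -o ConnectTimeout=5}

node_state() {
    host=$1
    # One ping packet; a node that does not answer is reported as down.
    if ! $PING -c 1 "$host" >/dev/null 2>&1; then
        echo down
        return
    fi
    # The node answers ping; check that it can also run a trivial command.
    if $REMOTE "$host" true >/dev/null 2>&1; then
        echo up
    else
        # Responds to ping but not to a login: possibly hung.
        echo no-login
    fi
}
```

For example, node_state sp2-04 prints one of up, down, or no-login. A no-login result is the case described above, where the node is probably hung and may need a reboot.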


Last updated Tue Jan 28 10:55:33 CST 1997 by Marc Dionne
dionne@cs.wisc.edu