Appendix B: Diagnostics

  • root ssh keys not setup – If you are prompted for a password when ssh to the service node, then check to see if /root/.ssh directory on MN has authorized_keys file. If the directory does not exist or no keys, run xdsh service -K, to exchange the ssh keys for root. You will be prompted for the root password, which should be the password you set for the key=system in the passwd table.
  • XCAT rpms not on SN – On the SN, run rpm -qa | grep xCAT and make sure the appropriate xCAT rpms are installed on the servicenode. See the list of xCAT rpms in Diskful (Stateful) Installation. If rpms are missing, check your install setup as outlined in Diskless (Stateless) Installation for diskless or Diskful (Stateful) Installation for diskful installs.
  • otherpkgs(including xCAT rpms) installation failed on the SN – The OS repository is not created on the SN. When the “yum” command is processing the dependency, the rpm packages (including expect, nmap, and httpd, etc) required by xCATsn can’t be found. In this case, check whether the /install/postscripts/repos/<osver>/<arch>/ directory exists on the MN. If it is not on the MN, you need to re-run the copycds command, and there will be files created under the /install/postscripts/repos/<osver>/<arch> directory on the MN. Then, you need to re-install the SN.
  • Error finding the database/starting xcatd – If on the Service node when you run tabdump site, you get “Connection failure: IO::Socket::SSL: connect: Connection refused at /opt/xcat/lib/perl/xCAT/”. Then restart the xcatd daemon and see if it passes by running the command service xcatd restart. If it fails with the same error, then check to see if /etc/xcat/cfgloc file exists. It should exist and be the same as /etc/xcat/cfgloc on the MN. If it is not there, copy it from the MN to the SN. The run service xcatd restart. This indicates the servicenode postscripts did not complete successfully. Run lsdef <service node> -i postscripts -c and verify servicenode postscript appears on the list..
  • Error accessing database/starting xcatd credential failure– If you run tabdump site on the service node and get “Connection failure: IO::Socket::SSL: SSL connect attempt failed because of handshake problemserror:14094418:SSL routines:SSL3_READ_BYTES:tlsv1 alert unknown at /opt/xcat/lib/perl/xCAT/”, check /etc/xcat/cert. The directory should contain the files ca.pem and server-cred.pem. These were suppose to transfer from the MN /etc/xcat/cert directory during the install. Also check the /etc/xcat/ca directory. This directory should contain most files from the /etc/xcat/ca directory on the MN. You can manually copy them from the MN to the SN, recursively. This indicates the the servicenode postscripts did not complete successfully. Run lsdef <service node> -i postscripts -c and verify servicenode postscript appears on the list. Run service xcatd restart again and try the tabdump site again.
  • Missing ssh hostkeys – Check to see if /etc/xcat/hostkeys on the SN, has the same files as /etc/xcat/hostkeys on the MN. These are the ssh keys that will be installed on the compute nodes, so root can ssh between compute nodes without password prompting. If they are not there copy them from the MN to the SN. Again, these should have been setup by the servicenode postscripts.
  • Errors running hierarchical commands such as xdsh – xCAT has a number of commands that run hierarchically. That is, the commands are sent from xcatd on the management node to the correct service node xcatd, which in turn processes the command and sends the results back to xcatd on the management node. If a hierarchical command such as xcatd fails with something like “Error: Permission denied for request”, check /var/log/messages on the management node for errors. One error might be “Request matched no policy rule”. This may mean you will need to add policy table entries for your xCAT management node and service node.
  • /install is not mounted on service node from management mode – If service node does not have /install directory mounted from management node, run lsdef -t site clustersite -i installloc and verify installloc="/install"