optionalXtra Batteries not included

12 Jan 2012

Exalogic Compute Nodes Hanging On NFS Mount During Boot

I am currently commissioning an eight-node (quarter rack) Exalogic environment for a customer. Although the problem is not serious, the compute nodes were taking a long time to complete a reboot or power cycle. Watching the boot process through the ILOM console of the X4170 M2, I noticed that Oracle Enterprise Linux (OEL) stalled for quite some time while trying to mount the NFS shares, and that the mounts eventually timed out.

Within the Exalogic system, NFS shares are mounted from the 7320 ZFS Storage Appliance. The compute nodes should mount these shares over the InfiniBand private interconnect, which is usually on bond0.

A closer look at the boot log revealed that the netfs script, which is responsible for bringing up the NFS mounts, eventually failed because there was no response from the storage appliance. As mentioned earlier, if the initial Exalogic configuration has been completed by ACS, the compute nodes should be mounting NFS over InfiniBand. It would seem, however, that the InfiniBand framework had not yet been initialized at that point.

After trawling through the init scripts in /etc/rc3.d, I noticed that the openibd script, which is responsible for starting the InfiniBand framework, was linked to start after netfs, which has a start priority of 25. Closer inspection of openibd, however, revealed that the start priority declared in the script itself was 05, not the value it was being linked at. The culprit was the script's Required-Start meta key, which imposed a dependency on $localfs.
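The comparison described above can be sketched as a pair of small shell helpers. This is only an illustration: the function names are mine, and it assumes the init script carries the usual "# chkconfig: <runlevels> <start> <stop>" header and that the runlevel directory uses SNNname links.

```shell
#!/bin/sh
# Hypothetical helpers, not part of Exalogic or OFED.

# Print the start priority declared in an init script's "# chkconfig:" header.
declared_start_priority() {
    sed -n 's/^#[[:space:]]*chkconfig:[[:space:]]*[^[:space:]]*[[:space:]]*\([0-9]*\).*/\1/p' "$1" | head -n 1
}

# Print the priority encoded in the S-link(s) of a service in a runlevel directory.
linked_start_priority() {
    rcdir=$1
    name=$2
    for link in "$rcdir"/S[0-9][0-9]"$name"; do
        # Skip the unexpanded pattern when nothing matches; accept dangling links.
        [ -e "$link" ] || [ -L "$link" ] || continue
        prio=${link##*/}      # strip the directory part
        prio=${prio#S}        # strip the leading "S"
        prio=${prio%"$name"}  # strip the service name, leaving the number
        echo "$prio"
    done
}

# Only runs on a host that actually has the openibd script installed.
if [ -r /etc/init.d/openibd ]; then
    echo "openibd declares: $(declared_start_priority /etc/init.d/openibd)"
    echo "openibd linked:   $(linked_start_priority /etc/rc3.d openibd)"
    echo "netfs linked:     $(linked_start_priority /etc/rc3.d netfs)"
fi
```

If the linked number for openibd is higher than the one for netfs while the declared number is lower, the node is probably affected by the ordering problem described here.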

A quick grep of the scripts in /etc/init.d reveals that $localfs is provided by none other than netfs. The InfiniBand startup is therefore delayed until netfs has made $localfs available, and netfs in turn fails because the network interfaces it needs are not up yet. This happens because chkconfig on Red Hat and OEL calculates the actual start priority from the dependencies defined in the scripts' meta keys.
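The "quick grep" can be made a little more targeted by looking only at the header lines that matter. The helper below is a sketch with a made-up name (who_mentions); it takes the header key and the facility as literal strings. Note that the LSB header convention spells the facility $local_fs, so it is worth checking both spellings on a real node.

```shell
#!/bin/sh
# Hypothetical helper: list the init scripts in DIR (default /etc/init.d)
# whose "# KEY:" header line mentions FACILITY as a literal string.
who_mentions() {
    key=$1
    facility=$2
    dir=${3:-/etc/init.d}
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        # First grep selects the header line, second matches the facility literally.
        if grep -E "^#[[:space:]]*${key}:" "$f" | grep -q -F -- "$facility"; then
            echo "${f##*/}"
        fi
    done
}

# Only runs on a host that has the netfs init script.
if [ -r /etc/init.d/netfs ]; then
    for fac in '$local_fs' '$localfs'; do
        echo "Provides $fac:"
        who_mentions Provides "$fac"
        echo "Required-Start $fac:"
        who_mentions Required-Start "$fac"
    done
fi
```

On an affected node one would expect netfs in the first list and openibd in the second.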

In my opinion netfs should not be providing $localfs, but this has been a contentious issue within Red Hat development for some time. See this bug ID from Red Hat.

The workaround is quite simple, however. Remove the dependency on $localfs from the /etc/init.d/openibd script and reload the startup configuration with the commands below:


chkconfig --del openibd

chkconfig --del mlx4_vnic_confd

chkconfig --list openibd

chkconfig --list mlx4_vnic_confd

chkconfig --add openibd

chkconfig --add mlx4_vnic_confd

chkconfig --list openibd

chkconfig --list mlx4_vnic_confd

ls -l /etc/rc3.d/ | grep openibd

ls -l /etc/rc3.d/ | grep mlx4_vnic_conf
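The manual edit of openibd that goes with the commands above could also be scripted. The following is a minimal sketch, not a tested procedure for Exalogic: the function name is hypothetical, it rewrites only the "# Required-Start:" line, and it assumes an empty Required-Start line is acceptable if the facility was the only entry. A backup copy is kept next to the script.

```shell
#!/bin/sh
# Hypothetical helper: drop FACILITY from the "# Required-Start:" line of SCRIPT.
# The original file is preserved as SCRIPT.bak.
strip_required_start() {
    script=$1
    facility=$2
    cp -p "$script" "$script.bak" || return 1
    # Rebuild the Required-Start line without the unwanted facility;
    # every other line is copied through unchanged.
    awk -v fac="$facility" '
        /^#[ \t]*Required-Start:/ {
            line = ""
            for (i = 1; i <= NF; i++)
                if ($i != fac)
                    line = (line == "" ? $i : line " " $i)
            print line
            next
        }
        { print }
    ' "$script.bak" > "$script"
}

# Example (run as root, before "chkconfig --add openibd"); pass the facility
# exactly as it is spelled in the script header:
#   strip_required_start /etc/init.d/openibd '$local_fs'
```

Writing back into the existing file rather than replacing it keeps the script's permissions, so it stays executable.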

Happy hacking!

FINAL NOTE: If the compute nodes do not hang during boot when trying to mount the NFS shares, double-check the IP address or hostname used for the 7320 appliance in /etc/fstab to make sure that the compute nodes are not mounting NFS over the Ethernet interfaces of the nodes and the storage appliance.
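One quick way to do that check is to list the server part of every NFS entry in /etc/fstab and compare it with the appliance's InfiniBand address. The sketch below uses a hypothetical function name and assumes plain "server:/export" sources (no bracketed IPv6 addresses).

```shell
#!/bin/sh
# Hypothetical helper: print "server mountpoint" for each NFS entry in an
# fstab-style file (default /etc/fstab), skipping comments and other filesystems.
nfs_sources() {
    awk '$1 !~ /^#/ && ($3 == "nfs" || $3 == "nfs4") {
        split($1, src, ":")   # src[1] is the server, src[2] the export path
        print src[1], $2
    }' "${1:-/etc/fstab}"
}

# Example:
#   nfs_sources            # inspect /etc/fstab on the compute node
#   nfs_sources /tmp/copy  # or inspect a copy of the file
```

If the server column shows the appliance's Ethernet address or hostname rather than the one on the InfiniBand network, the mounts are not using the private interconnect.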