optionalXtra Batteries not included


Exalogic Compute Nodes Hanging On NFS Mount During Boot

I am currently commissioning an eight-node (quarter rack) Exalogic environment for a customer. Although it is not a serious problem, the compute nodes seemed to take a long time to complete a reboot or power cycle. Watching the boot process via the iLOM console of the X4170M2, I noticed that Oracle Enterprise Linux (OEL) got stuck for quite some time while trying to mount NFS shares, and that the mounts eventually timed out.

Within the Exalogic system, NFS shares are mounted from the 7320 ZFS Storage Appliance. The compute nodes should mount these NFS shares over the Infiniband private interconnect, which is usually located on BOND0.

A closer look at the boot log revealed that the netfs script, which is responsible for bringing up the NFS mounts, eventually failed because there was no response from the storage appliance. As mentioned earlier, if the Exalogic initial configuration has been completed by ACS, then the compute nodes should be mounting NFS over Infiniband. However, it would seem that the Infiniband framework had not yet been initialized at that point.

After trawling through the init scripts in /etc/rc3.d, I noticed that the openibd script, which is responsible for starting up the Infiniband framework, was linked to start later than netfs, which has a start priority of 25. Closer inspection of openibd, however, revealed that the start priority declared in the script itself was 05, not the value at which it was being linked. The culprit was the script's Required-Start meta key, which imposed a dependency on $localfs.
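The mismatch between the declared and the linked priority can be checked without reading each file by hand. The following is only a rough sketch: the function name show_priority is made up for illustration, and it assumes the usual "# chkconfig: <runlevels> <start> <stop>" header and S<NN><name> link naming.

```shell
#!/bin/sh
# Hypothetical helper (not part of chkconfig): print the start priority an
# init script declares in its "# chkconfig:" header next to the priority
# encoded in its start link for one runlevel.
# Usage: show_priority SERVICE [RC_DIR] [INIT_DIR]
show_priority() {
    svc=$1
    rcdir=${2:-/etc/rc3.d}
    initdir=${3:-/etc/init.d}

    # Header looks like "# chkconfig: 2345 05 95"; strip the key, then the
    # second remaining field is the declared start priority.
    declared=$(awk '/^#[ \t]*chkconfig:/ {
        sub(/^#[ \t]*chkconfig:[ \t]*/, "")
        print $2
        exit
    }' "$initdir/$svc")

    # Start links are named S<NN><service>; extract the two digits.
    linked=$(ls "$rcdir" | sed -n "s/^S\([0-9][0-9]\)$svc\$/\1/p")

    echo "$svc declared=${declared:-none} linked=${linked:-none}"
}
```

Running show_priority openibd and then show_priority netfs on a compute node should show whether openibd is linked later than the priority it declares.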

A quick grep of the scripts in /etc/init.d reveals that $localfs is provided by none other than netfs. The Infiniband framework startup is therefore delayed until $localfs is available via netfs, which in turn fails because the required network interfaces are not available yet. The link priority differs from the declared one because chkconfig on Red Hat and OEL calculates the actual startup priority from the dependencies defined in the meta keys of the scripts.
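The "quick grep" could look something like the sketch below. The function name providers_of is invented for this example, and it assumes the scripts carry LSB-style "# Provides:" header lines.

```shell
#!/bin/sh
# Hypothetical helper: list init scripts whose "# Provides:" header line
# mentions a given facility.
# Usage: providers_of FACILITY [INIT_DIR]
providers_of() {
    facility=$1
    initdir=${2:-/etc/init.d}
    for f in "$initdir"/*; do
        [ -f "$f" ] || continue
        # index() is a literal substring match, so the leading "$" in the
        # facility name needs no regex escaping.
        awk -v fac="$facility" '
            /^#[ \t]*Provides:/ && index($0, fac) { found = 1 }
            END { exit !found }
        ' "$f" && echo "$f"
    done
}
```

Call it with the facility in single quotes so the shell does not expand it, for example providers_of '$localfs'. If that prints nothing, try the spelling '$local_fs', which is how the LSB init script convention writes the facility name.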

In my opinion, netfs should not be providing $localfs, but this has been a contentious issue within Red Hat development for some time. See this bug ID from Red Hat.

The workaround, however, is quite simple: remove the dependency on $localfs from the /etc/init.d/openibd script, then reload the startup configuration with the commands below:

chkconfig --del openibd

chkconfig --del mlx4_vnic_confd

chkconfig --list openibd

chkconfig --list mlx4_vnic_confd

chkconfig --add openibd

chkconfig --add mlx4_vnic_confd

chkconfig --list openibd

chkconfig --list mlx4_vnic_confd

ls -l /etc/rc3.d/ | grep openibd

ls -l /etc/rc3.d/ | grep mlx4_vnic_conf
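The manual edit of /etc/init.d/openibd that precedes the chkconfig commands above could also be scripted. This is a sketch only: drop_required_start is a made-up name, the default backup location under /var/tmp is an arbitrary choice, and it assumes the dependency appears as a whitespace-separated word on the "# Required-Start:" line.

```shell
#!/bin/sh
# Hypothetical helper: remove one facility from the "Required-Start" header
# line of an init script, keeping a copy of the original elsewhere.
# Usage: drop_required_start SCRIPT FACILITY [BACKUP_PATH]
drop_required_start() {
    script=$1
    facility=$2
    # Assumption: keep the backup outside /etc/init.d so it is never
    # mistaken for a service script.
    backup=${3:-/var/tmp/$(basename "$script").bak}

    cp -p "$script" "$backup" || return 1

    # Rebuild only the Required-Start line without the named facility;
    # every other line is copied through unchanged. Writing through the
    # redirect keeps the original file's mode (including the execute bit).
    awk -v fac="$facility" '
        /^#[ \t]*Required-Start:/ {
            line = ""
            for (i = 1; i <= NF; i++)
                if ($i != fac)
                    line = line (line == "" ? "" : " ") $i
            print line
            next
        }
        { print }
    ' "$backup" > "$script"
}
```

For example: drop_required_start /etc/init.d/openibd '$localfs' (single quotes again, so the shell leaves the "$" alone). The chkconfig --del and --add commands are still needed afterwards, so that the links are recalculated.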

Happy hacking!

FINAL NOTE: If your compute nodes do not hang during boot when mounting NFS shares, that is not necessarily good news. Double-check the IP address or hostname used for the 7320 appliance in /etc/fstab to make sure that the compute nodes are not mounting NFS via the Ethernet interfaces on the nodes and the storage appliance instead of over Infiniband.
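To make that check a little quicker, the NFS entries can be pulled out of /etc/fstab as in the rough sketch below. The function name nfs_servers is invented here, and it assumes the standard fstab layout with "server:/export" in the first field.

```shell
#!/bin/sh
# Hypothetical helper: print "server mountpoint" for every NFS entry in an
# fstab-style file, so the server addresses can be compared against the
# Infiniband and Ethernet networks.
# Usage: nfs_servers [FSTAB]
nfs_servers() {
    fstab=${1:-/etc/fstab}
    # Skip comment lines; match filesystem types "nfs" and "nfs4".
    awk '$1 !~ /^#/ && $3 ~ /^nfs4?$/ {
        split($1, dev, ":")      # dev[1] is the server, dev[2] the export
        print dev[1], $2
    }' "$fstab"
}
```

Each server printed should be the appliance's address or hostname on the Infiniband network rather than its Ethernet one.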


Vodacom BlackBerry Internet Service cap introduced

22 Sept 2011 Update: It seems that Vodacom has backtracked on this for now...

Vodacom BlackBerry Internet Service cap introduced « Cellular « MyBroadband Tech and IT News.

Right, so I thought the whole idea of running a mobile communications business was to keep on winning customers or, in marketing speak, acquiring them.

"Uncapped" or unlimited browsing on the BlackBerry Internet Service has probably been the single most attractive reason to acquire a BlackBerry device in South Africa.

Now perhaps I missed a memo somewhere, but this will most probably end up hurting Vodacom in the long run. Are they killing the goose that laid the golden eggs, or are they inviting their competitors to come and poach it?