7/31/2008

HA errors after update to ESX 3.5 U2

Yesterday I updated my whole ESX farm to 3.5 U2 and suddenly encountered a strange one: a few minutes after the update, one of my test clusters went red, telling me the HA agent had an error. So I checked the communities and found that I was not alone; many people had the same issue. There are several guides online on how to fix it, but most of them didn't solve it for me, so here is how I solved it:
First, it seems like this is some kind of DNS issue. So check the hostname in the VI Client. Let's say it's esx1.domain.local.
Now open the console and check

/etc/hosts

and make sure the entry is exactly the same as the one shown in the VC; check especially for upper/lowercase mismatches. If your /etc/hosts shows ESX1.domain.local, change it to esx1.domain.local.
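The case check on /etc/hosts can also be done with a few lines of script instead of by eye. This is only a minimal sketch: the function name find_case_mismatches and the sample IP address are made up for illustration, and it assumes the usual hosts layout of an IP address followed by one or more names. It is written for Python 3, so you would run it against a copy of the file on your workstation.

```python
def find_case_mismatches(hosts_text, expected_fqdn):
    """Return (line_number, name) pairs for names that match expected_fqdn
    only when upper/lowercase is ignored."""
    mismatches = []
    for lineno, line in enumerate(hosts_text.splitlines(), start=1):
        # Drop comments, then split into: IP address, name, aliases...
        fields = line.split("#", 1)[0].split()
        for name in fields[1:]:
            if name != expected_fqdn and name.lower() == expected_fqdn.lower():
                mismatches.append((lineno, name))
    return mismatches


# Sample content; the IP address is a placeholder, not taken from a real host.
sample = """127.0.0.1 localhost.localdomain localhost
192.168.1.10 ESX1.domain.local esx1
"""

print(find_case_mismatches(sample, "esx1.domain.local"))
```

Any pair it prints is a line in /etc/hosts where the name differs from the VC name only in case, which is exactly the kind of entry to correct.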
Now, that's not all. Check

/etc/opt/vmware/aam/FT_HOSTS.

Your cluster members should be in this file, but only the first part of the DNS name: if your DNS name is esx1.domain.local, only esx1 should be in FT_HOSTS.
If there is any other entry or you are not sure, simply delete FT_HOSTS and reconfigure your cluster. Reboot the ESX hosts.
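To make the "only the first part of the DNS name" rule concrete, here is a small sketch in Python 3. The helper names short_name and unexpected_entries are hypothetical, and the code deliberately does not parse FT_HOSTS itself, because I am not assuming anything about the file's exact layout; you pass in the names you see in the file as a plain list.

```python
def short_name(fqdn):
    """Return the part of a DNS name before the first dot."""
    return fqdn.split(".", 1)[0]


def unexpected_entries(entries, cluster_fqdns):
    """Return the entries that are not the short name of any cluster member.

    The comparison is case-sensitive on purpose, since case mismatches
    are part of the problem described above.
    """
    expected = {short_name(fqdn) for fqdn in cluster_fqdns}
    return [entry for entry in entries if entry not in expected]


cluster = ["esx1.domain.local", "esx2.domain.local"]
# Names as read from FT_HOSTS by hand (example values only).
seen_in_file = ["esx1", "ESX2.domain.local"]

print(short_name("esx1.domain.local"))
print(unexpected_entries(seen_in_file, cluster))
```

If the second call returns anything, that is the "any other entry" case, and deleting FT_HOSTS and reconfiguring the cluster is the simple way out.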

Now: mostly these steps seemed to solve the problem, but not for my test lab. The next day I encountered the error again. Here is what I did in addition, which seemed to finally solve the problem.

Put your ESX hosts in maintenance mode, remove them from the cluster, delete the cluster and create a new one with a different name. Take your ESX hosts out of maintenance mode and add them to the new cluster. Finally, to be 100% sure, right-click each host and reconfigure HA. That whole bunch solved the problem for me (I have had my eyes on it for a few hours now).



5 comments:

Anonymous said...

Hi, I have updated my hosts to ESX 3.5 U2 but I cannot find the /etc/opt/vmware/aam directory anymore. I do have /opt/vmware/aam but that does not contain the FT_HOSTS file.

I removed the host from the cluster and added it again, but still no FT_HOSTS file.

Do you think VMware has changed the HA agent with U2?

I still have some hosts that are on build 64607 and they have the /etc/opt/vmware/aam directory and the ft_hosts file.

With kind regards,

Jos Rosiau

Joerg Riether said...

Dear Jos,
I am not certain whether a fresh install of 3.5 U2 leaves out /etc/opt/vmware/aam completely, as all my machines were updated from 3.5 U1. Therefore you could be right, and it could be that this has changed with a fresh 3.5 U2 install.

best regards
Joerg

Anonymous said...

Well, there must have been something really messed up on my systems.

I disabled HA on the cluster and then re-enabled it.

Now I do have the /etc/opt/vmware/aam and FT_HOSTS file on the ESX 3.5 U2 systems.

I have to see whether they keep on functioning, but so far they are.

VMware ESX works in mysterious ways.

With kind regards,

Jos Rossiau

Nagendra Vaidya said...

I had similar issues too. All I did was update the Virtual Center from 2.5 Update 1 to 2.5 Update 2, and after the reboot of the Virtual Center Server, the ESX HA cluster broke. I did not apply any patches to the ESX hosts in the cluster, and still they were broken. I was lucky to get them back by reconfiguring HA.

Anonymous said...

I had the same problem after the upgrade to 3.5 U2. When I compared the Security Profile rules for both cluster nodes, I found that on the problematic node the "aam" ports were not allowed. I added the "aam" rule, and it looks like that solved the issue in my case.