Health monitor timing

shakedown1979 · July 2011

Background:
I have several Apache hosts (~60) on a single instance -- all of which need basic HTTP/HTTPS monitors. I have set the interval/retry/timeout values close to what our previous load balancers used. Here is an example of one of the new monitors:

health monitor M-HTTP interval 10 retry 1 timeout 4
strictly-retry-on-server-error-response
method http url GET /monitor/index.html expect "Web Server Running" host www.abc.com

Problem:
Hosts are being marked down much more frequently than on the previous system. The down cause reported by "show health stat" is 5 (HM_HTTP_TIMEOUT).

I have looked at several Wireshark traces, and the Apache host is definitely delaying its response after the AX sends a GET. The question is: why does it delay now when it did not on the previous load balancers?

Further traces show that the previous LB (big red ball) scheduled its monitors for the various hosts at different times, while the AX sends a burst of SYN packets to all the Apache hosts at once at each 10-second interval. My guess is that Apache is being overloaded by this change.
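
To make the difference between the two patterns concrete, here is a minimal Python sketch. It is only an illustration of burst versus staggered scheduling, not a model of how either load balancer actually schedules probes; the function names and the 0.4-second measurement window are assumptions chosen for the example.

```python
def burst_offsets(num_monitors, interval):
    """All monitors fire at the start of each interval (offset 0)."""
    return [0.0] * num_monitors


def staggered_offsets(num_monitors, interval):
    """Spread monitor start times evenly across the interval."""
    return [i * interval / num_monitors for i in range(num_monitors)]


def peak_concurrent(offsets, window):
    """Largest number of probes that start within any `window`-second span."""
    ordered = sorted(offsets)
    peak = 0
    start = 0
    for end in range(len(ordered)):
        # Shrink the span until it is shorter than `window`.
        while ordered[end] - ordered[start] >= window:
            start += 1
        peak = max(peak, end - start + 1)
    return peak


if __name__ == "__main__":
    # 60 monitors on a 10-second interval, as in the setup described above.
    print(peak_concurrent(burst_offsets(60, 10.0), 0.4))      # 60
    print(peak_concurrent(staggered_offsets(60, 10.0), 0.4))  # 3
```

With 60 monitors on a 10-second interval, the burst pattern opens all 60 connections at once, while even staggering starts at most 3 in any 0.4-second span.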

I will do more investigating, but I thought I would run this by the community to see if anyone else has seen something similar.

Thanks!

[Deleted User] · July 2011

Posted by ddesmidt

We do NOT send all our health check requests at the same time.
I've tested that with thousands of servers, and the AX was sending its health checks to the different servers at different times.

Now in your case:
. you configured the AX to run 60 health checks against the same server
. and not 1 health check against each of 60 different servers

In that case I don't know how we behave. I'll look at it and come back to you.
BTW for my info, did you create:
. a huge compound health check associated with your unique service group
. or 1 health check per host, each associated with 1 unique service group

Otherwise one important comment:
With your configuration, the AX will actually mark the server down after the very first failure (even though you configured one retry).
For some health check methods (I don't know the list, but from the tests done I would say most), the AX will mark the server down after the first failure.
To avoid this behavior and make the AX actually perform the configured retry, you have to use the "strictly retry" option (in CLI: strictly-retry-on-server-error-response).
This should avoid your false positives.

Dimitri

shakedown1979 · July 2011

Dimitri -- Thanks for the quick response!

You are correct: in our case, we send the monitor checks to several hosts on the SAME server. To answer your question, we have defined a separate monitor for each host. Also, "strictly-retry-on-server-error-response" is configured to ensure we retry 1 time. This is the same Apache architecture as with our previous LB (which behaved differently).

I am going to get together with our Apache group to see if I can get more details from their logs on what is causing the occasional delays.

Thanks!

shakedown1979 · July 2011

Figured I would bump this thread for anyone else trying to understand health checks. We have a large number of hosts configured on the same physical server. To prevent the problems described above, we had to rate-limit the number of health checks sent out at the same time. We went with 25 per 500 ms (see commands below).

health disable-auto-adjust
health check-rate 25

(config)#health ?
check-rate Define the Health Check Rate
disable-auto-adjust Disable the Health Check Rate Auto Adjustment
(config)#health check-rate ?
<1-50000> Health check rate per 500ms (default 1000)
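
For a rough sense of what this setting does to the burst, here is a small Python sketch of the arithmetic. It assumes the AX releases at most check-rate probes in each 500 ms window and carries the rest over to the next window; the thread does not confirm the exact scheduler behavior, and the function name is hypothetical.

```python
WINDOW_SECONDS = 0.5  # from the CLI help: the rate is per 500 ms


def batch_schedule(num_checks, check_rate):
    """Return (start_time_seconds, batch_size) for each 500 ms window.

    Assumption: up to `check_rate` probes go out per window and the
    remainder waits for the next one.
    """
    schedule = []
    remaining = num_checks
    window_index = 0
    while remaining > 0:
        batch = min(check_rate, remaining)
        schedule.append((window_index * WINDOW_SECONDS, batch))
        remaining -= batch
        window_index += 1
    return schedule


if __name__ == "__main__":
    # About 60 monitors with "health check-rate 25".
    print(batch_schedule(60, 25))    # [(0.0, 25), (0.5, 25), (1.0, 10)]
    # Same monitors at the default rate of 1000.
    print(batch_schedule(60, 1000))  # [(0.0, 60)]
```

Under that assumption, a rate of 25 spreads roughly 60 checks over three windows (about one second), whereas the default rate of 1000 lets all of them go out in a single window.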
