Varnish and saint mode: "no backend connection"
Recently we implemented some quite sophisticated caching mechanisms using Varnish. In addition to plain caching, backend errors and invalidated pages had to be handled gracefully whenever possible, so we implemented grace mode, saint mode and backend probes.
Everything worked great during development, but then we did some real performance testing. The results gave us quite a headache, because from time to time we noticed the following in varnishlog:
   11 FetchError   c no backend connection
This error appears when (obviously) no backend is available or healthy. The problem: the health checks at that time reported that the backends were alive. Googling didn't help, as this error didn't seem to be one of the standard issues. So I finally found myself digging in the C code, and after a while I found out what was going on.
The reason we got this error was a combination of a backend that was not yet finished and the (poorly documented) saint mode:
We were still getting quite a number of 500 responses from the backends because of database inconsistencies and general errors. Saint mode maintains a blacklist of URLs per backend. When searching for a backend to handle a request, Varnish first checks this list to see whether the URL is blacklisted for specific backends, and it will only send the request to a backend that is not blacklisted for the given URL. This is documented.
The undocumented bit: In order to keep the blacklist compact, Varnish saint mode will blacklist a complete backend server silently (!) once a specific number of blacklisted URLs for this backend is exceeded.
If you get frequent 500s from different URLs of all your backends, all will be marked unhealthy over time, resulting in a "no backend connection" error for subsequent requests.
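To make the mechanism concrete, here is a small Python model of it. This is only a simplified sketch, not Varnish's actual C code: the class and function names are made up for illustration, and treating a backend as unhealthy once the number of live entries reaches the threshold is an assumption about the exact comparison.

```python
class SaintModeBackend:
    """Toy model of one backend with a saint-mode URL blacklist (hypothetical)."""

    def __init__(self, name, threshold):
        self.name = name
        self.threshold = threshold  # plays the role of saintmode_threshold
        self.blacklist = {}         # url -> time at which the entry expires

    def mark_trouble(self, url, now, ttl):
        # Called when this backend answered `url` with an error (e.g. a 500).
        self.blacklist[url] = now + ttl

    def is_healthy_for(self, url, now):
        # Drop entries whose saint-mode period is over.
        self.blacklist = {u: t for u, t in self.blacklist.items() if t > now}
        if url in self.blacklist:
            return False  # the documented part: this URL is blacklisted here
        if len(self.blacklist) >= self.threshold:
            return False  # the undocumented part: too many entries, so the
                          # whole backend counts as sick (assumed comparison)
        return True


def pick_backend(backends, url, now):
    # Return the first usable backend, or None ("no backend connection").
    for backend in backends:
        if backend.is_healthy_for(url, now):
            return backend
    return None


def demo():
    backends = [SaintModeBackend("web1", 3), SaintModeBackend("web2", 3)]
    # Three different URLs fail on both backends.
    for url in ["/a", "/b", "/c"]:
        for backend in backends:
            backend.mark_trouble(url, now=0, ttl=10)
    # "/d" never failed anywhere, yet no backend will take it.
    return pick_backend(backends, "/d", now=1)
```

In this model, `demo()` returns `None` for a URL that never produced an error, which mirrors what we saw: probes report healthy backends, yet requests fail. Once the entries expire, the backends become eligible again.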
The quick fix was to raise the URL blacklist size (saintmode_threshold). Since this results in longer URL blacklists, and thus more memory consumption and longer lookup times, it is not a sustainable solution for production systems. The real solution was to fix all the errors in the backend.