
guam drove complete cyrus replicated murder system to death
Closed, Resolved (Public)

Description

yum list installed guam
Loaded plugins: product-id, rhnplugin, search-disabled-repos, subscription-manager
This system is receiving updates from RHN Classic or Red Hat Satellite.
Installed Packages
guam.x86_64 0.8.3-1.1.el6.kolab_14 @kolab-14-extras-audit

After migrating 56636 mail accounts and putting them into production on Sunday, 31.07.2016, we expected the big user rush on Monday, 01.08.2016, from around 08:00 am to 10:00 am.
As users trickled in from 07:00 am onwards, the response times of simple LIST requests grew to minutes for just a couple of folders. At some point the system stopped responding altogether, and guam, cyrus and the clients started to drop sessions and listeners, either with timeouts or with connections closed unexpectedly by the remote host.
Since we are not able to read and understand the guam logs, we are sending them and the complete mail logs to you so that you can have a look and maybe find out what went wrong here. We stabilized the system by deactivating guam and reconfiguring everything to use cyrus directly. Additionally, we changed the cyrus db format from twoskip to skiplist (due to a bug in cyrus 2.5.8 with twoskip) and raised "nofiles" to 2000000 on the frontend and backend servers.

Here are the logs:


Details

Ticket Type
Task

Event Timeline

petersen added projects: Guam, Kolab Enterprise 14, Restricted Project. Aug 2 2016, 10:20 AM
petersen raised the priority of this task from 60 to High.
petersen added subscribers: vanmeeuwen, seigo.
petersen added a subscriber: petersen.
greve added a subscriber: greve. Aug 2 2016, 10:41 AM

/var/log/guam/console.log:2016-08-01 09:24:47.545 [error] <0.28149.319> CRASH REPORT Process <0.28149.319> with 0 neighbours exited with reason: no match of right hand value {error,emfile} in kolab_guam_session:post_accept_bookkeeping/4 line 143 in gen_server:terminate/7 line 826

It is running out of filehandles (as defined in asm-generic/errno-base.h: #define EMFILE 24 /* Too many open files */)
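As a quick check of which limit is actually in effect, the following minimal Python sketch reads the soft and hard open-file limits and prints the EMFILE error code. Python is used here for illustration only (Guam itself is written in Erlang), and `describe_nofile_limit` is a hypothetical helper name. Note that it reports the limit of the process running the script, which is not necessarily the limit the Guam or Cyrus processes were started with.

```python
import errno
import os
import resource


def describe_nofile_limit():
    """Return the (soft, hard) open-file limits of the current process."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft, hard


if __name__ == "__main__":
    soft, hard = describe_nofile_limit()
    # EMFILE is the per-process "too many open files" error seen in the crash report.
    print(f"EMFILE = {errno.EMFILE} ({os.strerror(errno.EMFILE)})")
    print(f"soft limit = {soft}, hard limit = {hard}")
```

Running it as the same user and from the same kind of shell that starts the service gives a first hint, but a service started by an init script may still carry different limits.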

For each connection, at least 2 extra filehandles are required since Guam is a proxy: one for the client to Guam, one for Guam to Cyrus (which gets opened on both sides of that connection, so if Cyrus and Guam are on the same machine, each connection takes a minimum of 3 filehandles). This is of course in addition to whatever normal files are opened by e.g. cyrus to access mail in response to IMAP commands.
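The arithmetic above can be sketched as a small Python function. This is only a lower bound under the assumptions stated in this comment (one client socket and one backend socket per session); `estimate_min_fds` is a hypothetical name and the connection count in the example is an arbitrary illustrative figure, not a measurement from this system.

```python
def estimate_min_fds(connections):
    """Lower bound on socket descriptors for `connections` proxied IMAP sessions.

    Counts only the sockets of the proxy path. Files that Cyrus opens to
    serve mail, log files and so on come on top of this.
    """
    guam = 2 * connections   # client<->Guam socket plus Guam end of the Guam<->Cyrus socket
    cyrus = 1 * connections  # Cyrus end of the Guam<->Cyrus socket
    return {"guam": guam, "cyrus": cyrus, "same_host": guam + cyrus}


if __name__ == "__main__":
    # 10000 is an arbitrary example value, not a number taken from the logs.
    print(estimate_min_fds(10000))
```

Comparing the result with the configured open-file limit shows how much headroom remains before accepts start failing with emfile.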

Other possibilities that may be complicating matters:

  • IMAP clients may not be closing their connections to Guam (and so holding those filehandles open), and may even be connecting multiple times
  • It is possible there is a bug where some filehandle or another is not being closed by Guam. Prior to the 0.8.3 release one such bug was identified and fixed. I have not been able to identify other similar issues, however, having tested the various failure modes (sudden client disconnect, server disconnect, crash/closure of Guam internal connection handler)

So it could be a simple matter of actually running out of file handles due to load. Do we have numbers for how many connections are active per frontend? (A simple lsof | wc -l can be informative in this case.) As noted in the Guam troubleshooting guide, one may also attach to Guam from an Erlang console, run observer:start() and examine the behavior of the Guam processes in near real time. I could even do this remotely, given appropriate (temporary) access to one of the frontend systems, or via screenshare with a system admin there.
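Since lsof | wc -l counts entries for the whole system, a per-process count can be easier to compare against a per-process limit. A minimal sketch, assuming Linux with /proc mounted and Python 3.6 or newer: `count_open_fds` is a hypothetical helper, and reading another user's process generally requires root.

```python
import os


def count_open_fds(pid):
    """Count descriptors currently open in process `pid` via /proc (Linux only)."""
    return len(os.listdir(f"/proc/{pid}/fd"))


if __name__ == "__main__":
    # Demonstrates the call on this script's own process. To inspect Guam,
    # pass the PID of its Erlang VM instead (how to find that PID is left out
    # here, since it depends on the installation).
    print(count_open_fds(os.getpid()))
```

Sampling this value periodically during the morning login peak would show whether the count climbs steadily (suggesting a leak or lingering connections) or simply tracks the number of active sessions.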

A fault in a load balancer sold under the product name F5 may in fact be creating long-lasting connections as well, as it tends to send a SYN and await the SYN,ACK, but never sends a RST or ACK,FIN.

petersen lowered the priority of this task from High to 60. Sep 12 2016, 12:44 PM

Is this still an issue, or did the increase of filehandles configuration resolve the issue?

petersen closed this task as Resolved. Sep 20 2016, 9:59 AM

I will close this for now - Please feel free to reopen if there are still issues related to this.

If it is reopened, it will be helpful to get a short status of the issue(s).

seigo moved this task from Backlog to Done on the Guam board. Dec 13 2016, 2:11 PM