
make pykolab synchronize queueing mechanism reliable
Open, Low, Public

Description

In cases where the processing of "def synchronize", let's call it P(a), breaks unexpectedly, the "last_change" value in domain.tld.db is set to the last_change date of the LDAP filter result, let me name the filter F(a) and its result R(a), i.e. to the item processed just before the entry whose processing failed.
The effect is this: if we change or add some users matching the LDAP filter and, for whatever reason, the processing P(a) breaks after adding only some, or even just one, of the results R(a) of filter F(a) to the Kolab system, then all new users that were not processed within the crashed "synchronize" run P(a) will never be processed again.
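
To make the failure mode concrete, here is a minimal sketch with purely hypothetical names (process_entry, a "domain" table with a "last_change" column); it is not pykolab's actual code, only an illustration of the bookkeeping described above:

```
import sqlite3

def process_entry(entry):
    """Hypothetical stand-in for adding/updating one user in Kolab."""
    ...

def synchronize(ldap_results, db_path="domain.tld.db"):
    """Illustrative P(a): walk the filter results R(a) and record the
    modifyTimestamp of the entry processed last."""
    db = sqlite3.connect(db_path)
    try:
        for entry in ldap_results:        # R(a), not necessarily ordered
            process_entry(entry)          # may break at any point
            db.execute(
                "UPDATE domain SET last_change = ?",
                (entry["modifyTimestamp"],),
            )
            db.commit()
    finally:
        db.close()

# If process_entry() breaks halfway through, the next run asks LDAP only for
# entries changed since 'last_change'.  Entries skipped by the crash that
# carry an older modifyTimestamp are never requested again and remain
# unsynchronized until a full "kolab sync --resync".
```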

It would be possible to just run "kolab sync --resync" regularly, but then the process itself, even "kolab sync --resync", is still not reliable.

We discussed the problem here and think it would be a good idea to switch the initiation of query result item processing to a more reliable queuing mechanism.
There could be a different "last_change" field functionality in domain.tld.db, differentiating between an [ldap] last_change and a [kolab] last_change.
With this it would be possible to do the following (see the sketch after this list):

  1. process sync on all domain.tld.db entries where [ldap] last_change >= [kolab] last_change, and update "[kolab] last_change = now" entry by entry after processing
  2. request LDAP user records based on max([kolab] last_change) and the additional regular filter settings, and update and integrate them into domain.tld.db
  3. process as described in 1.
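
A rough sketch of what such a two-timestamp queue could look like, assuming an illustrative "entries" table with ldap_last_change and kolab_last_change columns (the table layout and the process_entry helper are assumptions, not existing pykolab code):

```
import sqlite3
import time

def _now_ldap():
    # LDAP generalized-time string, so both columns stay directly comparable
    return time.strftime("%Y%m%d%H%M%SZ", time.gmtime())

def process_entry(row):
    """Stand-in for actually creating/updating the user in Kolab (IMAP etc.)."""
    ...

def process_pending(db_path="domain.tld.db"):
    """Step 1: handle every entry whose LDAP change has not yet been applied
    to Kolab, stamping kolab_last_change entry by entry after success."""
    db = sqlite3.connect(db_path)
    db.row_factory = sqlite3.Row
    try:
        pending = db.execute(
            "SELECT id, dn FROM entries "
            "WHERE kolab_last_change IS NULL "
            "   OR ldap_last_change >= kolab_last_change"
        ).fetchall()
        for row in pending:
            process_entry(row)        # if this breaks, the entry stays queued
            db.execute(
                "UPDATE entries SET kolab_last_change = ? WHERE id = ?",
                (_now_ldap(), row["id"]),
            )
            db.commit()               # durable per entry, not per run
    finally:
        db.close()
```

Because the Kolab-side timestamp only advances after an entry has actually been processed, a crash simply leaves the remaining entries queued, and the next run of kolabd or "kolab sync" picks them up again.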

I know that this would massively change the design of sync. But it gives much more reliability.

Details

Ticket Type
Task

Event Timeline

Which version of pykolab is this relating to?

At least pykolab 0.7.27, where our intense tests were done. We are currently testing pykolab 0.7.28, but first discovered other problems ("TIMEOUT", as reported by someone else in T1414). We are analysing this and intend to add our findings and logs to T1414 before we proceed with testing the reliability of 0.7.28 in conjunction with T1307 (this task).

The mechanism of "kolab sync" did not change and remains unreliable even with T1414 and the related commit D208.
Maybe the OPT_TIMEOUT on immediate LDAP connect mentioned there will go away with the new (not yet provided) version of pykolab, but that will not make the process itself reliable in an enterprise sense.

What are the exact commands that you run?

What do you mean by "exact commands"?
One of those?

  • "kolabd" is started during system boot
  • "kolab sync" some times "kolab sync -d 9" some times "kolab sync -d 9 -l debug"
  • "kolab sync --resync" some times "kolab sync --resync -d 9" some times "kolab sync --resync -d 9 -l debug"

Having read through your response and analyzed the issue further, we still come back to the same conclusion. The instability that you refer to really seems to be caused by the way your environment (presumably the load balancer between your server and the client) is breaking connections in a non-standard way, mainly without sending an RST or FIN to the client. The only way for the software to know whether a connection is still alive is to ask the OS TCP stack, which in this case would not help, as the OS would still think that the connection is live. Detecting the failed connection then comes down to TCP/IP TTLs or timeouts.
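
For illustration only, a minimal sketch of how a client can at least ask the OS TCP stack to probe an idle connection via keepalives (standard Linux socket options; whether and how pykolab sets these is not claimed here):

```
import socket

def enable_keepalive(sock, idle=60, interval=10, probes=5):
    """Ask the kernel to probe an idle TCP connection so that a peer that
    vanished without sending RST/FIN is eventually detected."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Linux-specific tunables; guarded because they are not available everywhere
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    if hasattr(socket, "TCP_KEEPINTVL"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    if hasattr(socket, "TCP_KEEPCNT"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)
```

Even with keepalives enabled, a middlebox that silently drops connection state is only noticed after the probe timeouts expire, which matches the behaviour described above.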

Kolabd is very well tested (also in the presence of ESG staff) with regard to how it reacts to LDAP or IMAP failures. If the connection gets dropped in the middle of the creation of a mailbox, this mailbox will indeed be declared done, and the missing folders and annotations will be created at the user's first login (whether that happens via the Kolab Webclient or Kontact).

This works well; even when testing with many thousands of users, it does not fail in any of our tests.

This means that, to assist you in this specific situation, we will really need nitpicking details.

We have looked at these issues together in the past, and we are very willing to discuss them again. I will be happy to set up a call about this at any time needed (although it is probably most effective when Colja is back from vacation next week). However, we do expect the outcome to be the same: the failure of the environment (presumably the load balancer) to bring down a connection in an accepted manner will be pointed out as the root cause.

Please let me know if you want to have a call, and if so - when.

This issue is not about instability but about the unreliability of the base mechanism.

Whatever disturbs or breaks the processing, and you will NEVER be able to avoid and/or handle ALL of those disturbances/breaks, the mechanism behind restarting and processing those unprocessed records needs to be reliable, in the sense that unprocessed changes still get processed afterwards.
In the current implementation, and that is the root cause of the unreliability, the timestamp of the last processed record is written to domain.tld.db, and the next loop/run of kolabd or "kolab sync" will use the youngest timestamp available in domain.tld.db to request current changes from LDAP. The records left unprocessed from before the disturbance/break will therefore never be processed without a "kolab sync --resync", and that run, if it even gets through without a disturbance/break, takes forever in our environment (18 hours and more).

Neither a phone call nor nitpicking more details will change this problem or its root cause. This is a major design issue and does not necessarily have anything to do with the stability of kolab(d) itself, but with the behavior of resuming normal processing after most kinds of disturbances in networking, the filesystem, or related services.

We are not asking you to handle any kind of instability in the surrounding systems or infrastructure, but we need kolabd to reliably restart processing the things that were not processed, for whatever reason, except a "can't write or even access domain.tld.db".

I have already described above how to achieve this in an easy way (a rough sketch follows the list below):

  1. add a second timestamp column to domain.tld.db
  2. let "def synchronize" update the "[ldap] last_change" column only
  3. let some other function sync on all domain.tld.db entries where [ldap] last_change >= [kolab] last_change, and update "[kolab] last_change = now" entry by entry after processing
  4. request LDAP user records based on max([kolab] last_change) and the additional regular filter settings, and update and integrate them into domain.tld.db
  5. process as described in 3.
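
For steps 1 and 4, a hedged sketch, again with illustrative names only (the "entries" table, the column names and the python-ldap call are assumptions about how this could look, not the existing domain.tld.db layout):

```
import sqlite3
import ldap   # python-ldap, assumed to be available as in pykolab itself

def add_kolab_timestamp_column(db_path="domain.tld.db"):
    """Step 1: add the second timestamp column next to the existing one."""
    db = sqlite3.connect(db_path)
    with db:
        db.execute("ALTER TABLE entries ADD COLUMN kolab_last_change TEXT")
    db.close()

def fetch_changed_records(conn, base_dn, extra_filter, db_path="domain.tld.db"):
    """Step 4: ask LDAP only for records changed since the newest timestamp
    that has actually been applied to Kolab."""
    db = sqlite3.connect(db_path)
    (since,) = db.execute(
        "SELECT COALESCE(MAX(kolab_last_change), '19700101000000Z') FROM entries"
    ).fetchone()
    db.close()
    # extra_filter is the regular, parenthesized user filter from the config
    search_filter = "(&%s(modifyTimestamp>=%s))" % (extra_filter, since)
    return conn.search_s(base_dn, ldap.SCOPE_SUBTREE, search_filter)
```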

Since I'm not deep into (Python) development, I can't write the code to do what I described myself, but I think I can find someone who could. Do you need the Python code? I have explained the problem and the possible solution to so many people, so many times now, that I think it could be easier to find someone out there to implement and contribute it. Would you assure us that you will accept this contribution, after review, into the next possible packaging?

We, you and us, would save a lot of time this way.

We sifted through a lot of network dumps to verify the "wrong behaviour". The load balancer and all firewalls are sending the appropriate RST packets when a timeout is reached.

vanmeeuwen raised the priority of this task from 20 to Low. Mar 28 2019, 8:13 AM