Page MenuHomePhorge

stopping wallace leaves a process behind, restart fails
Closed, ResolvedPublic

Description

I'm running Kolab on Debian 9, wallace is:

||/ Name               Version             Architektur         Beschreibung
+++-==================-===================-===================-============
ii  wallace            0.8.11-0~kolab3     all                 Kolab Content-Filter

Wallace has the following processes running:

# ps aux | grep wallace
kolab    24174  0.0  0.6 496476 19988 ?        Sl   21:43   0:00 /usr/bin/python /usr/sbin/wallaced -l warning --fork --user kolab
kolab    24175  0.0  0.5 201548 18464 ?        S    21:43   0:00 /usr/bin/python /usr/sbin/wallaced -l warning --fork --user kolab
kolab    24176  0.0  0.5 201548 18472 ?        S    21:43   0:00 /usr/bin/python /usr/sbin/wallaced -l warning --fork --user kolab
kolab    24177  0.0  0.5 201548 18468 ?        S    21:43   0:00 /usr/bin/python /usr/sbin/wallaced -l warning --fork --user kolab
kolab    24178  0.0  0.5 201548 18468 ?        S    21:43   0:00 /usr/bin/python /usr/sbin/wallaced -l warning --fork --user kolab
kolab    24183  0.0  0.6 496476 18624 ?        S    21:43   0:00 /usr/bin/python /usr/sbin/wallaced -l warning --fork --user kolab
root     24227  0.0  0.0  14316  1028 pts/2    S+   21:43   0:00 grep wallace

After "systemctl stop wallace.service" we have one process left:

# ps aux | grep wallace
kolab    24183  0.0  0.6 496476 18624 ?        S    21:43   0:00 /usr/bin/python /usr/sbin/wallaced -l warning --fork --user kolab
root     24719  0.0  0.0  14316   940 pts/2    S+   21:43   0:00 grep wallace
`

Wallace is started with an init-script, not a systemd unit.

Details

Ticket Type
Task

Event Timeline

machniak lowered the priority of this task from Needs Triage to Normal.May 29 2019, 12:38 PM
machniak added a project: PyKolab.
machniak subscribed.

Confirmed. If I then try to start wallace, I got:

2019-05-29 11:36:24,741 pykolab.wallace WARNING [10320] Could not bind to socket on port 10026 on bind address localhost
2019-05-29 11:36:24,741 pykolab.wallace WARNING [10320] Could not shut down socket
2019-05-29 11:36:25,743 pykolab.wallace WARNING [10320] Could not shut down socket
2019-05-29 11:36:26,745 pykolab.wallace WARNING [10320] Could not shut down socket
2019-05-29 11:36:27,747 pykolab.wallace WARNING [10320] Could not shut down socket
2019-05-29 11:36:28,750 pykolab.wallace WARNING [10320] Could not shut down socket
...

and

# ps ax | grep wallace
 3595 ?        S      0:00 /usr/bin/python /usr/sbin/wallaced -l warning --fork --user kolab
10320 ?        Sl     0:00 /usr/bin/python /usr/sbin/wallaced -l warning --fork --user kolab
10321 ?        S      0:00 /usr/bin/python /usr/sbin/wallaced -l warning --fork --user kolab
10322 ?        S      0:00 /usr/bin/python /usr/sbin/wallaced -l warning --fork --user kolab
10323 ?        S      0:00 /usr/bin/python /usr/sbin/wallaced -l warning --fork --user kolab
10324 ?        S      0:00 /usr/bin/python /usr/sbin/wallaced -l warning --fork --user kolab

I've tried reading the wallace source but didn't see an easy fix. For now I use a cron job
with the following script to fix wallace up:

#!/bin/bash

# Normal wallace has five processes - one is a child of init,
# the others are childs of the first wallace process.
# After a restart one process remains, and we have two
# childs of init.
count=0

for ppid in $(ps -e -o pid,ppid,cmd | awk '/[\/]usr\/sbin\/wallaced/ { print $2 }' ); do
  if [ "$ppid" = "1" ]; then
    count=$(( $count + 1 ))
  fi
done


if [ "$count" -gt 1 ]; then
  echo "$count wallace processes have PPID 1 - restarting"
  systemctl stop wallace.service
  kill $(ps -e -o pid,ppid,cmd | awk '/[\/]usr\/sbin\/wallaced/ { print $1 }' )
  sleep 1
  systemctl start wallace.service
fi

No such issue on CentOS with wallace-0.8.11-7.1.el7.kolab_16.

My python is Python 2.7.13 from the debian packages:

||/ Name                               Version
+++-==================================-===============
ii  libpython2.7-stdlib:amd64          2.7.13-2+deb9u3
ii  python2.7                          2.7.13-2+deb9u3

What's your python version? On my non-Kolab CentOS I have:

python-2.7.5-77.el7_6.x86_64

I've looked at python bugs and found
https://bugs.python.org/issue35629, but I've not yet looked further.

The problem can be reproduced on CentoOS, and Ubuntu as well. There is no matter which distro is used and I’m sure that every current build is affected.

I’ve made my patch on an Ubuntu 18.04, but before that I verified the case on the mostly supported OS: CentoOS.

[root@kolab-centos imre]# cat /etc/centos-release
CentOS Linux release 7.6.1810 (Core)
[root@kolab-centos imre]# rpm -qa | grep wallace
wallace-0.8.10-6.4.el7.kolab_wf.noarch

This system is a fresh install, no production traffic on it, It has no load, but stopping the service is lasts for 90 seconds:

[root@kolab-centos imre]# ps axuw | grep wall
root      7557  0.0  0.0 112708   948 pts/1    R+   13:28   0:00 grep --color=auto wall
[root@kolab-centos imre]# systemctl start wallace
[root@kolab-centos imre]# ps axuw | grep wall
kolab     7571  1.0  0.4 594056 18764 ?        Sl   13:28   0:00 /usr/bin/python /usr/sbin/wallaced --fork -l warning --pid-file /run/wallaced/wallaced.pid
kolab     7572  0.0  0.4 299128 15840 ?        S    13:28   0:00 /usr/bin/python /usr/sbin/wallaced --fork -l warning --pid-file /run/wallaced/wallaced.pid
kolab     7573  0.0  0.4 299128 15840 ?        S    13:28   0:00 /usr/bin/python /usr/sbin/wallaced --fork -l warning --pid-file /run/wallaced/wallaced.pid
kolab     7574  0.0  0.4 299128 15840 ?        S    13:28   0:00 /usr/bin/python /usr/sbin/wallaced --fork -l warning --pid-file /run/wallaced/wallaced.pid
kolab     7575  0.0  0.4 299128 15848 ?        S    13:28   0:00 /usr/bin/python /usr/sbin/wallaced --fork -l warning --pid-file /run/wallaced/wallaced.pid
kolab     7580  0.0  0.4 594312 18116 ?        S    13:28   0:00 /usr/bin/python /usr/sbin/wallaced --fork -l warning --pid-file /run/wallaced/wallaced.pid
root      7582  0.0  0.0 112708   944 pts/1    R+   13:28   0:00 grep --color=auto wall
[root@kolab-centos imre]# time systemctl stop wallace

real    1m30.084s
user    0m0.011s
sys     0m0.019s
[root@kolab-centos imre]# ps axuw | grep wall
root      7593  0.0  0.0 112708   948 pts/1    R+   13:29   0:00 grep --color=auto wall
[root@kolab-centos imre]#

That 90 seconds is very similar to the TimeoutStopSec paramter of systemd, so its very likely that wallace cannot exit gracefully, and systemd has to kill it after a while.

The shutdown on Ubuntu (tested on Ubuntu 16.04, wallace 0.8.10-0~kolab2) is a bit faster, but wallace lefts an orphaned process:

root@kolab-illusion:~# ps axuw |grep wall
kolab    11138  0.0  1.2 562172 49568 ?        S    13:42   0:00 /usr/bin/python /usr/sbin/wallaced -l debug -d 9 --fork --user kolab

This orhaned process still keeps the socket open, so after the next start you get a running but not working wallace:

pykolab.wallace WARNING [13333] Could not shut down socket

If you try to debug wallace, and run it in foreground you also notice that wallace cannot be stoped with Ctrl-C, and on the top of that, you get some awful backtraces:

2019-06-12 14:08:03,428 pykolab.wallace DEBUG [13482] Switching real and effective user id to 412
2019-06-12 14:08:03,439 pykolab.wallace DEBUG [13484] Worker process PoolWorker-1 initializing
2019-06-12 14:08:03,444 pykolab.wallace DEBUG [13485] Worker process PoolWorker-2 initializing
2019-06-12 14:08:03,452 pykolab.wallace DEBUG [13487] Worker process PoolWorker-4 initializing
2019-06-12 14:08:03,447 pykolab.wallace DEBUG [13486] Worker process PoolWorker-3 initializing
Process PoolWorker-4:
Process PoolWorker-3:
Process PoolWorker-2:
Process PoolWorker-1:
Traceback (most recent call last):
  File "/usr/lib/python2.7/multiprocessing/process.py", line 267, in _bootstrap
    self.run()
  File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 102, in worker
    task = get()
2019-06-12 14:08:05,025 pykolab.wallace INFO [13482] Interrupted by user
Traceback (most recent call last):
Traceback (most recent call last):
  File "/usr/lib/python2.7/multiprocessing/process.py", line 267, in _bootstrap
    self.run()
Traceback (most recent call last):
  File "/usr/lib/python2.7/multiprocessing/process.py", line 267, in _bootstrap
    self.run()
  File "/usr/lib/python2.7/multiprocessing/process.py", line 267, in _bootstrap
    self.run()
  File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 102, in worker
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 102, in worker
    task = get()
  File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 102, in worker
  File "/usr/lib/python2.7/multiprocessing/queues.py", line 374, in get
  File "/usr/lib/python2.7/multiprocessing/queues.py", line 376, in get
    task = get()
  File "/usr/lib/python2.7/multiprocessing/queues.py", line 374, in get
2019-06-12 14:08:05,049 pykolab.wallace INFO [13492] Terminating heartbeat process
    task = get()
  File "/usr/lib/python2.7/multiprocessing/queues.py", line 374, in get
    return recv()
    racquire()
    racquire()
    racquire()
KeyboardInterrupt
KeyboardInterrupt
KeyboardInterrupt
KeyboardInterrupt
2019-06-12 14:08:05,184 pykolab.wallace DEBUG [13493] Worker process PoolWorker-6 initializing
2019-06-12 14:08:05,190 pykolab.wallace DEBUG [13494] Worker process PoolWorker-7 initializing
2019-06-12 14:08:05,200 pykolab.wallace DEBUG [13496] Worker process PoolWorker-9 initializing
2019-06-12 14:08:05,195 pykolab.wallace DEBUG [13495] Worker process PoolWorker-8 initializing

So I can say wallace has some difficulties with exiting.

Going deep, these are different and independent problems:

  • The timer thread is started but never cancelled. It blocks the main process keeping it unstoppable.
  • In the main process (only without --fork) where SIGINT is handled, the MultiProcessing Pool is not closed. This affects that athough all sub processes get the SIGINT and exit (with a backtrace), but the pool immediately starts another subprocesses.

My suggestions are:

  • Exclude the while loop in Timer, that let it be finished after a timer.cancel(). (Calling a cancel() without leave the external loop goes into an infinite loop, spinning up the CPU.)

Another solution can be to deamonize the thread, like heartbeat thread does:

@@ -239,6 +238,7 @@
         s.listen(5)

         self.timer = Timer(180, self.pickup_spool_messages)
+        self.timer.daemon = True;
         self.timer.start()
  • Close the pool while KeyboardInterrupt, not just while SIGTERM in main process.
  • Ignore the SIGINT signals in child processes, that eliminates the annoying backtraces.
  • SIGTERM handler is not called while running without --fork. (The patch doesn't handle this, it's a minor problem: running without --fork is only for debugging purpose, in production rarely happen. )

Putting it together:

diff --git a/wallace/__init__.py b/wallace/__init__.py
index 0078ec1..e153f28 100644
--- a/wallace/__init__.py
+++ b/wallace/__init__.py
@@ -21,6 +21,7 @@ import asyncore
 from distutils import version
 import grp
 import multiprocessing
+import signal
 import os
 import pwd
 import traceback
@@ -98,16 +99,15 @@ def modules_heartbeat(wallace_modules):
             break

 def worker_process(*args, **kw):
+    signal.signal(signal.SIGINT, signal.SIG_IGN)
     log.debug(_("Worker process %s initializing") % (multiprocessing.current_process().name), level=1)

 class Timer(_Timer):
     def run(self):
-        while True:
-            while not self.finished.is_set():
-                self.finished.wait(self.interval)
-                self.function(*self.args, **self.kwargs)
-
-            self.finished.set()
+        while not self.finished.is_set():
+            self.finished.wait(self.interval)
+            self.function(*self.args, **self.kwargs)
+        self.finished.set()

 class WallaceDaemon(object):
     def __init__(self):
@@ -391,6 +391,8 @@ class WallaceDaemon(object):
         if os.access(conf.pidfile, os.R_OK):
             os.remove(conf.pidfile)

+        if self.timer is not None:
+            self.timer.cancel()
         if self.pool is not None:
             self.pool.close()
             self.pool.join()
@@ -521,6 +523,9 @@ class WallaceDaemon(object):
             exitcode = e
         except KeyboardInterrupt:
             exitcode = 1
+            self.timer.cancel()
+            self.pool.close()
+            self.pool.join()
             log.info(_("Interrupted by user"))
         except AttributeError, e:
             exitcode = 1

Please keep in mind, that I’m not a programmer, please handle it reservedly, review it twice.

On my system wallace has been updated to 0.8.13-0~kolab2 and restarts seem to work now. I'l have an eye on it. If other people can confirm the fix we can close the ticket.

Yes, I can confirm that 0.8.13-0~kolab2 solves the problem. I suggest to close this issue.

machniak claimed this task.