« Back to blog

Stateless, fault-tolerant scheduling using randomness

I love stateless programs. They're simple to understand, and if they crash, you can just start them back up without worrying about any state being corrupted.

I recently wrote a daemon to index tweets from the Twitter API and throw those tweets into Solr for searching. Rather than worry about reconnecting on a connection error (such as Twitter hiccuping) and complicating the code, I coded it to just exit the process and let daemon tools restart the process. It was simple, stateless, and did the job.

I then wanted to periodically delete old data from Solr and optimize Solr's index. Since this is an expensive operation I wanted to do it infrequently (i.e., once a day). I also wanted this logic to exist within my indexing program to ensure I only had one process to worry about. This makes tweaking the logic and deploying easier.

My first thought was to do something like this:

 def deleting_thread():
   while True:
     time.sleep(3600*24)
     deleteFromSolrAndOptimize()
 

Unfortunately, this won't work very well in the face of random failures. If the process dies once every 12 hours, the deletion logic will never get run.

Alternatively, I could reorder my commands and try something like this:

 def deleting_thread():
   while True:
     deleteFromSolrAndOptimize()
     time.sleep(3600*24)
 

However, this has the opposite problem in that if there are lots of random failures (like Twitter is having a really bad day or is down), I could be sending potentially thousands of expensive commands to Solr.

It seems like the only solution is to either separate out this logic into a cron or keep external state in a file or database somewhere, losing the benefits of statelessness. Happily, though, there's a stateless solution that involves using random numbers.

The idea is to wake up our deletion thread once a second and randomly decide whether or not to call "deleteFromSolrAndOptimize()". If we set the probability correctly, we can make sure our function is being called at a sufficient but not unreasonable rate. The template for this looks like:

 def deleting_thread():
   while True:
     if random.randint(0,TRIGGER_RATE_SEC) == 0:
       deleteFromSolrAndOptimize()
     time.sleep(1)
 

To determine the proper value of TRIGGER_RATE_SEC, we need to determine how often and at what probability we want our function to be called. For my purposes, I want the function to be called at least once a day with 90% probability (which ensures 99% probability within two days). We must solve the equation:

Media_httprogercortes_hjwmh

Solving, we get that TRIGGER_RATE_SEC must be at most "3600*10" to meet our desired constraints.

Now, we need to make sure that our function isn't called too many times. To do this, we can use a variant of Chebyshev's inequality called "Cantelli's inequality", which states:

Media_httprogercortes_yhetr

Since our situation is modeled by the binomial distribution, our mean is "n*p" and our variance is "n*p*(1-p)". Plugging the numbers we see that our function will be called more than 8 times a day less than 5% of the time and more than 24 times per day less than 1% of the time. For my purposes, I can live with this.

The drawback to stateless scheduling through randomness is that occasionally your function will be invoked many more times than you originally intended. As long as the numbers aren't too unreasonable for the application though, I think the benefits of statelessness outweigh the inherent unpredictability.

by

| Viewed
times
Follow @BackTypeTech on Twitter!