The Importance of Idempotence

August 15, 2014

tl;dr Idempotence helps create more robust systems.

Idempotence is a mathematical concept that should be understood by all developers.

An operation is considered idempotent when doing it more than once is the same as doing it once.

For example, multiplying by 1 is idempotent.

x * 1 == x * 1 * 1

Multiplying by 0 is also idempotent.

x * 0 == x * 0 * 0

The key concept to remember is that applying the operation once can have side effects, but applying it more than once will not do anything more than what was already done the first time.

Assignment is idempotent.

x := 4

You can assign 4 to x as many times as you want; x will still be 4. But assigning 4 to x once is different from not assigning it at all.
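The definition can be checked mechanically: an operation f is idempotent when f(f(x)) gives the same result as f(x). Here is a minimal Python sketch of that check. The helper name is_idempotent_on is made up for illustration, and it only tests the sample values it is given, so it is a sanity check rather than a proof.

```python
def is_idempotent_on(f, samples):
    """Return True if f(f(x)) == f(x) for every sample value."""
    return all(f(f(x)) == f(x) for x in samples)


def times_one(x):
    return x * 1


def times_zero(x):
    return x * 0


def plus_one(x):
    return x + 1


samples = [-2, 0, 1, 7]
print(is_idempotent_on(times_one, samples))   # multiplying by 1
print(is_idempotent_on(times_zero, samples))  # multiplying by 0
print(is_idempotent_on(plus_one, samples))    # adding 1 keeps changing the value
```

Adding 1 is the counterexample: every extra application moves the value again, so doing it twice is not the same as doing it once.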

HTTP Verbs

HTTP verbs can be classified as idempotent or not.

DELETE is an idempotent verb. No matter how many times you do it after the first time, it will give the same result as the first time. For example, DELETE /users/4/contacts/3 could remove your contact with the ID 3. If you call it again, that contact has already been removed and nothing more should happen.

GET is also idempotent. In fact, it's more than idempotent: it is considered a safe method. Safe methods can be compared to multiplying by 1. Doing it zero, one or more times should have the same effect. All GET does is get a resource. That is why you should never use normal links, which issue GET requests, to delete resources.

POST is not idempotent. Every time you do it, you can expect a side effect to happen. For example, every time you POST a contact form, an email is sent.

When it comes to APIs, that concept is well understood by consumers and providers. Designing around it follows the principle of least astonishment.
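To make the difference concrete, here is a small Python sketch that mimics the three verbs with an in-memory store instead of a real HTTP server. The function names and the contacts and sent_emails structures are hypothetical, chosen only to mirror the examples above.

```python
contacts = {3: "Alice"}   # contact id -> name (made-up data)
sent_emails = []          # side effects produced by the POST handler


def get_contact(contact_id):
    """GET: safe, it only reads and never changes anything."""
    return contacts.get(contact_id)


def delete_contact(contact_id):
    """DELETE: idempotent, removing a missing contact changes nothing."""
    contacts.pop(contact_id, None)


def post_contact_form(message):
    """POST: not idempotent, every call sends one more email."""
    sent_emails.append(message)


delete_contact(3)
delete_contact(3)            # second call is a no-op
post_contact_form("hello")
post_contact_form("hello")   # second call sends a second email

print(contacts)
print(len(sent_emails))
```

After the two DELETE calls the store is in the same state as after one, while the two POST calls leave two emails behind.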

See Wikipedia: Safe Methods and Idempotent methods.

Message Queues

Let's say you build a web app to manage events. You can add people to the invitees of an event. In other words, an event has many invitees. For quicker response time, you decided to send all emails from a worker. So when a user finalizes an event, a message is queued. A worker gets that message and sends the invite emails for that event.

This is a really common pattern, and if you are not doing it that way, you should really start to.

One day, you realize that you had an SMTP problem: your emails have not been sent for some time, and it's not even showing up in the logs! You think "oh, well, I'll just call the function to send the emails again", but you don't know for which events you should call that function.

Here comes idempotence.

When you execute a task in a worker, always make sure it's idempotent.

For the email example, every time you send an email to an invitee, you can keep the datetime in the database, in the event invitee row. If it has already been sent (sent datetime is not NULL), don't send it again. As easy as that. Also, you might want to check if the event is finalized. If not finalized, do nothing.

More generally:

  • Check that your job is ready to be executed (e.g. event finalized). If not, do nothing.
  • Check that the job has not already been done. If done, do nothing.
  • Do job, keep datetime or something else in the database, log that it has been done (e.g. "INFO Invitation email for event 234 has been sent to john.doe@example.com").
  • Keep jobs granular. You have 5 emails to send? Queue a job (e.g. send_event_emails(event_id)) that will queue the 5 other jobs (e.g. send_event_email_to_invitee(event_id, invitee_id)).
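The steps above can be sketched in Python with an in-memory SQLite database. The table layout, the column names (finalized, sent_at) and the outbox list that stands in for the SMTP call are all assumptions for illustration. In a real system send_event_emails would queue one message per invitee instead of calling the second function inline.

```python
import sqlite3
from datetime import datetime, timezone

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE events (id INTEGER PRIMARY KEY, finalized INTEGER NOT NULL);
    CREATE TABLE event_invitees (
        event_id INTEGER NOT NULL,
        invitee_id INTEGER NOT NULL,
        email TEXT NOT NULL,
        sent_at TEXT,
        PRIMARY KEY (event_id, invitee_id)
    );
    INSERT INTO events VALUES (234, 1);
    INSERT INTO event_invitees VALUES
        (234, 1, 'john.doe@example.com', NULL),
        (234, 2, 'jane.doe@example.com', NULL);
""")

outbox = []  # stands in for the real SMTP call


def send_event_email_to_invitee(event_id, invitee_id):
    row = db.execute(
        "SELECT e.finalized, i.email, i.sent_at "
        "FROM events e JOIN event_invitees i ON i.event_id = e.id "
        "WHERE e.id = ? AND i.invitee_id = ?",
        (event_id, invitee_id),
    ).fetchone()
    if row is None:
        return
    finalized, email, sent_at = row
    if not finalized:          # job is not ready: do nothing
        return
    if sent_at is not None:    # job already done: do nothing
        return
    outbox.append(email)       # do the job
    db.execute(
        "UPDATE event_invitees SET sent_at = ? "
        "WHERE event_id = ? AND invitee_id = ?",
        (datetime.now(timezone.utc).isoformat(), event_id, invitee_id),
    )
    db.commit()
    print("INFO Invitation email for event %d has been sent to %s"
          % (event_id, email))


def send_event_emails(event_id):
    """Fan out to one granular job per invitee (called inline here)."""
    rows = db.execute(
        "SELECT invitee_id FROM event_invitees "
        "WHERE event_id = ? ORDER BY invitee_id",
        (event_id,),
    ).fetchall()
    for (invitee_id,) in rows:
        send_event_email_to_invitee(event_id, invitee_id)


send_event_emails(234)
send_event_emails(234)  # running it again sends nothing new
print(outbox)
```

One caveat with this sketch: if the worker crashes after the email goes out but before sent_at is saved, the next run will send that one email again. The window is small, but it is worth knowing about.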

You realize something went wrong? You can always call your function to send emails on all events. Still crashed when half of the emails were sent? Fix what was wrong and just call it again. Also, it's easy to inspect what emails have not been sent yet. Bonus, you can do some intelligence with the datetimes (how many emails a day do we send? what are the peak hours?).

Also, some message queues don't guarantee that a message will be delivered only once. Amazon SQS is that way. Your workers should really only do idempotent tasks.

SQL Migrations

In the same spirit as the worker example above, when you do an SQL migration, do it in an idempotent way when possible.

For example, you decide to split the user table into two tables. One for basic information (users) and one for all details that are not always important (profiles). You put a foreign key user_id in the profiles table. You have a migration that takes every row in users (SELECT * FROM users) and inserts a row in profiles with user data. You run it, and well, it crashes midway after 1 hour, because of some NULL value you didn't think about. You fix it and run it again, but you then realize that some users have already been processed and have two profiles.

The idempotent solution: instead of SELECT * FROM users, you can just select the users that don't have a profile row. That way you can run it as many times as you want. It will only process the few users that have not been processed yet. A big advantage of that method is that you can leave your app running in production while you do the migrations. When you are ready to deploy the new code that uses the profiles table, you can call your function again to make sure the latest users that signed up are also migrated. That example is not so great, because a user could change some information in the users table during the migration, but I guess you get the point.
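Here is a rough sketch of that migration in Python with an in-memory SQLite database. The columns (name, bio) and the function names are invented for the example. The important part is the query that only returns users without a profile row.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL, bio TEXT);
    CREATE TABLE profiles (
        id INTEGER PRIMARY KEY,
        user_id INTEGER NOT NULL REFERENCES users(id),
        bio TEXT
    );
    INSERT INTO users (id, name, bio) VALUES (1, 'ann', 'hi'), (2, 'bob', NULL);
""")


def users_without_profile(conn):
    """Select only the users that have not been migrated yet."""
    return conn.execute(
        "SELECT u.id, u.bio FROM users u "
        "LEFT JOIN profiles p ON p.user_id = u.id "
        "WHERE p.id IS NULL ORDER BY u.id"
    ).fetchall()


def migrate_profiles(conn):
    for user_id, bio in users_without_profile(conn):
        conn.execute(
            "INSERT INTO profiles (user_id, bio) VALUES (?, ?)",
            (user_id, bio),
        )
        conn.commit()  # commit per user so a crash keeps earlier progress


migrate_profiles(db)
# A new user signs up while the old code is still in production.
db.execute("INSERT INTO users (id, name, bio) VALUES (3, 'cat', 'new signup')")
db.commit()
migrate_profiles(db)  # only processes the new user
migrate_profiles(db)  # nothing left to do

print(db.execute("SELECT COUNT(*) FROM profiles").fetchone()[0])
```

A unique constraint on profiles.user_id would be a sensible extra safety net, so that a buggy run fails loudly instead of creating duplicates.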

Denormalized Data

You have an application where each user has many documents. They can search their documents by tags. Tags come from many sources: the title of the documents, the folders, actual tags, the names of the authors, etc. You decided to keep a table named tags where you keep all the tags for every document. It looks like this: id (hash), tag (actual text of the tag), document_id (foreign key to the document). When you add an author to a document, it does an insert in the tags table. When you remove an author, it finds the matching tag and removes it.

Someday, you see in the log there was an error every once in a while when inserting a tag because of an obscure character encoding issue. You fix the issue and deploy the new code. However, there are a lot of missing tags and you have no easy way to fix it manually.

Instead of just having functions of the type add_tag_for_new_author, you should have a function of the type update_tags_for_document. When you call that function, instead of just adding a tag for the author that was just added, it checks the whole document, rebuilds the tags list and makes sure that the correct data is in the database. That way, the tags table is really managed as it should be: a cache. You could delete all rows from that table and just call update_tags_for_document on every document. It takes 2s to update the tags for a document? Let the worker do it, queue a message.
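A minimal sketch of update_tags_for_document in Python, using plain dictionaries instead of a database. The document fields, the compute_tags and tag_row_id helpers, and the choice of a SHA-1 of the document id and tag as the row id are assumptions made for this example.

```python
import hashlib

# Hypothetical document store: id -> document fields.
documents = {
    1: {"title": "Budget 2014", "folder": "Finance",
        "authors": ["Ann"], "tags": ["q3"]},
}
tags_table = {}  # row id (hash) -> (tag, document_id)


def tag_row_id(document_id, tag):
    """Derive a stable row id so the same tag always maps to the same row."""
    key = "%d:%s" % (document_id, tag)
    return hashlib.sha1(key.encode("utf-8")).hexdigest()


def compute_tags(doc):
    """Collect tags from every source: title, folder, authors, actual tags."""
    tags = set(doc["title"].lower().split())
    tags.add(doc["folder"].lower())
    tags.update(author.lower() for author in doc["authors"])
    tags.update(tag.lower() for tag in doc["tags"])
    return tags


def update_tags_for_document(document_id):
    """Make the tag rows of one document match what the document says."""
    wanted = {tag_row_id(document_id, tag): (tag, document_id)
              for tag in compute_tags(documents[document_id])}
    stale = [row_id for row_id, (_, doc_id) in tags_table.items()
             if doc_id == document_id and row_id not in wanted]
    for row_id in stale:
        del tags_table[row_id]
    tags_table.update(wanted)


update_tags_for_document(1)
documents[1]["authors"].append("Bob")
update_tags_for_document(1)
snapshot = dict(tags_table)
update_tags_for_document(1)   # running it again changes nothing
tags_table.clear()            # the table is only a cache...
update_tags_for_document(1)   # ...so it can be rebuilt from scratch

print(sorted(tag for tag, _ in tags_table.values()))
```

Because the function always converges to the same rows for a given document, it does not matter whether a previous insert failed, was skipped, or ran twice.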

Conclusion

If you were not aware of idempotence, I hope I convinced you to use it. Also, please note that I kept the examples simplistic for educational purposes.

Deploying a Pyramid App on AppFog

August 28, 2012

Last July, AppFog revealed its official plans, which include a free plan that comes with 2GB of RAM. I then decided to give it a try and move a Pyramid app that I had running on a VPS for about a year. The app uses Python 2.7, PostgreSQL, Pyramid 1.3.2 and SQLAlchemy 0.7.8.

On the VPS, the app was running using Waitress behind nginx.

I had to make some minor modifications to the code to make it run on AppFog, which were not really documented anywhere. I decided to post it here, in the hope someone finds it when needed.

Configuration File

First of all, I created a new configuration file. I named it appfog.ini, but name it whatever you like. I removed the key sqlalchemy.url, since it is set in environment variables by AppFog. Also, for the app to know that it has to retrieve the environment variables instead of using sqlalchemy.url, I added the key/value appfog = true.

SQLAlchemy Connection

In the main function of __init__.py, I replaced the line where the engine is created from the configuration file to take into account that the connection information could come from environment variables. So I replaced:

engine = engine_from_config(settings, 'sqlalchemy.')

with:

if settings.get('appfog') == 'true':
    engine = appfog_engine(settings)
else:
    from sqlalchemy import engine_from_config
    engine = engine_from_config(settings, 'sqlalchemy.')

And I defined appfog_engine later in the same file:

def appfog_engine(settings):
    from sqlalchemy import create_engine
    import os, json
    all_config = json.loads(os.getenv("VCAP_SERVICES"))
    config = all_config['postgresql-9.1'][0]['credentials']
    connection_string = ('postgresql+psycopg2://%(username)s:%(password)s'
                         '@%(host)s:%(port)d/%(name)s')
    engine = create_engine(connection_string % config)
    return engine
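To see what that function does without deploying anything, here is a small standalone sketch that builds the same connection string from a sample payload. The payload values are made up, and its shape is only what the code above implies (a 'postgresql-9.1' list whose first item has a 'credentials' dict). The helper name connection_string_from_vcap is hypothetical too.

```python
import json

# Made-up payload for illustration; the real one comes from the
# VCAP_SERVICES environment variable set by AppFog.
sample_vcap_services = json.dumps({
    "postgresql-9.1": [{
        "credentials": {
            "username": "app_user",
            "password": "secret",
            "host": "10.0.0.5",
            "port": 5432,
            "name": "app_db",
        }
    }]
})


def connection_string_from_vcap(vcap_services_json):
    """Build a SQLAlchemy URL from the first PostgreSQL service entry."""
    all_config = json.loads(vcap_services_json)
    config = all_config["postgresql-9.1"][0]["credentials"]
    return ("postgresql+psycopg2://%(username)s:%(password)s"
            "@%(host)s:%(port)d/%(name)s") % config


url = connection_string_from_vcap(sample_vcap_services)
print(url)  # postgresql+psycopg2://app_user:secret@10.0.0.5:5432/app_db
```

Note that %(port)d expects the port to be a number in the JSON. If it came as a string, the formatting would raise a TypeError.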

Requirements

As when deploying on Heroku, you have to provide a requirements.txt file, which you can generate by running:

$ pip freeze > requirements.txt

WSGI App

You then have to create a wsgi.py file that you place at the root of your project, beside setup.py and appfog.ini. In wsgi.py, you put:

import os
from paste.deploy import loadapp

os.system("python setup.py develop")
path = os.getcwd()
application = loadapp('config:appfog.ini', relative_to=path)

That's It

Your app is ready to upload following the regular instructions, using the af command line tool provided by AppFog.

Useful Stuff

Since my app was already in production for about a year, I had data in my PostgreSQL database that I had to migrate from my server to AppFog. One nice feature they provide is the ability to make a tunnel to your services. You can find the doc here. With the help of the tunnel, it was easy to use pg_dump and psql to migrate the database.

Generating Country List For HTML Select

December 13, 2011

During a client project using Paypal DirectPayment API, I had to make a <select> with all the countries with their country codes as the value. For example:

<option value="US">United States</option>
<option value="CA">Canada</option>

Paypal already gives this list.

The problem is that it's only in English and in all caps. What I needed was a French version and an English version, not in caps.

Geonames.org just happens to have exactly that. I wrote this Python script to generate the list:

import urllib, json, unicodedata

def get_list(username='demo', lang='en'):
    """Fetches the json of all countries from geonames.org."""
    params = urllib.urlencode({'lang': lang, 'username': username})
    url = 'http://api.geonames.org/countryInfoJSON?%s' % params
    f = urllib.urlopen(url)
    response_text = f.read()
    return json.loads(response_text)["geonames"]

def country_list_generator(country_list, func):
    """Returns a generator of countries.
    They are sorted alphabetically,
    and the func passed as argument is applied to each of them."""
    ordered = sorted(country_list,
                     key=lambda k: strip_accents(k['countryName']))
    return (func(country) for country in ordered)

def country_to_option(country):
    """Transforms a country dict to an html option tag,
    The country code is used for the value."""
    return ('<option value="%s">%s</option>\n' % \
        (country['countryCode'], country['countryName'])).encode('utf-8')

def country_to_csharp_dict_pair(country):
    """Transforms a country dict to a C# Dictionary pair."""
    return ('{"%s", "%s"},\n' % \
        (country['countryCode'], country['countryName'])).encode('utf-8')

def strip_accents(s):
    """Removes the accents from a unicode string.
    This function is used for sorting."""
    return ''.join((c for c in unicodedata.normalize('NFD', s) \
        if unicodedata.category(c) != 'Mn'))

if __name__ == '__main__':
    country_list = get_list('demo', lang='en')
    with open('output.txt', 'w') as f:
        gen = country_list_generator(country_list, country_to_option)
        f.writelines(gen)
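The accent stripping matters mostly for the French list. Here is a small offline sketch of just the sorting part, written for Python 3 (the script above is Python 2) and using a hand-made sample instead of the Geonames response. The html.escape call is an addition of mine, not something the script above does.

```python
import html
import unicodedata

# Hand-made sample in the same shape as the "geonames" entries used above.
sample = [
    {"countryCode": "US", "countryName": "États-Unis"},
    {"countryCode": "FR", "countryName": "France"},
    {"countryCode": "CA", "countryName": "Canada"},
    {"countryCode": "EG", "countryName": "Égypte"},
]


def strip_accents(s):
    """Decompose characters (NFD) and drop the combining accent marks."""
    return "".join(c for c in unicodedata.normalize("NFD", s)
                   if unicodedata.category(c) != "Mn")


def country_to_option(country):
    """Render one country as an HTML option tag, escaping both values."""
    return '<option value="%s">%s</option>' % (
        html.escape(country["countryCode"]),
        html.escape(country["countryName"]),
    )


# Sort on the accent-free name so "Égypte" lands next to "E", not after "Z".
ordered = sorted(sample, key=lambda c: strip_accents(c["countryName"]))
for country in ordered:
    print(country_to_option(country))
```

Sorting on the raw names would compare code points, and a precomposed "É" comes after every unaccented capital letter, which is not what a user expects in a dropdown.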

Hope it's useful to someone else.

The code is on Github.