The Importance of Idempotence

August 15, 2014

tl;dr Idempotence helps create more robust systems.

Idempotence is a mathematical concept that should be understood by all developers.

An operation is considered idempotent when doing it more than once is the same as doing it once.

For example, multiplying by 1 is idempotent.

x * 1 == x * 1 * 1

Multiplying by 0 is also idempotent.

x * 0 == x * 0 * 0

The key concept to remember is that applying the operation once can have side effects, but applying it more than once will not do anything more that what was already done the first time.

Assignment is idempotent.

x := 4

You can assign 4 to x as much as you want, x will still be 4. But assigning 4 to x one time is different than zero time.

HTTP Verbs

HTTP verbs can be classified as idempotent or not.

DELETE is an idempotent verb. No matter how many times you do it after the first time, it will give the same result as the first time. For example, DELETE /users/4/contacts/3 could remove your contact with the ID 3. If you call it again, that contact has already been removed and nothing more should happen.

GET is also idempotent. In fact, it's more than idempotent. It is considered a safe method. Safe methods can be compared to multiplying by 1. Doing it zero, once or more times should have the same effect. All GET does is get a resource. For example, you should never use normal links to delete resources.

POST is not idempotent. Every time you do it, you can expect a side effect to happen. For example, every time you POST a contact form, an email is sent.

When it comes to APIs, that concept is well understood by consumers and providers. Designing around it will result in least astonishment.

See Wikipedia: Safe Methods and Idempotent methods.

Message Queues

Let's say you build a web app to manage events. You can add people to the invitees of an event. In other words, an event has many invitees. For quicker response time, you decided to send all emails from a worker. So when a user finalizes an event, a message is queued. A worker gets that message and sends the invite emails for that event.

This is a really common pattern, and if you are not doing it that way, you should really start to.

You realize that you had an SMTP problem and all your emails have not been sent for some time, and it's not even showing up in the logs! You think "oh, well, I'll just call the function to send the emails again", but you don't know for which events you should call that function.

Here comes idempotence.

When you execute a task in a worker, always make sure it's idempotent

For the email example, every time you send an email to an invitee, you can keep the datetime in the database, in the event invitee row. If it has already been sent (sent datetime is not NULL), don't send it again. As easy as that. Also, you might want to check if the event is finalized. If not finalized, do nothing.

More generally:

Check that your job is ready to be executed (e.g. event finalized). If not, do nothing.
Check that the job has not already been done. If done, do nothing.
Do job, keep datetime or something else in the database, log that it has been done (e.g. "INFO Invitation email for event 234 has been sent to john.doe@example.com").
Keep jobs granular. You have 5 emails to send? Queue a job (e.g. send_event_emails(event_id)) that will queue the 5 other jobs (e.g. send_event_email_to_invitee(event_id, invitee_id)).

You realize something went wrong? You can always call your function to send emails on all events. Still crashed when half of the emails were sent? Fix what was wrong and just call it again. Also, it's easy to inspect what emails have not been sent yet. Bonus, you can do some intelligence with the datetimes (how many emails a day do we send? what are the peak hours?).

Also, some message queues don't garantee that a message will be delivered only once. Amazon SQS is that way. Your workers should really only do idempotent tasks.

SQL Migrations

In the same spirit as the worker example above, when you do an SQL migration, do it in an idempotent way when possible.

For example, you decide to split the user table in two tables. One for basic informations (users) and one for all details that are not always important (profiles). You put a foreign key user_id in the profiles table. You have a migration that takes every row in users (SELECT * FROM users) and inserts a row in profiles with user data. You run it, and well, it crashes midway after 1 hour, because of some NULL value you didn't think about. You fix it and run it again, but you then realize that some users have already been processed and have two profiles.

The idempotent solution: instead of SELECT * FROM users, you can just select the users that don't have a profile row. That way you can run it as many times as you want. It will only process the few users that have not been processed yet. A big advantage of that method is that you can leave your app running in production while you do the migrations. When you are ready to deploy the new code that uses the profiles table, you can call your function again to make sure the latest users that signed up are also migrated. That example is not so great, because a user could change some information in the users table during the migration, but I guess you get the point.

Denormalized Data

You have an application where each user has many documents. They can search their documents by tags. Tags come from many sources. The title of the documents, the folders, actual tags, the names of the authors, etc. You decided to keep a table named tags where you keep all the tags for every documents, it looks like this: id (hash), tag (actual text of the tag), document_id (foreign key to the document). When you add an author to a document, it does an insert in the tags table. When you remove an author, it finds the good tag and removes it.

Someday, you see in the log there was an error every once in a while when inserting a tag because of an obscure character encoding issue. You fix the issue and deploy the new code. However, there are a lot of missing tags and you have no easy way to fix it manually.

Instead of just having functions of the type add_tag_for_new_author, you should have a function of the type update_tags_for_document. When you call that function, instead of just adding a tag for the author that was just added, it checks all the document, rebuilds the tags list and makes sure that the correct data is in the database. That way, the tags table is really managed as it should be: a cache. You could delete all rows from that table and just call update_tags_for_document on every documents. It takes 2s to update the tags for a document? Let the worker do it, queue a message.

Conclusion

If you were not aware of idempotence, I hope I convinced you to use it. Also, please note that I kept the examples simplistic for educational purpose.