
Be careful with Django's .update_or_create()!

Although using Model.objects.update_or_create() with a Django model is extremely convenient, some usage patterns deserve more care than others.

Excessive Postgres WAL files

To allow point-in-time recovery (PITR), I usually set up Barman: install it on a remote server, get your Postgres instance to rsync WAL files over to the Barman server - and you've got a set of backups which should allow you to recover your database from an earlier point in time.
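The shipping side is plain WAL archiving. A minimal postgresql.conf sketch for the classic rsync-based Barman setup - the hostname and incoming directory here are examples, adjust for your own Barman server:

# postgresql.conf on the database server
wal_level = archive    # 'replica' on Postgres 9.6 and later
archive_mode = on
archive_command = 'rsync -a %p barman@backup:/var/lib/barman/main/incoming/%f'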

For every UPDATE/INSERT, Postgres writes data to the WAL (Write-Ahead Log). If you're not doing anything with the finished WAL files this isn't a problem - you'll have increased disk I/O, but probably only a minor increase you won't notice.
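If you want to measure how much WAL a particular job generates, compare the WAL position before and after it runs. These are the Postgres 9.x function names - Postgres 10 renamed them to pg_current_wal_lsn() and pg_wal_lsn_diff():

postgres=# SELECT pg_current_xlog_location();  -- note this before the job
postgres=# SELECT pg_current_xlog_location();  -- and again afterwards
postgres=# SELECT pg_xlog_location_diff('<after>', '<before>');  -- bytes of WAL written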

However, if you're keeping the WAL and archiving it for backups, every update to a table row gets backed up too:

$ barman list-backup all
golestandt 20160613T063007 - Mon Jun 13 06:36:06 2016 - Size: 4.5 GiB - WAL Size: 12.7 GiB

Ouch.

Finding the problem databases

Assuming you've got statistics collection enabled (track_counts is on by default), use psql to show the statistics for each database:

postgres=# SELECT datname,tup_updated,tup_inserted,tup_deleted FROM pg_stat_database ORDER BY tup_updated DESC;
                datname                | tup_updated | tup_inserted | tup_deleted
---------------------------------------+-------------+--------------+-------------
 site1                                 |     1138475 |       191605 |      136569
 site2                                 |      153224 |        46650 |       12385

Now we've got a list of databases which are excessively updating and inserting new rows.
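To drill down further, connect to one of the suspect databases and check the per-table statistics in pg_stat_user_tables:

site1=# SELECT relname,n_tup_upd,n_tup_ins,n_tup_del FROM pg_stat_user_tables ORDER BY n_tup_upd DESC LIMIT 10;

The tables at the top of that list are where to start looking in the application code.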

How did we get here?

By being too lazy with .update_or_create() in applications doing regular syncs.

If you've got Django code syncing with a third party service, it's easy to write code similar to:

for tweet in tweet_list:
    Tweet.objects.update_or_create(tweet_id=tweet['id'], defaults={
        'user': tweet['user']['screen_name'],
        'text': tweet['text'],
        'retweet_count': tweet['retweet_count'],
    })

If you're only updating a few objects occasionally, this is fine and does the job. However, if you're syncing hundreds of objects every hour, the row updates add up fast - 600 objects an hour is already over 100,000 updates a week - even though most of the data is likely to stay the same.

Every field on the object gets written every time the object is saved, whether it changed or not. With a model with more fields, the extra WAL data adds up very quickly.
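You can see this for yourself in a Django shell with DEBUG=True (so Django records executed queries) and a Tweet model like the one used above - when the row already exists, the UPDATE statement sets every column in defaults, changed or not:

from django.db import connection, reset_queries

reset_queries()
Tweet.objects.update_or_create(tweet_id=123, defaults={
    'user': 'example', 'text': 'hello', 'retweet_count': 0,
})
for query in connection.queries:
    print(query['sql'])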

Avoiding updates

If only some of the synced data actually changes, update just the fields which can receive updates:

for tweet in tweet_list:
    obj, created = Tweet.objects.get_or_create(tweet_id=tweet['id'], defaults={
        'user': tweet['user']['screen_name'],
        'text': tweet['text'],
        'retweet_count': tweet['retweet_count'],
    })

    if not created:
        # Update counts, but try to avoid excessive updates
        update_fields = []

        if obj.retweet_count != tweet['retweet_count']:
            obj.retweet_count = tweet['retweet_count']
            update_fields.append('retweet_count')

        if update_fields:
            obj.save(update_fields=update_fields)

In this example we're assuming tweet IDs or usernames won't ever change. This is probably a safe assumption for IDs, and slightly less so for usernames.

By switching to .get_or_create() and using .save(update_fields=...), we've significantly reduced the number of updates. If the object already exists and none of the fields have changed, update_fields stays an empty list and we don't even bother calling .save().

The only downside of this method is that it's tedious to check every single field for changes by hand. We could write a more generic solution - a sketch follows - but at that point it's easier to reach for a third party package which already solves the problem.
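A generic version might look something like this sketch - update_changed_fields is a hypothetical helper, not anything from Django:

def update_changed_fields(obj, data):
    # Assign each value, recording which fields actually changed.
    changed = []
    for field, value in data.items():
        if getattr(obj, field) != value:
            setattr(obj, field, value)
            changed.append(field)
    return changed

# After get_or_create(), for existing objects:
# changed = update_changed_fields(obj, {'retweet_count': tweet['retweet_count']})
# if changed:
#     obj.save(update_fields=changed)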

Django Dirty Fields

Django Dirty Fields is an easy alternative to writing your own code to check every field of an object for changes. Add DirtyFieldsMixin to your model and it tracks which fields have been modified since the object was loaded.
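A minimal model definition might look like this (the field types are assumptions based on the tweet data used above):

from dirtyfields import DirtyFieldsMixin
from django.db import models

class Tweet(DirtyFieldsMixin, models.Model):
    tweet_id = models.BigIntegerField(unique=True)
    user = models.CharField(max_length=255)
    text = models.TextField()
    retweet_count = models.IntegerField(default=0)

The sync loop then simplifies to: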

for tweet in tweet_list:
    try:
        obj = Tweet.objects.get(tweet_id=tweet['id'])
    except Tweet.DoesNotExist:
        obj = Tweet(tweet_id=tweet['id'])

    obj.user = tweet['user']['screen_name']
    obj.text = tweet['text']
    obj.retweet_count = tweet['retweet_count']

    # Only save if needed
    if not obj.id:
        obj.save()
    elif obj.is_dirty():
        obj.save_dirty_fields()

Considerably easier to deal with!

The one huge advantage of this approach is that you can add as many fields as you want, and Django Dirty Fields will do the job of figuring out which fields to update for you. As before, if none of the fields have changed, the object won't be saved.

The Results

After trying to fix several projects:

$ barman list-backup all
golestandt 20160704T063009 - Mon Jul  4 06:33:29 2016 - Size: 4.6 GiB - WAL Size: 822.0 MiB

There are still a few old projects which haven't been optimised - however we've now managed to reduce the size of the weekly WAL to less than the weekly snapshot.
