
The cost of Dirty Fields

After installing Django Dirty Fields on a few projects some months ago, we saw a dramatic reduction in the number of writes to our main Postgres database, and everything seemed fine. On a brand new project, however, something wasn't quite right performance-wise:

$ siege --concurrent=1 --reps=10 "http://127.0.0.1:8000/map/?lat=51.4995&lng=0.1248"
...

Transactions:                     10 hits
Availability:                 100.00 %
Elapsed time:                  11.85 secs
Data transferred:               2.27 MB
Response time:                  0.88 secs
Transaction rate:               0.84 trans/sec
Throughput:                     0.19 MB/sec
Concurrency:                    0.75
Successful transactions:          10
Failed transactions:               0
Longest transaction:            0.96
Shortest transaction:           0.83

Painfully slow! Although it wasn't the most optimised code possible, an average of 880ms per request wasn't acceptable.

Investigating the cause

As the view for this request generates a JSON response, Django Debug Toolbar wasn't a viable option, since its debugging output only gets attached to HTML responses. So instead I decided to step through the code with shell_plus from Django Extensions and IPython.
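For anyone without IPython to hand, the stdlib timeit module gives comparable numbers to %timeit. A sketch using a stand-in callable; in the shell_plus session the lambda would wrap the queryset instead:

```python
import timeit

# Stdlib equivalent of IPython's %timeit, shown on a stand-in callable.
# In a shell_plus session the lambda would wrap the queryset instead,
# e.g. lambda: list(Venue.objects.all()[:100]).
number = 1000
elapsed = timeit.timeit(lambda: sum(range(1000)), number=number)
print(f"{elapsed / number * 1e6:.1f} µs per loop")
```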

After going through parts of the code, one of the querysets seemed slower than expected:

>>> %timeit venue_list = list(Venue.objects.all()[:100])
10 loops, best of 3: 101 ms per loop

Over 100ms to go through a fairly small queryset? This is far too slow. Just to see if it's a database problem, we'll change it to use values_list instead, which returns plain tuples rather than full model instances:

>>> %timeit venue_list = Venue.objects.values_list('id', 'name', 'location')
10000 loops, best of 3: 80.3 µs per loop

As a quick sanity check, I also tested a model from another app to ensure there were no problems elsewhere:

>>> %timeit permission_list = list(Permission.objects.all())
100 loops, best of 3: 2.29 ms per loop

So something is obviously wrong with the queryset/model.

This model used DirtyFieldsMixin, one obvious difference from all the others, so the next test was to remove the mixin and see if that made any difference:

>>> %timeit venue_list = list(Venue.objects.all()[:100])
100 loops, best of 3: 8.04 ms per loop

From 101ms to 8ms.
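The overhead is consistent with how dirty-field tracking generally works: each model instance snapshots its field values as it's loaded, so the cost scales with the number of objects a queryset instantiates. A minimal, framework-free sketch of the idea (illustrative only, not DirtyFieldsMixin's actual implementation):

```python
import copy

class DirtyTracking:
    """Toy dirty-field tracker; illustrative only."""

    def __init__(self, **fields):
        self.__dict__.update(fields)
        # Snapshot every field at load time - this extra copy runs once
        # per instance, which is where a large queryset pays the price.
        self._original_state = copy.deepcopy(fields)

    def is_dirty(self):
        # Compare the current field values against the snapshot.
        current = {k: v for k, v in self.__dict__.items()
                   if k != '_original_state'}
        return current != self._original_state

venue = DirtyTracking(id=1, name='Somewhere')
assert venue.is_dirty() is False
venue.name = 'Somewhere else'
assert venue.is_dirty() is True
```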

After removing all instances of DirtyFieldsMixin, another performance test showed an improvement:

$ siege --concurrent=1 --reps=10 "http://127.0.0.1:8000/map/?lat=51.4995&lng=0.1248"
...

Transactions:                     10 hits
Availability:                 100.00 %
Elapsed time:                   7.98 secs
Data transferred:               2.27 MB
Response time:                  0.40 secs
Transaction rate:               1.25 trans/sec
Throughput:                     0.28 MB/sec
Concurrency:                    0.50
Successful transactions:          10
Failed transactions:               0
Longest transaction:            0.44
Shortest transaction:           0.36

From 880ms to 400ms - a big difference.

Testing django-model-utils

As Django Dirty Fields wasn't great for performance, I looked at an alternative that seems to offer similar functionality: the FieldTracker from django-model-utils.

Let's test by adding a FieldTracker field to a model:

>>> %timeit venue_list = list(Venue.objects.all()[:100])
10 loops, best of 3: 21.5 ms per loop

Much faster than Django Dirty Fields!

Some of the code that saves/updates objects needs changing for django-model-utils, as it doesn't have the same convenience methods:

if venue.id is None:
    # New object, so a full INSERT is needed.
    venue.save()
else:
    # Only write the columns whose values actually changed.
    changed_fields = venue.tracker.changed().keys()
    if changed_fields:
        venue.save(update_fields=changed_fields)

One subtle difference between the two packages is type handling: you'll need to ensure the values you assign have the same type as the stored data. It's possible to assign a string to an IntegerField and Django will still save it, coercing the value on the way to the database.

With Django Dirty Fields:

>>> venue = Venue.objects.get(id=14132)
>>> venue.grid_ref_x
442478
>>> venue.grid_ref_x = '442478'
>>> venue.is_dirty()
False
>>> venue.save()
>>> venue.grid_ref_x
'442478'
>>> venue.grid_ref_x = '123'
>>> venue.is_dirty()
True
>>> venue.save()
>>> venue.grid_ref_x
'123'
>>> venue.refresh_from_db()
>>> venue.grid_ref_x
123

With django-model-utils:

>>> venue = Venue.objects.get(id=14132)
>>> venue.grid_ref_x
442478
>>> venue.grid_ref_x = '442478'
>>> venue.tracker.changed()
{'grid_ref_x': 442478}

Here FieldTracker reports a change because 442478 != '442478' as far as Python is concerned, which would result in a lot of unnecessary saves even though the stored data ends up the same.
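If mixed types are unavoidable, one option is to normalise values before trusting the tracker. A hedged sketch - really_changed and the int coercion are illustrative helpers, not part of either package:

```python
def really_changed(old, new, coerce=int):
    """Treat values as changed only if they differ after coercion -
    a rough stand-in for what Django does when saving the field."""
    try:
        return coerce(old) != coerce(new)
    except (TypeError, ValueError):
        # Fall back to plain comparison if coercion isn't possible.
        return old != new

# 442478 and '442478' are different Python objects but the same stored value.
assert really_changed(442478, '442478') is False
assert really_changed(442478, '123') is True
```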

Using proxy models

Given that none of the code in the Django views used the dirty fields methods, and won't use the tracker field either, this seems like an ideal case for Django's proxy models. Instead of adding the tracker to the Venue model itself, we'll create a proxy model:

class VenueTracker(Venue):
    tracker = FieldTracker()

    class Meta:
        proxy = True

Now we've got two versions of the same model:

>>> %timeit venue_list = list(Venue.objects.all()[:100])
100 loops, best of 3: 8.85 ms per loop
>>> %timeit venue_list = list(VenueTracker.objects.all()[:100])
10 loops, best of 3: 21.2 ms per loop

So by default we'll use the standard Venue model in views for speed; when our update scripts need tracking, we simply change the import to point at the tracked version:

from .models import VenueTracker as Venue

Now we can have fast views as normal, but with the option to switch to the tracked version if needed.

© Alex Tomkins.