N+1 and redundant queries: An exhaustive guide
ORMs make it very easy to query databases. Let’s talk about how to avoid redundant queries. We’ll focus on Django, but these principles generalize to other ORMs.
Quick definitions
A redundant query is a query that isn’t necessary to fetch the data you need. Sometimes the data is already in memory because you’ve previously fetched it; other times, it’s one of N+1 queries. Both are extremely common in application servers.
N+1 queries occur when your application makes a series of queries to fetch related data that could have been fetched in one query: for instance, making one query for a list of N books and N more queries to fetch each book’s author, rather than fetching the books and their authors in one or two queries.
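To make the shape of the problem concrete, here’s a self-contained sketch with no ORM involved. The FakeDB class is invented for illustration; it just counts the queries it receives:

```python
# FakeDB is a made-up stand-in for a database that counts its queries.
class FakeDB:
    def __init__(self):
        self.query_count = 0
        self.authors = {1: "Ann", 2: "Bob"}
        self.books = [{"id": i, "author_id": 1 + i % 2} for i in range(10)]

    def fetch_books(self):
        self.query_count += 1
        return list(self.books)

    def fetch_author(self, author_id):
        self.query_count += 1
        return self.authors[author_id]

    def fetch_authors(self, author_ids):
        self.query_count += 1
        return {a_id: self.authors[a_id] for a_id in set(author_ids)}

db = FakeDB()

# N+1: one query for the books, then one more per book for its author
books = db.fetch_books()
for book in books:
    db.fetch_author(book["author_id"])
print(db.query_count)  # 11 queries for 10 books

# Batched: one query for the books, one for all of their authors
db.query_count = 0
books = db.fetch_books()
authors = db.fetch_authors([b["author_id"] for b in books])
print(db.query_count)  # 2 queries
```

The batched version stays at 2 queries no matter how many books there are, which is the whole point.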
(non n+1) redundant query examples
The main source of (non n+1) redundant queries is fetching data you’ve already fetched. Django’s ORM syntax is often more readable than dealing with python lists, so it’s easy to fall into this trap. Take the following example:
all_records = Record.objects.all()
first_record = all_records.first() # equivalently, all_records[0]
last_record = all_records.last()
Some might think the first line performs a query and the second and third don’t, but it’s actually the opposite. Querysets are ’lazy,’ so the first line doesn’t execute any query, while the second and third lines each perform a LIMIT 1 query. If you need all the records and also want to single out the first or last, use python list logic instead:
queryset = Record.objects.all() # no queries, lazy!
all_records_list = list(queryset) # one query
first_record = all_records_list[0] # no new query
last_record = all_records_list[-1] # no new query
Using the ORM to filter a queryset you’ve already fetched is another common source of redundant queries. Since queryset filtering happens at the database level, Django does not (and can’t!) translate filter expressions into python logic.
qs = Record.objects.all()
all_records_list = list(qs) # one query
# wrong approach: filter always returns a new queryset and fires a query
special_records_list = list(qs.filter(special=True))
# good approach: use python list logic
special_records_list = [ r for r in all_records_list if r.special ]
Other examples of common redundant queries are counting or checking if a queryset is empty:
qs = Record.objects.all()
all_records_list = list(qs) # one query
record_count = qs.count() # bad, count always fires off a query
record_count = len(all_records_list) # good, no new query
if qs:  # bad! this will fire off a query
    ...
is_empty = not qs.exists() # bad, exists always fires off a query
is_empty = len(all_records_list) == 0 # good, no new query
I’m not saying you should always avoid filter(), count() or exists(); ideally you shouldn’t fetch records you don’t need in the first place. But if you need to fetch all the records anyway, just use python list logic to filter or count them.
At the end of the day, these kinds of mistakes aren’t usually a big deal. Having a view fetch a bunch of records and fire off a couple of extra queries to count them is not a big performance hit; in big-O terms, we’re only adding O(1) queries. A more serious issue is making queries inside loops, the N+1 problem. This is problematic enough that Django added tools to avoid it.
Related data, caching and eager-loading
Recall that every foreign key has a source and a destination. Let’s call the source the ‘child’ and the destination the ‘parent’, e.g. a book’s author is its parent, and an author’s books are its children.
The ORM makes accessing parents very easy via attributes. These attribute lookups will perform a query unless the result is already cached:
child = Children.objects.first() # a query
child.parent # another query
child.parent # no query, cached from previous attr lookup
Note that if you’re iterating over children and accessing their parents, the ORM will perform a query for each child. This happens even if they all share the same parent, because parent-attribute caching is per-instance; Django doesn’t globally keep track of which IDs it has already fetched.
for child in Children.objects.all():  # one query
    print(child.parent)  # one query per loop iteration. N+1 problem!
You can eager-load (and cache) all of the parents by using select_related. This approach performs a single query with a SQL join to fetch all of the parents.
# one query (SELECT * FROM children JOIN parents ON children.parent_id = parents.id)
child = Children.objects.select_related('parent').first()
child.parent # no query, cache is pre-populated by the join
for child in Children.objects.select_related('parent').all():  # one query
    print(child.parent)  # no extra queries in the loop
Fetching parents is straightforward, but fetching children is more complex. Accessing children is not just an attribute lookup; it’s done through the related manager, via whatever related_name you passed to the model’s foreign key field. Accessing children has all of the problems of accessing parents, but it’s worse, because the ORM doesn’t cache related-manager calls.
parent = Parent.objects.first() # one query
list(parent.children.all()) # another query
list(parent.children.all()) # duplicate query, children relationships are not cached
for parent in Parent.objects.all():  # one query
    print(list(parent.children.all()))  # one query per loop iteration. N+1 problem!
Fortunately, you can prefetch and cache all of the children by using prefetch_related. This performs a single extra query to fetch all of the children and caches them.
parent = Parent.objects.prefetch_related('children').first() # two queries
list(parent.children.all()) # no query, it's in the prefetch cache
for parent in Parent.objects.prefetch_related('children'):  # two queries
    print(list(parent.children.all()))  # no additional query
select_related vs. prefetch_related?
select_related performs a SQL join, and so only works on parent relationships. prefetch_related performs a separate query and does the join in python: after the first query, it packages all the parent IDs into a big IN clause, i.e. SELECT * FROM parents WHERE id IN (...). It works on child relationships and many-to-many relationships, and it also works on parent relationships, so you can use it in both cases. That may make select_related seem redundant, but it’s generally faster: if you’re fetching a child table and eager-loading 20 different parent tables, select_related will be much faster, because it performs 1 query instead of 21.
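The “join in python” step can be sketched in a few lines of plain Python. This illustrates the strategy, not Django’s actual implementation, and the function names are invented:

```python
from collections import defaultdict

def prefetch_children(parents, fetch_children_for_parent_ids):
    """Sketch of prefetch_related's strategy: one extra query with a big
    IN clause, then matching child rows to their parents in Python."""
    parent_ids = [p["id"] for p in parents]
    # stands in for: SELECT * FROM children WHERE parent_id IN (...)
    children = fetch_children_for_parent_ids(parent_ids)
    by_parent = defaultdict(list)
    for child in children:
        by_parent[child["parent_id"]].append(child)
    for parent in parents:
        parent["children"] = by_parent[parent["id"]]  # pre-populate the "cache"

# toy data standing in for database rows
parents = [{"id": 1}, {"id": 2}]
rows = [{"id": 10, "parent_id": 1}, {"id": 11, "parent_id": 1}]
prefetch_children(parents, lambda ids: [r for r in rows if r["parent_id"] in ids])
print([c["id"] for c in parents[0]["children"]])  # [10, 11]
print(parents[1]["children"])  # []
```

However many parents there are, this costs exactly one extra “query”, which is why prefetch_related turns N+1 into 2.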
Beware complex prefetches
If you want to prefetch filtered children for each parent, you can use special Prefetch objects. These give you finer control over the prefetch query by letting you provide a custom queryset. Warning: they can surprise you!
prefetch_obj = Prefetch('children', queryset=Children.objects.filter(special=True))

for parent in Parent.objects.prefetch_related(prefetch_obj):  # two queries
    print(list(parent.children.filter(special=True)))  # additional query!

for parent in Parent.objects.prefetch_related(prefetch_obj):  # two queries
    print(list(parent.children.all()))  # no additional query; all() actually returns the prefetched filtered children!
It turns out that the .all() call above actually returns the filtered children, while the .filter() call isn’t cached and fires off a query. Both cases are surprising. I’m sure the framework developers had good reasons, and you could argue it would be just as confusing if it worked the other way around. Still, I recommend avoiding filters on prefetched entities, because this behavior is difficult to remember.
If you really need to filter, annotate or otherwise customize the prefetch query, I recommend using the to_attr keyword so the filtered children are stored as an explicit list:
prefetch_obj = Prefetch('children', queryset=Children.objects.filter(special=True), to_attr='special_children')

for parent in Parent.objects.prefetch_related(prefetch_obj):  # two queries
    print(parent.special_children)  # no additional query
And if you really need to, you can compose Prefetch
and custom attributes to traverse multiple levels of relationships.
children_prefetch = Prefetch('children', queryset=Children.objects.filter(special=True), to_attr='special_children')
grandchildren_prefetch = Prefetch('special_children__children', queryset=GrandChildren.objects.filter(special=True), to_attr='special_grandchildren')

for parent in Parent.objects.prefetch_related(children_prefetch, grandchildren_prefetch):  # three queries
    print(parent.special_children)  # no additional query
    print(parent.special_children[0].special_grandchildren)  # no additional query
prefetch_related is powerful, but it can get confusing quickly. When your views and templates start relying on these loosely defined custom attributes (good luck statically typing those!), you couple your view and presentation logic to the fetching logic, making it difficult to re-use either in other use cases.
Level two: Storing data in context
Even with eager loading, some queries may still be redundant. To prevent redundancy, it’s common to assemble some temporary data-structures (plain old dicts and lists) in your views, so you can process them before sending off data for rendering. This usually involves looping over querysets and indexing records by ID in a dictionary.
This works well if you’re doing everything in one function, but these functions tend to grow big, and you’ll find yourself repeating the same logic in different views. How can you split it up without repeating queries? Your view could just do all the fetching and pass this data to the template and any helpers, but then you’ll couple your view logic to your template and helper logic.
One neat solution I like is memoizing query functions, using the request object as a cache. I’ve created django-data-fetcher for this.
# query_helpers.py
from data_fetcher import cache_within_request

@cache_within_request
def get_product_types_by_id():
    products = ProductType.objects.all()
    products_by_id = {p.id: p for p in products}
    return products_by_id
Now a view can call get_product_types_by_id if it needs a product type, and utilities and template helpers can also call this function without repeating the query. This is especially useful for “lookup” data: tables that don’t have many records but are frequently used and worth loading completely into memory. Common examples include product types, categories and tags.
This decorator also supports functions with arguments, so long as they’re valid dictionary keys (i.e. hashable), as they’ll be used to index the cache.
@cache_within_request
def get_products_for_order(order_id: int):
    products = Product.objects.filter(order_id=order_id)
    return list(products)
This kind of memoization is great for user- or tenant-related data: it’s common for multiple helpers to need the same data about user preferences, permissions, etc.
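The mechanism behind @cache_within_request can be sketched in plain Python. This is not the library’s actual code: a module-level dict stands in for the per-request cache (the real decorator keys its cache off the active request, so entries can’t leak across requests):

```python
import functools

_request_cache = {}  # stand-in for a cache scoped to the current request

def cache_within_request_sketch(func):
    @functools.wraps(func)
    def wrapper(*args):
        key = (func.__qualname__, args)  # args must be hashable
        if key not in _request_cache:
            _request_cache[key] = func(*args)
        return _request_cache[key]
    return wrapper

call_count = 0  # stands in for the number of queries performed

@cache_within_request_sketch
def get_products_for_order(order_id):
    global call_count
    call_count += 1
    return [f"product-for-{order_id}"]

get_products_for_order(1)
get_products_for_order(1)  # served from the cache, no new "query"
get_products_for_order(2)  # different argument, so a new "query"
print(call_count)  # 2
```

The real decorator behaves the same way from the caller’s perspective: same function, same arguments, one query per request.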
Memoizing functions is a great way to avoid non n+1 redundant queries while keeping your code clean, but it’s awkward for batching (n+1) logic. For that, we need something more advanced.
Level three: Request-scoped data fetchers (advanced)
For complex data retrieval needs, data fetchers provide a structured way to batch-load and cache query results efficiently.
The idea is to split up data into resource units with unique IDs. These resources can be lists, single records, or anything. Consumers like the view, template, helpers, etc. request these resources via unique IDs. The view can optionally pre-populate data-fetchers so that downstream consumers don’t create extra queries.
To use datafetchers, we subclass the DataFetcher class and define a batch_load_dict method.
from data_fetcher import DataFetcher

class ProductByIdFetcher(DataFetcher):
    def batch_load_dict(self, ids):
        products = Product.objects.filter(id__in=ids)
        return {p.id: p for p in products}
The batch_load_dict method takes a list of unique IDs and must return a dict of resources indexed by those IDs. In practice, get-model-by-id or get-by-parent-id is common enough to have shortcut subclasses.
Datafetchers have a simple public API:
- get(id): fetch a single unit
- get_many(ids): fetch multiple units
- prefetch_keys(ids): prefetch units for the sake of performance
Fetcher instances cache data in their own local attributes, so we need to make sure we’re using the same instance throughout the entire request. Rather than construct instances directly, we use DataFetcherClass.get_instance(). This uses the same request-caching mechanism as @cache_within_request to provide request-scoped singletons.
Request-cached functions and datafetchers can be used together, and datafetchers can call each other too.
The advantage of datafetchers is that they can be used to fetch in bulk or individually. They even have a prime(key, value) method that can be used to populate their caches with data. If you’re feeling especially stingy with queries, you can have your ChildrenByParentFetcher prime a ChildFetcher with the children it fetches, and then your templates, helpers and views can just use fetchers everywhere without worrying about repeating queries.
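The whole contract can be sketched in plain Python. The class names below are invented and this is not django-data-fetcher’s actual code, but it shows how batch_load_dict, prefetch_keys, get, get_many and prime fit together:

```python
class SketchFetcher:
    """Minimal datafetcher sketch: subclasses implement
    batch_load_dict(ids) -> {id: value}; the base class batches and caches."""
    _instances = {}  # stand-in for the per-request instance cache

    def __init__(self):
        self._cache = {}

    @classmethod
    def get_instance(cls):
        # request-scoped singleton (here process-scoped, for illustration)
        if cls not in cls._instances:
            cls._instances[cls] = cls()
        return cls._instances[cls]

    def batch_load_dict(self, ids):
        raise NotImplementedError

    def prefetch_keys(self, ids):
        missing = [i for i in ids if i not in self._cache]
        if missing:
            self._cache.update(self.batch_load_dict(missing))

    def get(self, id):
        self.prefetch_keys([id])
        return self._cache.get(id)

    def get_many(self, ids):
        self.prefetch_keys(ids)
        return [self._cache.get(i) for i in ids]

    def prime(self, key, value):
        self._cache[key] = value  # pre-populate the cache directly

class SquareFetcher(SketchFetcher):
    queries = 0  # counts batch loads, standing in for SQL queries

    def batch_load_dict(self, ids):
        SquareFetcher.queries += 1
        return {i: i * i for i in ids}

fetcher = SquareFetcher.get_instance()
fetcher.prefetch_keys([1, 2, 3])  # one batched "query"
print(fetcher.get(2))             # 4: cache hit, no query
print(fetcher.get_many([1, 4]))   # [1, 16]; only 4 was missing, one more query
print(SquareFetcher.queries)      # 2
```

Note how individual get calls after a prefetch are free, which is what lets views prefetch in bulk while templates and helpers fetch one unit at a time.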
Bulk logic can often be complex, so it’s recommended to unit-test data fetchers and use inheritance to share logic.
An example with various levels of optimization
Level zero: N+1 queries
Let’s start with an example. We want to display a list of parents, but we want to highlight the ones that have special=True children. The naive Django-101 way to do this would be with N+1 queries:
{# template.jinja2 #}
<ul>
  {% for parent in object_list %}
    <li>
      {{ parent.name }}
      {% if parent.children.filter(special=True).exists() %}
        <span class="badge">Has special children!</span>
      {% endif %}
    </li>
  {% endfor %}
</ul>
(We’re using jinja templating because it allows us to call methods; django templates are less blog-friendly.)
Obviously, this is going to create N+1 queries. Let’s optimize this by prefetching the children in the view.
Level one: Using Django’s eager-loading tools
Because we’re using a filtered relationship, we can’t just use a plain prefetch. We need to use a Prefetch
object to filter the children and store it on a custom attribute, then change which attribute we use in the template.
# view.py
def parent_list(request):
    parents = Parent.objects.prefetch_related(
        Prefetch(
            'children',
            queryset=Children.objects.filter(special=True),
            to_attr="special_children"
        )
    ).all()
    return render(request, 'parent_list.html', {'object_list': parents})
{# template.jinja2 #}
{% for parent in object_list %}
  ...
  {% if parent.special_children %}
  ...
This solution issues only 2 queries, but it’s rather brittle: to reuse this behavior in another view/template, we need to copy logic from both the view and the template.
Level two: Memoizing query logic with the request-cache decorator
Keep in mind that the cache decorator is great for redundant queries, but not for batching. I’m presenting this potential solution only for completeness’ sake: indexing records in a request-scoped dictionary singleton.
# helpers.py
@cache_within_request
def _special_children_by_parent_dict():
    """
    This is kind of a request-scoped "global" or singleton,
    underscored for 'privacy'; it shouldn't be accessed too broadly
    """
    return dict()

def get_bulk_special_children_by_parent(parent_ids):
    cache_dict = _special_children_by_parent_dict()
    missing_ids = [id for id in parent_ids if id not in cache_dict]
    if missing_ids:
        for id in missing_ids:
            cache_dict[id] = []  # parents with no special children must be cached too
        children = Children.objects.filter(special=True, parent_id__in=missing_ids)
        for child in children:
            cache_dict[child.parent_id].append(child)
    return {id: cache_dict[id] for id in parent_ids}

def get_special_children_for_parent(parent_id):
    cache_dict = _special_children_by_parent_dict()
    if parent_id not in cache_dict:
        children = Children.objects.filter(special=True, parent_id=parent_id)
        cache_dict[parent_id] = list(children)
    return cache_dict[parent_id]
# views.py
def parents_list(request):
    parents = Parent.objects.all()
    get_bulk_special_children_by_parent([p.id for p in parents])
    return render(request, 'parent_list.html', {'object_list': parents})
{# template.jinja2 #}
{% for parent in object_list %}
  ...
  {# how you access helpers depends on your template configuration #}
  {% if helpers.get_special_children_for_parent(parent.id) %}
  ...
Admittedly, this is actually more complicated than filtered prefetches, but keep in mind this is not the intended use of @cache_within_request. I picked this example to showcase datafetchers.

If you’re stuck with an approach like this and want to clean it up, note that you could group these 3 functions into a class; the decorator works fine on static and class methods. Just make sure you don’t store any data on the class or instance, since it might leak data across requests. Even with this cleanup, there are still a lot of cases to unit test!
Level three: Using datafetchers
Although there are shortcut subclasses that make it even easier, let’s write out a data-fetcher class:
# helpers.py
from collections import defaultdict

class SpecialChildrenByParentIdFetcher(DataFetcher):
    @staticmethod
    def batch_load_dict(parent_ids):
        children = Children.objects.filter(special=True, parent_id__in=parent_ids)
        children_by_parent_id = defaultdict(list)
        for child in children:
            children_by_parent_id[child.parent_id].append(child)
        return children_by_parent_id
# views.py
def parents_list(request):
    parents = Parent.objects.all()
    fetcher = SpecialChildrenByParentIdFetcher.get_instance()
    fetcher.prefetch_keys([p.id for p in parents])  # not strictly necessary, but prevents N+1 queries
    return render(request, 'parent_list.html', {'object_list': parents})
{# template.jinja2 #}
{% for parent in object_list %}
  ...
  {% if helpers.SpecialChildrenByParentIdFetcher.get_instance().get(parent.id) %}
  ...
Because the base DataFetcher class is thoroughly tested, you only need to test that your batch_load_dict method works as expected.
Check your work
N+1 issues are often hard to notice in dev. I recommend checking your work with django-debug-toolbar, which has a panel that shows you how many queries were made and how long they took. It works with AJAX/HTMX without any additional configuration. If for some reason this isn’t ideal (e.g. you’re working on backend features rather than in the browser) you can also just log queries to your console by modifying your settings:
# settings.py
LOGGING = {
    'version': 1,
    'filters': {
        'require_debug_true': {
            '()': 'django.utils.log.RequireDebugTrue',
        }
    },
    'handlers': {
        'console': {
            'level': 'DEBUG',
            'filters': ['require_debug_true'],
            'class': 'logging.StreamHandler',
        }
    },
    'loggers': {
        'django.db.backends': {
            'level': 'DEBUG',
            'handlers': ['console'],
        }
    }
}
It’s also possible to check for N+1 queries in tests. This works in both unittest and pytest-django:
# unittest
class MyTestCase(SimpleTestCase):
    def test_foo_queries(self):
        with self.assertNumQueries(2):
            fetchy_function()

# pytest
def test_foo_queries(django_assert_num_queries, django_assert_max_num_queries):
    with django_assert_num_queries(2):
        fetchy_function()
    # or test an upper bound
    with django_assert_max_num_queries(3):
        fetchy_function()
Because views make lots of extra queries for things like auth and middleware, it’s difficult to predict how many queries they should make. I also don’t like to create too many objects in tests, because it slows them down. So I rely on the debug toolbar to catch N+1 issues during development, and I’m not usually paranoid enough to write query-count tests for most views. I will sometimes verify that API or CSV output views don’t make too many queries, because those are easy to miss.