Progressively Improving N+1 Queries in GraphQL
In this post, we’ll look at the N+1 query problem that often arises in GraphQL and explore how to address it using DataLoader. Rather than diving too deeply into how to implement DataLoader step-by-step, we’ll focus on why you should consider using it. For reference, the final code discussed here is available on my GitHub.
The N+1 Problem
Let’s start with a well-known scenario: imagine we have `Post` and `Comment` models with a one-to-many relationship. The GraphQL schema might look like this:
```graphql
type Comment {
  id: Int!
  content: String!
}

type Post {
  id: Int!
  title: String!
  content: String!
  comments: [Comment!]!
}

type Query {
  posts: [Post!]!
}
```
```typescript
const postResolver: Resolvers = {
  Post: {
    comments: async (parent, __, context) => {
      return await context.prisma.comment.findMany({
        where: { postId: parent.id },
      })
    },
  },
  Query: {
    posts: async (_, __, context) => {
      const posts = await context.prisma.post.findMany({
        take: 20,
      })
      return posts
    },
  },
}
```
When we implement a resolver that fetches `posts`, and the client doesn’t request `comments`, there’s usually no issue. The `posts` field might result in just a single SQL query to the database. In this situation there’s no performance problem. (You might still be over-fetching columns you don’t need, but that’s a separate concern from the N+1 issue.)
```graphql
query ExampleQuery {
  posts {
    id
    comments {
      id
    }
  }
}
```
But what happens if the client’s query also includes the `comments` field? Because GraphQL resolves fields in a nested manner, once `posts` is resolved, it proceeds to resolve the `comments` for each individual `Post`. If you have N posts, that means N additional queries for the comments, for a total of N+1 queries. This leads to significant network overhead and can quickly overload the database under heavier traffic.
Even if you don’t feel the performance hit in early testing or low-traffic scenarios, this N+1 pattern becomes a serious bottleneck as the application scales. It’s also hard to troubleshoot because each comment request is processed independently by the database, which might lead to inefficient query planning on the DB side.
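To make the cost concrete, here’s a minimal standalone simulation. The `fakeDb` object and its query counter are made up for illustration (this is not a real Prisma client), but the resolution order mirrors the naive resolvers above:

```typescript
// Standalone sketch (no real database): a fake client that just counts
// queries, to show how a per-post comments resolver turns 1 query into N+1.
type FakeComment = { id: number; postId: number }

let queryCount = 0

const fakeDb = {
  findPosts(take: number): { id: number }[] {
    queryCount++
    return Array.from({ length: take }, (_, i) => ({ id: i + 1 }))
  },
  findCommentsByPostId(postId: number): FakeComment[] {
    queryCount++
    return [{ id: postId * 100, postId }]
  },
}

// Naive resolution order: one query for the posts, then one per post.
const posts = fakeDb.findPosts(20)
for (const post of posts) {
  fakeDb.findCommentsByPostId(post.id)
}

console.log(queryCount) // 21 queries for 20 posts: the N+1 pattern
```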
Using Prefetch to Avoid N+1
```typescript
const postResolver: Resolvers = {
  Query: {
    posts: async (_, __, context) => {
      const posts = await context.prisma.post.findMany({
        take: 20,
        include: { comments: true },
      })
      return posts
    },
  },
}
```
A common solution is to eagerly load related data upfront. For example, you can modify your `posts` resolver to also fetch all related comments in a single batch query. With this approach, you’ll typically see only two queries in your logs:

- One query to fetch the `Post` records.
- One query that uses a `WHERE ... IN` clause to fetch the associated `Comment` records for those posts.

This solves the N+1 problem, but it isn’t perfect. If the client doesn’t request `comments`, you’re still fetching them anyway, resulting in unnecessary over-fetching.
Leveraging the `info` Object
Wouldn’t it be great if we could decide at runtime whether to fetch comments based on whether the client requested them? GraphQL’s `info` object, available in every resolver, contains all sorts of execution details, including which fields the client has requested.
```typescript
import graphqlFields from 'graphql-fields'

const postResolver: Resolvers = {
  Query: {
    posts: async (_, __, context, info) => {
      const requestedFields = graphqlFields(info)
      const hasRequestedCommentsField =
        Object.keys(requestedFields).includes('comments')
      const posts = await context.prisma.post.findMany({
        take: 20,
        ...(hasRequestedCommentsField && { include: { comments: true } }),
      })
      return posts
    },
  },
}
```
By using a library like `graphql-fields`, we can parse the `info` object and check whether the client requested `comments`. Based on that, we can conditionally add `comments` to the query or skip them.
```graphql
query ExampleQuery {
  posts {
    id
  }
}

# SELECT `public`.`Post`.`id`, `public`.`Post`.`title`, `public`.`Post`.`content`, `public`.`Post`.`createdAt`, `public`.`Post`.`userId` FROM `public`.`Post` WHERE 1=1 ORDER BY `public`.`Post`.`id` ASC LIMIT ? OFFSET ? [20,0]
```
```graphql
query ExampleQuery {
  posts {
    id
    comments {
      id
    }
  }
}

# SELECT `public`.`Post`.`id`, `public`.`Post`.`title`, `public`.`Post`.`content`, `public`.`Post`.`createdAt`, `public`.`Post`.`userId` FROM `public`.`Post` WHERE 1=1 ORDER BY `public`.`Post`.`id` ASC LIMIT ? OFFSET ? [20,0]
# SELECT `public`.`Comment`.`id`, `public`.`Comment`.`content`, `public`.`Comment`.`userId`, `public`.`Comment`.`postId`, `public`.`Comment`.`createdAt` FROM `public`.`Comment` WHERE `public`.`Comment`.`postId` IN (?,?,?,?,?,?,?,?,?,?) [1,2,3,4,5,6,7,8,9,10]
```
With this approach, if `comments` wasn’t requested, no extra query runs. It’s a definite improvement, and for APIs without very deep relationships it might be sufficient. However, as relationships become more complex, maintaining these conditional loads via `info` gets cumbersome. And if we need the same resource in different resolvers, relying on `info` quickly becomes unmanageable.
Introducing DataLoader
```graphql
type Comment {
  id: Int!
  content: String!
}

type Post {
  id: Int!
  title: String!
  content: String!
  comments: [Comment!]!
}

type Query {
  latestPosts: [Post!]!
  bestPosts: [Post!]!
}
```
Now let’s consider a slightly more complex scenario. Suppose we split our queries for `Post` resources into two separate resolvers: one that fetches recent posts and another that fetches popular posts.
```typescript
const postResolver: Resolvers = {
  Query: {
    latestPosts: async (_, __, context, info) => {
      const requestedFields = graphqlFields(info)
      const hasRequestedCommentsField =
        Object.keys(requestedFields).includes('comments')
      const posts = await context.prisma.post.findMany({
        take: 20,
        orderBy: { createdAt: 'desc' },
        ...(hasRequestedCommentsField && { include: { comments: true } }),
      })
      return posts
    },
    bestPosts: async (_, __, context, info) => {
      const requestedFields = graphqlFields(info)
      const hasRequestedCommentsField =
        Object.keys(requestedFields).includes('comments')
      const posts = await context.prisma.post.findMany({
        take: 3,
        orderBy: { views: 'desc' },
        ...(hasRequestedCommentsField && { include: { comments: true } }),
      })
      return posts
    },
  },
}
```
Without DataLoader, we might end up with duplicate queries, for example:
- One query to fetch the 20 latest `Post`s.
- One query to fetch their associated `Comment`s.
- One query to fetch the top 3 `Post`s.
- Another query to fetch their `Comment`s.
In this setup, we’re running two separate `IN` queries for comments, even though we could theoretically batch them into a single query. This is a common scenario in GraphQL: the client can ask for multiple different sets of resources in one go, and it’s up to us to serve all of them efficiently with minimal overhead.
Using DataLoader to Batch Queries
DataLoader is a small yet powerful library designed to solve these batching issues by leveraging the event loop. It’s not GraphQL-specific, but it’s often used alongside GraphQL servers and ORMs. The source code for DataLoader is actually quite short (around 400 lines, including comments). Its main idea is that keys passed to `load()` are collected and resolved together in a single batch function call, reducing the number of queries executed.
When creating a new `DataLoader` instance, you must define a “batch function” that receives an array of keys. This function is responsible for returning the data that corresponds to those keys. For example, if you have a list of `Post` IDs, the batch function fetches all the corresponding `Comment`s in one go and then maps them back to their respective `Post`s in the correct order.
```typescript
import DataLoader from 'dataloader'

const postByIdLoader = new DataLoader(batchPostById)
postByIdLoader.load(3)
postByIdLoader.load(6)
postByIdLoader.load(9)

async function batchPostById(keys: readonly number[]) {
  console.log(keys) // [3, 6, 9]
  return keys.map((key) => ({ id: key })) // must return one result per key
}

const postByIdLoader2 = new DataLoader(batchPostById2)
postByIdLoader2.load('a')
postByIdLoader2.load('b')
postByIdLoader2.load('c')

async function batchPostById2(keys: readonly string[]) {
  console.log(keys) // ['a', 'b', 'c']
  return keys.map((key) => ({ id: key }))
}
```
Let’s revisit the first example scenario. Instead of resolving comments for each post individually, we’ll create a DataLoader instance with a batch function that takes an array of post IDs, queries all matching comments at once, and then distributes the results back to each post.
```typescript
const postResolver: Resolvers = {
  Post: {
    comments: async (parent, __, context) => {
      return await context.prisma.comment.findMany({
        where: { postId: parent.id },
      })
    },
  },
  Query: {
    posts: async (_, __, context) => {
      const posts = await context.prisma.post.findMany({
        take: 20,
      })
      return posts
    },
  },
}
```
```typescript
async function batchCommentByPostId(postIds: number[]) {
  const comments = await prisma.comment.findMany({
    where: { postId: { in: postIds } },
  })
  return postIds.map((postId) =>
    comments.filter((comment) => comment.postId === postId),
  )
}
```
The DataLoader docs recommend creating a new instance per request. This ensures that caching doesn’t leak data between different clients’ requests. In Apollo Server, you can initialize a DataLoader in the `context` function so that each incoming request gets its own instance; Apollo’s documentation also suggests this approach.
```typescript
const server = new ApolloServer({
  schema: makeExecutableSchema({ typeDefs, resolvers }),
  context: ({ req, res }) => {
    return {
      req,
      res,
      loaders: { commentByPostIdLoader: new DataLoader(batchCommentByPostId) },
    }
  },
})
```
Once DataLoader is set up, we can remove the complex `info`-based logic and simply call `commentByPostIdLoader.load(post.id)` in the `comments` resolver. If the user queries 23 posts in total (from various sources), then once all the `comments` resolvers have been invoked, those 23 keys are batched together and a single query fetches all their comments. This eliminates the redundant queries entirely.
```typescript
const postResolver: Resolvers = {
  Post: {
    comments: async (parent, _, context) => {
      const { loaders } = context
      return loaders.commentByPostIdLoader.load(parent.id)
    },
  },
  Query: {
    latestPosts: async (_, __, context) => {
      const posts = await context.prisma.post.findMany({
        take: 20,
        orderBy: { createdAt: 'desc' },
      })
      return posts
    },
    bestPosts: async (_, __, context) => {
      const posts = await context.prisma.post.findMany({
        take: 3,
        orderBy: { views: 'desc' },
      })
      return posts
    },
  },
}
```
What Happens Inside DataLoader?
```graphql
query ExampleQuery {
  latestPosts {
    id
    comments {
      id
    }
  }
  bestPosts {
    id
    comments {
      id
    }
  }
}

# SELECT `public`.`Post`.`id`, `public`.`Post`.`title`, `public`.`Post`.`content`, `public`.`Post`.`createdAt`, `public`.`Post`.`authorId`, `public`.`Post`.`views` FROM `public`.`Post` WHERE 1=1 ORDER BY `public`.`Post`.`createdAt` DESC LIMIT ? OFFSET ? [20,0]
# SELECT `public`.`Comment`.`id`, `public`.`Comment`.`content`, `public`.`Comment`.`userId`, `public`.`Comment`.`postId`, `public`.`Comment`.`createdAt` FROM `public`.`Comment` WHERE `public`.`Comment`.`postId` IN (?,?,?,?,?,?,?,?,?,?) [6,7,10,3,5,8,9,1,2,4]
# SELECT `public`.`Post`.`id`, `public`.`Post`.`title`, `public`.`Post`.`content`, `public`.`Post`.`createdAt`, `public`.`Post`.`authorId`, `public`.`Post`.`views` FROM `public`.`Post` WHERE 1=1 ORDER BY `public`.`Post`.`views` DESC LIMIT ? OFFSET ? [3,0]
```
For those who are curious, here’s a quick look at what happens under the hood:
- When you first call `load()`, DataLoader creates an internal `_batch` object and schedules a batch dispatch for when the event loop is free.
- Each subsequent `load(key)` call adds its key to the `_batch` object.
- When the call stack clears, DataLoader executes its queued batch function (the one you defined when creating the instance) with all the accumulated keys.
- The batch function returns a list of results, and DataLoader resolves all the pending promises simultaneously.
- If the same key is requested multiple times, DataLoader returns the cached result from a previous load, preventing duplicate queries.
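The steps above can be sketched as a toy loader. This is a simplified stand-in for the real library, not its actual implementation: no error handling, no cache eviction, and a plain `queueMicrotask` in place of DataLoader’s more elaborate scheduling.

```typescript
// Toy sketch of DataLoader's core trick: collect keys synchronously,
// then dispatch one batch call once the current call stack clears.
type BatchFn<K, V> = (keys: K[]) => Promise<V[]>

class TinyLoader<K, V> {
  private batch: { key: K; resolve: (value: V) => void }[] = []
  private scheduled = false
  private cache = new Map<K, Promise<V>>()

  constructor(private batchFn: BatchFn<K, V>) {}

  load(key: K): Promise<V> {
    const cached = this.cache.get(key)
    if (cached) return cached // same key twice → same promise, one fetch
    const promise = new Promise<V>((resolve) => {
      this.batch.push({ key, resolve })
      if (!this.scheduled) {
        this.scheduled = true
        // Dispatch after the current call stack clears.
        queueMicrotask(() => this.dispatch())
      }
    })
    this.cache.set(key, promise)
    return promise
  }

  private async dispatch() {
    const batch = this.batch
    this.batch = []
    this.scheduled = false
    const values = await this.batchFn(batch.map((item) => item.key))
    batch.forEach((item, i) => item.resolve(values[i]))
  }
}

// Usage: three loads in one tick become a single batch call.
let calls = 0
const loader = new TinyLoader<number, string>(async (keys) => {
  calls++
  return keys.map((k) => `comments-for-post-${k}`)
})

Promise.all([loader.load(3), loader.load(6), loader.load(3)]).then((results) => {
  console.log(calls, results) // one batch call; duplicate key 3 hits the cache
})
```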
Wrapping Up
We’ve explored several ways to address the N+1 problem in GraphQL, from a straightforward eager-loading strategy, to a more dynamic approach using `info`, and finally to a robust batching solution with DataLoader.
I was inspired to write this because our team recently needed DataLoader, and explaining how and why it’s useful to my colleagues made me question how well I understood it myself. I hope this article helps others facing similar challenges.