Speeding up ActiveRecord iterations on larger datasets

As the size of your database tables starts to grow, looping through data sets becomes a more expensive operation.

Commonly used tactics such as using #find_each and #find_in_batches are great. In addition to that, it’s also good to be aware of how data is transferred and parsed by the ActiveRecord query interface

There are two steps here which start becoming bottlenecks at scale –

  1. The time it takes to transfer query result data from your DB back to your application
  2. The time it takes for ActiveRecord to parse the result set and build the model object

One obvious approach is to limit the data you query by only selecting the columns in a model you need instead of blindly selecting all columns. After all, do you really intend on using every column?

Let’s take a closer look at how doing that would impact the above operation times. In the examples below I query one of the larger models in my application (User) repeatedly to benchmark the results when selecting all columns versus just one column.

The model is on the medium to large side (16 columns, 2 database indexes, over 10k records) and I find myself iterating through it on a regular basis since it’s one of the cornerstone tables of the application. Benchmarks were pulled from a non-development Heroku-hosted environment that has to connect to a remote database – a common setup in most production Rails applications.

1. Data Transfer

In the below example we repeatedly query the database but don’t parse the result into a model. This lets us isolate and measure the query time against the database.

As you can see, selecting a specific column is about 7x more efficient! It makes sense given that we’re returning about 90% less data by selecting 1 column

                
require 'benchmark'

N = 1_000
conn = ActiveRecord::Base.connection

Benchmark.bm do |x|
  # Include a random OFFSET in below queries so cached results are not returned

  x.report("all columns") do
    N.times do
      conn.execute("SELECT * FROM users LIMIT 1000 OFFSET #{(1000*rand).round};")
    end
  end

  x.report("one column") do
    N.times do
      conn.execute("SELECT email FROM users LIMIT 1000 OFFSET #{(1000*rand).round};")
    end
  end
end


       user     system      total        real
all columns  1.920000   2.510000   4.430000 ( 16.687199)
one column  0.370000   0.260000   0.630000 (  2.804193)
                
              

2. Parsing the Result Set

In the below example we use #find_by_sql to construct the model objects from the query result set, but we execute the same query each time so ActiveRecord and our Database should be returning some form of cached results. This lets us isolate and measure just the time it takes to build the model object.

Again, selecting 1 column is much more efficient, but only by a factor 1.6x. The overhead of parsing and instantiating model objects is pretty high and while we definitely have less data to parse, the overall time cost of this operation remains high in both cases.

                
N = 1_000

Benchmark.bm do |x|
  x.report("all columns") do
    N.times do
      User.find_by_sql("SELECT * FROM users LIMIT 1000;")
    end
  end

  x.report("one column") do
    N.times do
      User.find_by_sql("SELECT email FROM users LIMIT 1000;")
    end
  end
end

       user     system      total        real
all columns 29.530000   2.280000  31.810000 ( 34.459085)
one column 19.110000   1.040000  20.150000 ( 21.719567)
                
              

Putting it all together

We’ve learned 2 things –

  1. It’s preferable to limit the size of the dataset by selecting specific columns
  2. It’s somewhat speedier to avoid constructing ActiveRecord objects where possible

The above aren’t ground breaking conclusions, but it’s nice to know the magnitude of how much more or less efficient something is.

So if you’re looking to perform a simple operation based on a few columns, just query the DB directly

                
ActiveRecord::Base.connection.execute("SELECT email FROM users;").each do |record|
  email = record["email"]
  # ...
end
                
              

If you need to instantiate a model object, at least be sure to get some gain and limit the column set

                
User.select(:email).find_each do |user|
  # ...
end