As the size of your database tables starts to grow, looping through data sets becomes a more expensive operation.
Commonly used tactics such as #find_each and #find_in_batches are great. In addition, it’s also good to be aware of how data is transferred and parsed by the ActiveRecord query interface.
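For reference, those batching helpers look something like this (a minimal sketch against the User model benchmarked later in this post):

# Loads and yields records in batches of 1,000 (the default) rather than all at once
User.find_each do |user|
  # ...
end

# Yields each batch as an array, handy for bulk operations
User.find_in_batches(batch_size: 1_000) do |batch|
  # ...
end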
There are two steps here that start becoming bottlenecks at scale:
- The time it takes to transfer query result data from your DB back to your application
- The time it takes for ActiveRecord to parse the result set and build model objects
One obvious approach is to limit the data you query by selecting only the columns you need instead of blindly selecting all of them. After all, do you really intend to use every column?
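In ActiveRecord terms that means reaching for #select, or #pluck when you don’t need model objects at all:

# Instantiates User objects with only the email attribute populated
User.select(:email)

# Skips model instantiation entirely and returns a plain array of values
User.pluck(:email)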
Let’s take a closer look at how doing that impacts the operation times above. In the examples below I query one of the larger models in my application (User) repeatedly to benchmark the results when selecting all columns versus just one column.
The model is on the medium to large side (16 columns, 2 database indexes, over 10k records), and I find myself iterating through it regularly since it’s one of the cornerstone tables of the application. Benchmarks were pulled from a non-development, Heroku-hosted environment that connects to a remote database, a common setup in most production Rails applications.
1. Data Transfer
In the example below we repeatedly query the database but don’t parse the results into models. This lets us isolate and measure the query time against the database.
As the results below show, selecting a specific column is about 7x more efficient! That makes sense given that we’re returning about 90% less data by selecting 1 column.
require 'benchmark'

N = 1_000
conn = ActiveRecord::Base.connection

Benchmark.bm do |x|
  # Include a random OFFSET in the queries below so cached results are not returned
  x.report("all columns") do
    N.times do
      conn.execute("SELECT * FROM users LIMIT 1000 OFFSET #{(1000 * rand).round};")
    end
  end

  x.report("one column") do
    N.times do
      conn.execute("SELECT email FROM users LIMIT 1000 OFFSET #{(1000 * rand).round};")
    end
  end
end
                  user     system      total        real
all columns   1.920000   2.510000   4.430000 ( 16.687199)
one column    0.370000   0.260000   0.630000 (  2.804193)
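As an aside, #execute returns the raw driver result rather than model objects; on PostgreSQL, for instance, it’s a PG::Result whose rows are plain hashes keyed by column name (the email value below is just illustrative):

result = ActiveRecord::Base.connection.execute("SELECT email FROM users LIMIT 1;")
result.first # => {"email" => "jane@example.com"}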
2. Parsing the Result Set
In the example below we use #find_by_sql to construct the model objects from the query result set, but we execute the same query each time so ActiveRecord and our database should be returning some form of cached results. This lets us isolate and measure just the time it takes to build the model objects.
Again, selecting 1 column is much more efficient, but only by a factor of 1.6x. The overhead of parsing and instantiating model objects is pretty high; while we definitely have less data to parse, the overall time cost of this operation remains high in both cases.
N = 1_000

Benchmark.bm do |x|
  x.report("all columns") do
    N.times do
      User.find_by_sql("SELECT * FROM users LIMIT 1000;")
    end
  end

  x.report("one column") do
    N.times do
      User.find_by_sql("SELECT email FROM users LIMIT 1000;")
    end
  end
end
                  user     system      total        real
all columns  29.530000   2.280000  31.810000 ( 34.459085)
one column   19.110000   1.040000  20.150000 ( 21.719567)
Putting it all together
We’ve learned 2 things:
- It’s preferable to limit the size of the dataset by selecting specific columns
- It’s somewhat speedier to avoid constructing ActiveRecord objects where possible
The above aren’t groundbreaking conclusions, but it’s nice to know the magnitude of how much more or less efficient something is.
So if you’re looking to perform a simple operation based on a few columns, just query the DB directly:
ActiveRecord::Base.connection.execute("SELECT email FROM users;").each do |record|
  email = record["email"]
  # ...
end
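If you’d rather not hand-write SQL, #pluck gets you the same raw values through the query interface. Note that it loads the entire result set into memory at once, so on very large tables you may want to combine it with #in_batches (available in Rails 5+):

# Returns a plain array of values with no model instantiation
User.pluck(:email).each do |email|
  # ...
end

# Batched variant for very large tables
User.in_batches(of: 1_000) do |relation|
  relation.pluck(:email).each do |email|
    # ...
  end
end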
If you need to instantiate model objects, at least be sure to get some gain and limit the column set:
User.select(:email).find_each do |user|
  # ...
end