Know Your Tools: ActiveRecord Collections

Ryan Briones, Feb 22, 2011

tl;dr Don’t do in X (Ruby) what Y (your database) is meant to do.

A cautionary tale

A friend showed me a piece of Rails/ActiveRecord code he was replacing today that, though I was disappointed to see, wasn’t surprising in the landscape of existing Rails application standards. Given a fictional example:

class Location < ActiveRecord::Base
has_many :checkins
end

class Checkin < ActiveRecord::Base
belongs_to :venue
end

Someone had written the following code (translated to our example):

venue.checkins.map { |record| record.user_id }.slice(0..limit)

The end goal of this code is to get the first limit (we’ll come back to this) user_ids from a venue’s checkins data. This is all well and good while your friends are beta testing your hot new checkin-based startup, but what happens when we throw this query in an existing application with a large dataset?

First, this code is going to issue what essentially is a SELECT * FROM checkins WHERE venue_id = ? SQL statement. This is unbounded query so if your venue has had 1.5 million checkins, then the result set from your database is going to include all 1.5 million records. For simplicity’s sake, let’s say those results are represented by an array of arrays of Strings and Fixnums (probably true). Next each of those 1.5 million elements in the results array are converted into a Checkin object; all of the elements of the record are mapped to attributes of your object while being typecast to the appropriate types represented by the column type. We are left with an array of Checkin objects being returned from #checkins.

Next, we iterate over this array of Checkin objects using #map (which is an alias for #collect) and extract the user_id attribute from the object since that’s all our next bit of code (Rails view/report/aggregation method) cares to use. We produce an array of 1.5 million Fixnums.

Last, we call #slice on the array of Fixnums. We’re passing a range to say which portion of the array we care about. In this case it’s the first limit + 1. I say specifically limit + 1 because we’re using an inclusive range operator (..). If you specify limit = 5, this evaluates to 0..5 and when using 0-indexed arrays, like in Ruby, this returns the first 6 results.

Use less Ruby! Wait.. what?

Summary: Running the above code we’ve pulled 1.5 million rows from our database, converted them to Checkin objects, iterated over all 1.5 million Checkins to pluck out the user_id and then we truncate our results down to 6. So what’s wrong with this? It seems like perfectly valid Ruby code and it reads well too.

Hidden amongst all that Ruby code is a tool that not only was designed for these kinds of operations, it is actually optimized for them. It’s your database! We can easily update this code to push these costly operations off to the database with just a few changes:

venue.checkins.all(:select => "user_id", :limit => limit)

Now instead of 1.5 million records being returned from our database, we’re only returning five. On top of that because of the :select => "user_id", our result set is not only going to contain the user_id for each record. Our SQL will probably look something more like this:

SELECT user_id FROM checkins WHERE venue_id = ? LIMIT 5

Less data returned in our result set (both total size and record size), the less ActiveRecord has to convert and typecast for us. We’re also able to remove the Ruby manipulation of our data. We still will be creating 5 Checkin objects which will have 1 attribute: user_id.

Extra Credit

Taking optimization a step further, when needed we could make our query:

venue.checkins.all(:select => "DISTINCT user_id", :limit => limit)

This will only return unique user_ids that have checked in to our venue. In the “user checkin” domain, I venture to guess this has a strong possibility of being significantly less data.

Last when I personally see a query like this, I choose not to fear the SQL and do:

Checkins.connection.select_values("SELECT DISTINCT user_id FROM checkins WHERE venue_id = #{venue.id} LIMIT 5")

Now the final result is just an array of user_id Fixnums. This is probably needed in the case where we need to select a much larger result set or we need to do a large calculation or aggregation against the data. This case of LIMIT 5 is definitely overkill.

Leverage your strengths

You could take this as a tutorial on how to optimize your ActiveRecord queries, and you might get a little of that information out of this. What I hope you take a way is the importance of knowing your tools and environment. In the case of this database-driven Rails application that’s know what Ruby and your database do, how big your dataset is and what kind of data we actually need.

Can ActiveRecord get 1.5 million rows of data from your database, manipulate and slice your data for you? It will certainly try and probably succeed. In our example we sacrificed zero expressability for an efficiency our environment required. If we didn’t really need it (the application wasn’t blowing up), we’ve made our code more robust using the tools intended for these operations.

So when the time comes, and our startup blows up at SXSW, we’ll get less monitoring alerts.

Rss-icon Rss-icon-over
Archive

Archive