The best index scheme for polymorphic associations in Postgres

If you want a fast lookup on a database foreign key, you index it. But when your foreign key has two parts — an ID and a type — which should you index?
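For context, Rails models a polymorphic association as a pair of columns on the referencing table: an *_id column holding the foreign key and an *_type column holding the class name of the record it points to. A minimal sketch, using a hypothetical Note model that can belong to either of the Alpha or Omega models used in the tests below:

# Hypothetical example: a Note may belong to either an Alpha or an Omega.
# The notes table carries two association columns: relation_id (bigint) and
# relation_type (the owning record's class name, e.g. "Alpha" or "Omega").
class Note < ApplicationRecord
  belongs_to :relation, polymorphic: true
end

class Alpha < ApplicationRecord
  has_many :notes, as: :relation
end

class Omega < ApplicationRecord
  has_many :notes, as: :relation
end

# A lookup such as alpha.notes boils down to:
#   SELECT * FROM notes WHERE relation_type = 'Alpha' AND relation_id = <alpha.id>
# so both columns are candidates for indexing.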


(Ain’t got time? Skip to the end.)

It turns out there are a lot of options for indexing polymorphic associations. We could index either field alone, neither field, both (independently), or both with a compound index. But which is most efficient? Being the good developer I am, I looked up the answer on StackOverflow and found out…

… that half the answers said to use a compound index starting with ‘type’ and the other half suggested a compound index starting with ‘ID’. Since compound indexes must look up by their first column before accessing the second, these answers aren’t quite equivalent. None of the answers I found included benchmarks so there wasn’t much data to determine which to do. Since I couldn’t find a consensus answer, I decided to run a test for myself.
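For reference, the disagreement boils down to column order within a single multicolumn B-tree index. A rough sketch of the two candidates (the some_refs table name is a placeholder; the column names follow the usual Rails convention):

# Option A: type-major. Because relation_type is the leading column, Postgres
# can also use this index for queries that filter on relation_type alone.
add_index :some_refs, [:relation_type, :relation_id]

# Option B: ID-major. Because relation_id is the leading column, Postgres
# can also use this index for queries that filter on relation_id alone.
add_index :some_refs, [:relation_id, :relation_type]

# A polymorphic lookup filters on both columns, so either index can serve it;
# the benchmark below measures whether the column order matters in practice.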

Methodology

To set up, I created two arbitrarily named tables, ‘alphas’ and ‘omegas’. Then I created a series of ‘index-testing tables’, each of which contained a polymorphic association that could reference either a row of the alpha table or a row of the omega table. The index-testing tables differed only in how the two association columns were indexed. In total I created seven index-testing tables using a migration on my Rails 6.1 app, connected to a local Postgres instance:

  1. Unindexed
  2. Index only on ID
  3. Index only on type
  4. Independent indices on both type and ID
  5. Compound index with ID major, type minor
  6. Compound index with type major, ID minor
  7. Indexed using Rails’ t.references helper, with the polymorphic and index flags set to true. In practice, this generates the same table and index as #6. I tested it separately, but will treat it the same as #6 in the discussion from here on.
class CreatePolymorphicTestTables < ActiveRecord::Migration[6.1]
  def change
    create_table :alphas

    create_table :omegas

    create_table :cmpd_id_type_refs do |t|
      t.bigint :relation_id, null: false
      t.text :relation_type, null: false
    end

    add_index :cmpd_id_type_refs, [:relation_id, :relation_type]

    create_table :cmpd_type_id_refs do |t|
      t.bigint :relation_id, null: false
      t.text :relation_type, null: false
    end

    add_index :cmpd_type_id_refs, [:relation_type, :relation_id]

    create_table :id_only_refs do |t|
      t.bigint :relation_id, null: false, index: true
      t.text :relation_type, null: false
    end

    create_table :type_only_refs do |t|
      t.bigint :relation_id, null: false
      t.text :relation_type, null: false, index: true
    end

    create_table :independent_type_id_refs do |t|
      t.bigint :relation_id, null: false, index: true
      t.text :relation_type, null: false, index: true
    end

    create_table :reference_refs do |t|
      t.references :relation, null: false, polymorphic: true
    end

    create_table :unindexed_refs do |t|
      t.references :relation, null: false, polymorphic: true, index: false
    end
  end
end

Next, I populated my tables, first creating 100 Alphas and 100 Omegas. Then I selected one random Alpha at a time and created a new row in every index-testing table referencing it, and did the same for a random Omega. I repeated this process 50,000 times so that each index-testing table contained 100,000 entries. This ensured that all seven index-testing tables had the same rows, in the same order, referencing the same alphas and omegas. In other words, the tables’ contents were identical.

desc "Populates polymorphic test tables"

# Example call: `rake populate_polymorphic_test` to actually execute
task :populate_polymorphic_test => :environment do |task, args|
start_time = Time.now
puts "-- Started at #{start_time}"

ActiveRecord::Base.transaction do
100.times do |i|
Alpha.create!(id: i+1)
Omega.create!(id: i+1)
end

50000.times do
idx = rand(100) + 1

a = Alpha.find(idx)

UnindexedRef.create!(relation: a)
IdOnlyRef.create!(relation: a)
TypeOnlyRef.create!(relation: a)
IndependentTypeIdRef.create!(relation: a)
CmpdIdTypeRef.create!(relation: a)
CmpdTypeIdRef.create!(relation: a)
ReferenceRef.create!(relation: a)

idx = rand(100) + 1

b = Omega.find(idx)

UnindexedRef.create!(relation: b)
IdOnlyRef.create!(relation: b)
TypeOnlyRef.create!(relation: b)
IndependentTypeIdRef.create!(relation: b)
CmpdIdTypeRef.create!(relation: b)
CmpdTypeIdRef.create!(relation: b)
ReferenceRef.create!(relation: b)
end
end

finish_time = Time.now
puts "-- Finished at #{finish_time}. Elapsed time #{finish_time - start_time} seconds."
end

Last, I ran my queries. For each Alpha and each Omega I looked up all their associations in two tests, clearing the query cache before each. The first test consisted of plucking the ID columns and the second consisted of counting the number of associated records. Both of these operations occur primarily in the database, without causing Rails to instantiate an ApplicationRecord. This means that the lion’s share of the work ought to be database look-ups with less in-memory work in Rails to muddy the waters.

desc "Benchmark queries on polymorphic indexes"

# Example call: `rake run_polymorphic_query_test` to actually execute
task :run_polymorphic_query_test => :environment do |task, args|

test_set = [*Alpha.all, *Omega.all]

ActiveRecord::Base.connection.query_cache.clear

p "---------------------------- Pluck ID Test ----------------------------"
Benchmark.bm do |x|
x.report("Unindexed") { test_set.each{|rl| rl.unindexed_refs.pluck(:id)} }
x.report("ID Only") { test_set.each{|rl| rl.id_only_refs.pluck(:id)} }
x.report("Type Only") { test_set.each{|rl| rl.type_only_refs.pluck(:id)} }
x.report("Independent Indicies") { test_set.each{|rl| rl.independent_type_id_refs.pluck(:id)} }
x.report("Reference (Compound Type-ID)") { test_set.each{|rl| rl.reference_refs.pluck(:id)} }
x.report("Compound Type-ID") { test_set.each{|rl| rl.cmpd_type_id_refs.pluck(:id)} }
x.report("Compound ID-Type") { test_set.each{|rl| rl.cmpd_id_type_refs.pluck(:id)} }
end

ActiveRecord::Base.connection.query_cache.clear

p "---------------------------- Count Test ----------------------------"
Benchmark.bm do |x|
x.report("Unindexed") { test_set.each{|rl| rl.unindexed_refs.count} }
x.report("ID Only") { test_set.each{|rl| rl.id_only_refs.count} }
x.report("Type Only") { test_set.each{|rl| rl.type_only_refs.count} }
x.report("Independent Indicies") { test_set.each{|rl| rl.independent_type_id_refs.count} }
x.report("Reference (Compound Type-ID)") { test_set.each{|rl| rl.reference_refs.count} }
x.report("Compound Type-ID") { test_set.each{|rl| rl.cmpd_type_id_refs.count} }
x.report("Compound ID-Type") { test_set.each{|rl| rl.cmpd_id_type_refs.count} }
end
end

Findings

My intuition was that a compound index with ID major would perform best, because the ID field has the highest cardinality (i.e., the most distinct values) and should therefore make for a more selective index. So was I correct?

Kind of. The best indexing schemes were, from best to slightly less good:

  1. Either Compound Index
  2. Independent Indexes
  3. ID-Only Indexing
# Example Run Results
#
# The `real` timing seems to be the most stable and useful. However,
# don't trust the exact numbers in this example. There was a good
# amount of variation between runs.
#
# ---------------------------- Pluck ID Test ----------------------------
                                   user     system      total        real
Unindexed                      0.219539   0.004578   0.224117 (  1.048511)
ID Only                        0.114350   0.006782   0.121132 (  0.244616)
Type Only                      0.144048   0.011984   0.156032 (  1.002924)
Independent Indices            0.106722   0.007628   0.114350 (  0.223154)
Reference (Compound Type-ID)   0.106546   0.007616   0.114162 (  0.195699)
Compound Type-ID               0.109110   0.008631   0.117741 (  0.198535)
Compound ID-Type               0.110510   0.008175   0.118685 (  0.195146)

# ---------------------------- Count Test ----------------------------
                                   user     system      total        real
Unindexed                      0.102973   0.006801   0.109774 (  0.933632)
ID Only                        0.078064   0.004718   0.082782 (  0.197286)
Type Only                      0.189893   0.002887   0.192780 (  1.047528)
Independent Indices            0.080484   0.003792   0.084276 (  0.184939)
Reference (Compound Type-ID)   0.072188   0.008226   0.080414 (  0.158406)
Compound Type-ID               0.070968   0.008155   0.079123 (  0.152290)
Compound ID-Type               0.060601   0.012225   0.072826 (  0.147477)

Across many runs, the two compound indexes performed very similarly. They were close enough that it doesn’t seem to really matter which you choose. Any discrepancy between them was well within the margin of error of my testing. Some runs, ID-major would win and some would go to type-major.
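If you want to double-check which index Postgres actually picks for a given lookup, ActiveRecord's explain will print the plan. A quick sketch to run in rails console against the test tables (the exact plan and timings depend on your data and Postgres version):

# Compare the plans for one polymorphic lookup on each compound-index table.
alpha = Alpha.first

puts alpha.cmpd_type_id_refs.explain  # should report a scan on the (type, id) index
puts alpha.cmpd_id_type_refs.explain  # should report a scan on the (id, type) index

# For actual execution timings alongside the plan, drop down to EXPLAIN ANALYZE:
puts ActiveRecord::Base.connection.execute(
  "EXPLAIN ANALYZE SELECT id FROM cmpd_type_id_refs " \
  "WHERE relation_type = 'Alpha' AND relation_id = #{alpha.id}"
).values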

Independent indexing was a very close runner-up, but reliably appeared slightly slower than either compound index. ID-only indexing was in the same range, though reliably a bit slower still.

The other two schemes were substantially slower. The type-only and unindexed tables performed about the same as each other, generally 5–6x slower than the other schemes.
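That result tracks with the cardinality intuition from earlier: relation_type only ever holds two values in this data set, so an index on it barely narrows the search. If you want to see the cardinality the planner has measured for each column, pg_stats exposes it. A quick sketch to run in rails console:

# Inspect Postgres's per-column statistics for one of the test tables.
# n_distinct is the planner's estimate of distinct values (negative numbers
# express it as a fraction of the row count). Run ANALYZE first for fresh stats.
ActiveRecord::Base.connection.execute("ANALYZE unindexed_refs")

stats = ActiveRecord::Base.connection.execute(<<~SQL)
  SELECT attname, n_distinct
  FROM pg_stats
  WHERE tablename = 'unindexed_refs'
SQL

stats.each { |row| puts "#{row['attname']}: #{row['n_distinct']}" }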

It is worth noting that before I settled on my final methodology, I tested with only 20,000 records (rather than 100,000) in each index-testing table. At that table size, the difference between the methods was much less distinct. Even a sequential scan on the un-indexed table performed quite similarly to every other scheme.

Conclusions (tl;dr)

If you’re adding an index for your polymorphic association, either compound index will do just fine.

If you’re worried about index size or index-write times, you could instead use an ID-only index and likely won’t notice the difference.
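If you do want to put numbers on that trade-off, Postgres can report per-index sizes directly. A small sketch, assuming the test tables above (the index names are whatever Rails generated for you):

# List every index on the *_refs test tables along with its on-disk size.
# pg_relation_size accepts an index name, and pg_size_pretty formats the bytes.
sql = <<~SQL
  SELECT tablename, indexname,
         pg_size_pretty(pg_relation_size(indexname::regclass)) AS index_size
  FROM pg_indexes
  WHERE tablename LIKE '%refs'
  ORDER BY tablename, indexname
SQL

ActiveRecord::Base.connection.execute(sql).each do |row|
  puts "#{row['tablename']}.#{row['indexname']}: #{row['index_size']}"
end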

On a small (<10,000 rows) table you likely won’t notice any difference at all.

And in all cases, Postgres is wicked fast.

If you — enterprising web-dev that you are — enjoyed this article, you might also learn a quick tip from some of my other articles. And if you’re in the US and looking for employment, consider reading Why you, a software engineer, should sell pet drugs on the internet with me.
