The best index scheme for polymorphic associations in Postgres

If you want a fast lookup on a database foreign key, you index it. But when your foreign key has two parts — an ID and a type — which should you index?
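For context, Rails models a polymorphic association as a pair of columns on the referencing table: an *_id column holding the foreign key and an *_type column holding the class name of the record it points to. A minimal sketch, using a hypothetical Note model that can belong to either of the Alpha or Omega models used in the tests below:

# Hypothetical example: a Note may belong to either an Alpha or an Omega.
# The notes table carries two association columns: relation_id (bigint) and
# relation_type (the owning record's class name, e.g. "Alpha" or "Omega").
class Note < ApplicationRecord
  belongs_to :relation, polymorphic: true
end

class Alpha < ApplicationRecord
  has_many :notes, as: :relation
end

class Omega < ApplicationRecord
  has_many :notes, as: :relation
end

# A lookup such as alpha.notes boils down to:
#   SELECT * FROM notes WHERE relation_type = 'Alpha' AND relation_id = <alpha.id>
# so both columns are candidates for indexing.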


(Ain’t got time? Skip to the end.)

It turns out there are a lot of options for indexing polymorphic associations. We could index either field alone, neither field, both (independently), or both with a compound index. But which is most efficient? Being the good developer I am, I looked up the answer on StackOverflow and found out…

… that half the answers said to use a compound index starting with ‘type’ and the other half suggested a compound index starting with ‘ID’. Since compound indexes must look up by their first column before accessing the second, these answers aren’t quite equivalent. None of the answers I found included benchmarks so there wasn’t much data to determine which to do. Since I couldn’t find a consensus answer, I decided to run a test for myself.
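For reference, the disagreement boils down to column order within a single multicolumn B-tree index. A rough sketch of the two candidates (the some_refs table name is a placeholder; the column names follow the usual Rails convention):

# Option A: type-major. Because relation_type is the leading column, Postgres
# can also use this index for queries that filter on relation_type alone.
add_index :some_refs, [:relation_type, :relation_id]

# Option B: ID-major. Because relation_id is the leading column, Postgres
# can also use this index for queries that filter on relation_id alone.
add_index :some_refs, [:relation_id, :relation_type]

# A polymorphic lookup filters on both columns, so either index can serve it;
# the benchmark below measures whether the column order matters in practice.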

Methodology

To set up, I created two arbitrarily named tables, ‘alphas’ and ‘omegas’. Then I created a series of ‘index-testing tables’, each of which contained a polymorphic association that could reference either a row of the alpha table or a row of the omega table. The index-testing tables differed only in how the two association columns were indexed. In total I created seven index-testing tables using a migration on my Rails 6.1 app, connected to a local Postgres instance:

  1. Unindexed
  2. Index only on ID
  3. Index only on type
  4. Independent indices on both type and ID
  5. Compound index with ID major, type minor
  6. Compound index with type major, ID minor
  7. Indexed using Rails’ t.references helper, with the polymorphic and index flags set to true. In practice, this generates the same table and index as #6. I tested it separately, but will treat it the same as #6 in the discussion from here on.
class CreatePolymorphicTestTables < ActiveRecord::Migration[6.1]
  def change
    create_table :alphas

    create_table :omegas

    create_table :cmpd_id_type_refs do |t|
      t.bigint :relation_id, null: false
      t.text :relation_type, null: false
    end

    add_index :cmpd_id_type_refs, [:relation_id, :relation_type]

    create_table :cmpd_type_id_refs do |t|
      t.bigint :relation_id, null: false
      t.text :relation_type, null: false
    end

    add_index :cmpd_type_id_refs, [:relation_type, :relation_id]

    create_table :id_only_refs do |t|
      t.bigint :relation_id, null: false, index: true
      t.text :relation_type, null: false
    end

    create_table :type_only_refs do |t|
      t.bigint :relation_id, null: false
      t.text :relation_type, null: false, index: true
    end

    create_table :independent_type_id_refs do |t|
      t.bigint :relation_id, null: false, index: true
      t.text :relation_type, null: false, index: true
    end

    create_table :reference_refs do |t|
      t.references :relation, null: false, polymorphic: true
    end

    create_table :unindexed_refs do |t|
      t.references :relation, null: false, polymorphic: true, index: false
    end
  end
end

Next, I populated my tables, first creating 100 Alphas and 100 Omegas. Then I selected one random Alpha at a time and created a new row in every index-testing table referencing it, and did the same for a random Omega. I repeated this process 50,000 times so that each index-testing table contained 100,000 entries. This ensured that all seven index-testing tables had the same rows, in the same order, referencing the same alphas and omegas. In other words, the tables’ contents were identical.

desc "Populates polymorphic test tables"

# Example call: `rake populate_polymorphic_test` to actually execute
task :populate_polymorphic_test => :environment do |task, args|
start_time = Time.now
puts "-- Started at #{start_time}"

ActiveRecord::Base.transaction do
100.times do |i|
Alpha.create!(id: i+1)
Omega.create!(id: i+1)
end

50000.times do
idx = rand(100) + 1

a = Alpha.find(idx)

UnindexedRef.create!(relation: a)
IdOnlyRef.create!(relation: a)
TypeOnlyRef.create!(relation: a)
IndependentTypeIdRef.create!(relation: a)
CmpdIdTypeRef.create!(relation: a)
CmpdTypeIdRef.create!(relation: a)
ReferenceRef.create!(relation: a)

idx = rand(100) + 1

b = Omega.find(idx)

UnindexedRef.create!(relation: b)
IdOnlyRef.create!(relation: b)
TypeOnlyRef.create!(relation: b)
IndependentTypeIdRef.create!(relation: b)
CmpdIdTypeRef.create!(relation: b)
CmpdTypeIdRef.create!(relation: b)
ReferenceRef.create!(relation: b)
end
end

finish_time = Time.now
puts "-- Finished at #{finish_time}. Elapsed time #{finish_time - start_time} seconds."
end

Last, I ran my queries. For each Alpha and each Omega I looked up all their associations in two tests, clearing the query cache before each. The first test consisted of plucking the ID columns and the second consisted of counting the number of associated records. Both of these operations occur primarily in the database, without causing Rails to instantiate an ApplicationRecord. This means that the lion’s share of the work ought to be database look-ups with less in-memory work in Rails to muddy the waters.

desc "Benchmark queries on polymorphic indexes"

# Example call: `rake run_polymorphic_query_test` to actually execute
task :run_polymorphic_query_test => :environment do |task, args|

test_set = [*Alpha.all, *Omega.all]

ActiveRecord::Base.connection.query_cache.clear

p "---------------------------- Pluck ID Test ----------------------------"
Benchmark.bm do |x|
x.report("Unindexed") { test_set.each{|rl| rl.unindexed_refs.pluck(:id)} }
x.report("ID Only") { test_set.each{|rl| rl.id_only_refs.pluck(:id)} }
x.report("Type Only") { test_set.each{|rl| rl.type_only_refs.pluck(:id)} }
x.report("Independent Indicies") { test_set.each{|rl| rl.independent_type_id_refs.pluck(:id)} }
x.report("Reference (Compound Type-ID)") { test_set.each{|rl| rl.reference_refs.pluck(:id)} }
x.report("Compound Type-ID") { test_set.each{|rl| rl.cmpd_type_id_refs.pluck(:id)} }
x.report("Compound ID-Type") { test_set.each{|rl| rl.cmpd_id_type_refs.pluck(:id)} }
end

ActiveRecord::Base.connection.query_cache.clear

p "---------------------------- Count Test ----------------------------"
Benchmark.bm do |x|
x.report("Unindexed") { test_set.each{|rl| rl.unindexed_refs.count} }
x.report("ID Only") { test_set.each{|rl| rl.id_only_refs.count} }
x.report("Type Only") { test_set.each{|rl| rl.type_only_refs.count} }
x.report("Independent Indicies") { test_set.each{|rl| rl.independent_type_id_refs.count} }
x.report("Reference (Compound Type-ID)") { test_set.each{|rl| rl.reference_refs.count} }
x.report("Compound Type-ID") { test_set.each{|rl| rl.cmpd_type_id_refs.count} }
x.report("Compound ID-Type") { test_set.each{|rl| rl.cmpd_id_type_refs.count} }
end
end

Findings

My intuition was that a compound index with ID major would perform best, because the ID field has the highest cardinality (i.e., the most distinct values) and should therefore make for a more selective index. So was I correct?

Kind of. The best indexing schemes were, from best to slightly less good:

  1. Either Compound Index
  2. Independent Indexes
  3. ID-Only Indexing
# Example Run Results
#
# The `real` timing seems to be the most stable and useful. However,
# don't trust the exact numbers in this example. There was a good
# amount of variation between runs.
#
# ---------------------------- Pluck ID Test ----------------------------
                                   user     system      total        real
Unindexed                      0.219539   0.004578   0.224117 (  1.048511)
ID Only                        0.114350   0.006782   0.121132 (  0.244616)
Type Only                      0.144048   0.011984   0.156032 (  1.002924)
Independent Indices            0.106722   0.007628   0.114350 (  0.223154)
Reference (Compound Type-ID)   0.106546   0.007616   0.114162 (  0.195699)
Compound Type-ID               0.109110   0.008631   0.117741 (  0.198535)
Compound ID-Type               0.110510   0.008175   0.118685 (  0.195146)

# ---------------------------- Count Test ----------------------------
                                   user     system      total        real
Unindexed                      0.102973   0.006801   0.109774 (  0.933632)
ID Only                        0.078064   0.004718   0.082782 (  0.197286)
Type Only                      0.189893   0.002887   0.192780 (  1.047528)
Independent Indices            0.080484   0.003792   0.084276 (  0.184939)
Reference (Compound Type-ID)   0.072188   0.008226   0.080414 (  0.158406)
Compound Type-ID               0.070968   0.008155   0.079123 (  0.152290)
Compound ID-Type               0.060601   0.012225   0.072826 (  0.147477)

Across many runs, the two compound indexes performed very similarly. They were close enough that it doesn’t seem to really matter which you choose. Any discrepancy between them was well within the margin of error of my testing. Some runs, ID-major would win and some would go to type-major.
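If you want to double-check which index Postgres actually picks for a given lookup, ActiveRecord's explain will print the plan. A quick sketch to run in rails console against the test tables (the exact plan and timings depend on your data and Postgres version):

# Compare the plans for one polymorphic lookup on each compound-index table.
alpha = Alpha.first

puts alpha.cmpd_type_id_refs.explain  # should report a scan on the (type, id) index
puts alpha.cmpd_id_type_refs.explain  # should report a scan on the (id, type) index

# For actual execution timings alongside the plan, drop down to EXPLAIN ANALYZE:
puts ActiveRecord::Base.connection.execute(
  "EXPLAIN ANALYZE SELECT id FROM cmpd_type_id_refs " \
  "WHERE relation_type = 'Alpha' AND relation_id = #{alpha.id}"
).values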

Independent indexing was a very close runner-up, but reliably appeared slightly slower than either compound index. ID-only indexing was in the same range, though reliably a bit slower still.

The other two schemes were substantially slower. The type-only and unindexed tables performed about the same as each other, generally 5–6x slower than the other schemes.
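That result tracks with the cardinality intuition from earlier: relation_type only ever holds two values in this data set, so an index on it barely narrows the search. If you want to see the cardinality the planner has measured for each column, pg_stats exposes it. A quick sketch to run in rails console:

# Inspect Postgres's per-column statistics for one of the test tables.
# n_distinct is the planner's estimate of distinct values (negative numbers
# express it as a fraction of the row count). Run ANALYZE first for fresh stats.
ActiveRecord::Base.connection.execute("ANALYZE unindexed_refs")

stats = ActiveRecord::Base.connection.execute(<<~SQL)
  SELECT attname, n_distinct
  FROM pg_stats
  WHERE tablename = 'unindexed_refs'
SQL

stats.each { |row| puts "#{row['attname']}: #{row['n_distinct']}" }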

It is worth noting that before I settled on my final methodology, I tested with only 20,000 records (rather than 100,000) in each index-testing table. At that table size, the difference between the methods was much less distinct. Even a sequential scan on the un-indexed table performed quite similarly to every other scheme.

Conclusions (tl;dr)

If you’re adding an index for your polymorphic association, either compound index will do just fine.

If you’re worried about index size or index-write times, you could instead use an ID-only index and likely won’t notice the difference.
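If you do want to put numbers on that trade-off, Postgres can report per-index sizes directly. A small sketch, assuming the test tables above (the index names are whatever Rails generated for you):

# List every index on the *_refs test tables along with its on-disk size.
# pg_relation_size accepts an index name, and pg_size_pretty formats the bytes.
sql = <<~SQL
  SELECT tablename, indexname,
         pg_size_pretty(pg_relation_size(indexname::regclass)) AS index_size
  FROM pg_indexes
  WHERE tablename LIKE '%refs'
  ORDER BY tablename, indexname
SQL

ActiveRecord::Base.connection.execute(sql).each do |row|
  puts "#{row['tablename']}.#{row['indexname']}: #{row['index_size']}"
end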

On a small (<10,000 rows) table you likely won’t notice any difference at all.

And in all cases, Postgres is wicked fast.

If you — enterprising web-dev that you are — enjoyed this article, you might also learn a quick tip from some of my other articles. And if you’re in the US and looking for employment, consider reading Why you, a software engineer, should sell pet drugs on the internet with me.
