简体   繁体   中英

How do I build a query in Ruby on Rails that joins on the max of a has_many relation only and includes a select filter on that relation?

I'm struggling how to have Ruby on Rails do this query right... in short: to join on a has_many relation but only via the most recent record in that relation and then can apply a filter/select on that relation.

Here's a super simple variant that captures my struggle:


Let's say I have a table of Employees and a table of Employments . An employee has_many employments . An employment has a status of :active or :inactive .

class Employee < ActiveRecord::Base
  has_many :employments
end

class Employment < ActiveRecord::Base
  belongs_to :employee
end

To keep it simple, let's say there is one employee : Dan and he has two employments : an old one (by created_at ) that is :inactive and a new one that is :active .

dan = Employee.create(name: 'Dan')
Employment.create(employee: dan, created_at: 2.years.ago, status: :inactive)
Employment.create(employee: dan, created_at: 3.months.ago, status: :active)

So in effect, you could say: "Dan has worked twice and is currently actively employed."

What I want is the Rails query for saying: "find me the employees who are inactive". And that should return an empty set because Dan's latest employment is :active . So I can't just do: Employee.joins(:employments).where(employments: { status: :inactive }) because it would match the old employment and thus return the Dan employee record.

I need a way to say: "find the employees who are inactive based on the most recent employment record only".

But I don't know how to do that in Rails.

I feel like I'm missing something... that it should be quite simple... but I can't figure it out.

Thanks!

I struggled quite a bit with the exact same issue in an application with a huge amount of rows and after trying various novel solutions like lateral joins and subqueries the best performing and by far simplest solution was just to add a foreign key to the table that points to the latest row and use an association callback (or a db trigger ) to set the foreign key.

class AddLatestEmploymentToEmployees < ActiveRecord::Migration[6.0]
  def change
    add_reference :employees, :latest_employment, foreign_key: { to_table: :employments }
  end
end

class Employee < ActiveRecord::Base
  has_many :employments, after_add: :set_latest_employment
  belongs_to :latest_employment, 
    class_name: 'Employment',
    optional: true

  private
  def set_latest_employment(employment)
    update_column(:latest_employment_id, employment.id)
  end 
end

Employee.joins(:latest_employment)
        .where(employments: { status: :active })

It really shines if the amount of associated records is huge like it was in my case as you can eager load the latest record without the memory issues which occur if you load the entire has_many association.

One alternative is to use a LATERAL JOIN which is a Postgres 9.3+ specific feature which can be described as something like a SQL foreach loop.

class Employee < ApplicationRecord
  has_many :employments
  def self.in_active_employment
    lat_query = Employment.select(:status)
                      .where('employee_id = employees.id') # lateral reference
                      .order(created_at: :desc)
                      .limit(1)
    joins("JOIN LATERAL(#{lat_query.to_sql}) ce ON true")
      .where(ce: { status: 'active' })
  end
end

This fetches the latest row from employments and then uses this in the WHERE clause to filter the rows from employees.

SELECT "employees".* FROM "employees" 
JOIN LATERAL(
  SELECT "employments"."status" 
  FROM "employments" 
  WHERE (employee_id = employees.id) 
  ORDER BY "employments"."created_at" DESC 
  LIMIT 1
) ce  ON true 
WHERE "ce"."status" = $1 LIMIT $2 

This is going to be extremely fast in comparison to a WHERE id IN subquery if the data set is large. Of course the cost is limited portability.

The simplest solution (based on code complexity) I can think of is first fetching the employment ids with their maximum values, then compsing a new query with the result.

attributes = %i[employee_id created_at]
employments = Employment.group(:employee_id).maximum(:created_at)
              .map { |values| Employee.where(attributes.zip(values).to_h) }
              .reduce(Employment.none, :or)
              .where(status: :inactive)

employees = Employee.where(id: employments.select(:employee_id))

This should produce the following SQL:

SELECT employments.employee_id, MAX(employments.created_at)
FROM employments
GROUP BY employments.employee_id

With the result the following query is build:

SELECT employees.*
FROM employees
WHERE employees.id IN (
  SELECT employments.employee_id 
  FROM employments
  WHERE (
    employments.employee_id = ? AND employments.created_at = ?
    OR employments.employee_id = ? AND employments.created_at = ?
    OR employments.employee_id = ? AND employments.created_at = ?
    -- ...
  ) AND employments.status = 'inactive'
)

The above method doesn't hold up well for large amounts of records, since the query grows for each additional employee. It becomes a lot easier when we can assume the higher id is made last. In that scenario the following would do the trick:

employment_ids = Employment.select(Employment.arel_table[:id].maxiumum).group(:employee_id)
employee_ids = Employment.select(:employee_id).where(id: employment_ids, status: :inactive)
employees = Employee.where(id: employee_ids)

This should produce a single query when employees is loaded.

SELECT employees.*
FROM employees
WHERE employees.id IN (
  SELECT employments.employee_id 
  FROM employments
  WHERE employments.id IN (
    SELECT MAX(employments.id)
    FROM employments
    GROUP BY employments.employee_id
  ) AND employments.status = 'inactive'
)

This solution works a lot better with larger datasets but you might want to look into the answer of max for better lookup performance.

+1 to @max's answer.

An alternative though is to add a start_date and end_date attribute to Employment . To get active employees, you can do

Employee
  .joins(:employments)
  .where('end_date is NULL OR ? BETWEEN start_date AND end_date', Date.today)

After fiddling for a while (and trying all these suggestions you all came up with, plus some others), I came up with this. It works, but maybe isn't the most elegant.

inner_query = Employment.select('distinct on(employee_id) *').order('employee_id').order('created_at DESC')
employee_ids = Employee.from("(#{inner_query.to_sql}) as unique_employments").select("unique_employments.employee_id").where("unique_employments.status='inactive'")
employees = Employee.where(id: employee_ids)

The inner query returns a collection of unique employments... the latest for each employee. Then based on that I pull the employee IDs that match the status. And last, find those employee records from the IDs

I don't love it, but it's understandable and does work.

I really appreciate all the input.

One big take-away for me (and anyone else who lands across this same/similar problem): max's answer helped me realize the struggle I was having with this code is a "smell" that the data isn't modeled in an ideal way. Per max's suggestion, if the Employee table has a reference to the latest Employment , and that's kept up-to-date and accurate, then this becomes trivially easy and fast.

Food for thought.

Since the title includes ARel . The following should work for your example:

employees = Employee.arel_table
employments = Employment.arel_table
max_employments = Arel::Table.new('max_employments')
e2 = employments.project(
      employments['employee_id'], 
      employments['id'].maximum.as('max_id')
     ).group(employments['employee_id'])
me_alias = Arel::Nodes::As.new(e2,max_employments)

res = employees.project(Arel.star)
      .join(me_alias).on(max_employments['employee_id'].eq(employees['id'])).
      .join(employments).on(employments['id'].eq(max_employments['max_id']))


Employee.joins(*res.join_sources)
  .where(employments: {status: :inactive})

This should result in the following

SELECT employees.* 
FROM employees 
INNER JOIN (
    SELECT 
       employments.employee_id, 
       MAX(employments.id) AS max_id 
    FROM employments 
    GROUP BY employments.employee_id
    ) AS max_employments ON max_employments.employee_id = employees.id 
INNER JOIN employments ON employments.id = max_employments.max_id
WHERE 
  employments.status = 'inactive'

In my opinion you can get those max dates first to sure not getting old records and then just filter for the required status. Here was the example of doing first part of it

https://stackoverflow.com/a/18222124/10057981

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM