简体   繁体   中英

Indexing large documents with pg_search and GIN

I'm using Rails 3 to build a forum for myself and my friends (I didn't use one out of the box for a variety of reasons) and I'm at the point where I'm trying to implement full text search for the forum. Nothing fancy - just if someone searches the string "morning", I want to be able to show a list of all forum threads with posts containing the word "morning". I've been using pg_search for searching, but it's slow (5+ seconds) since we're already at 300 forum threads with 200k+ posts, some of which are 4k+ characters just in a single post. So I have this migration for multisearch:

class CreatePgSearchDocuments < ActiveRecord::Migration
  def self.up
    say_with_time("Creating table for pg_search multisearch") do
      create_table :pg_search_documents do |t|
        t.text :content
        t.belongs_to :searchable, :polymorphic => true, :index => true
        t.timestamps null: false
      end
      add_index :pg_search_documents, :content, using: "gin"
      PgSearch::Multisearch.rebuild(Post)
      PgSearch::Multisearch.rebuild(Reply)
    end
  end
end

But it's failing when I run the migration with this error:

PG::ProgramLimitExceeded: ERROR:  index row size 3080 exceeds maximum 2712 for index "index_pg_search_documents_on_content"
HINT:  Values larger than 1/3 of a buffer page cannot be indexed.
Consider a function index of an MD5 hash of the value, or use full text indexing.

So far Googling has gotten me the following:

  • GIN indexes are better than GIST indexes for handling 100,000+ lexemes. This implies to me that GIN indexes should be capable of handling posts that are only 700 words

  • I had a guess that this error was about a single value rather than the length of the document , and was worried this was caused by the fact that I allow a subset of HTML tags in forum posts, so instead of storing post.content I now store post.sanitized_content. This strips all HTML, then replaces punctuation with spaces, then strips duplicates, like so: ActionView::Base.full_sanitizer.sanitize(content).gsub(/[^\\w ]/, ' ').squeeze(" ") . This got the error message down to index row size 2848 exceeds maximum 2712 , so it clearly did something , but not enough.

  • I then sanity checked that pg_search actually allows me to use dynamic methods like that, and it's not just secretly failing silently. According to the docs, "However, if you call any dynamic methods in :against, the following strategy will be used", so they seem to be handled just fine.

Relevant pieces of my implementation of Post:

class Post < ActiveRecord::Base
  include PgSearch

  multisearchable against: [:subject, :sanitized_content]

  def sanitized_content
    ActionView::Base.full_sanitizer.sanitize(content).gsub(/[^\w ]/, ' ').squeeze(" ")
  end
end

(I have also tried remove :subject from the multisearchable-against array, in case it was an unsanitized subject causing the issues; this got me down to row size 2800 in the error, but didn't fix it.)

So... what am I missing? Shouldn't GIN indexes be able to handle large text documents? Do I need to turn my documents into tsvectors first like in this answer ? It keeps suggesting "full text indexing" but I thought that's what this was.

For the sake of future people Googling: tentatively, using

execute "CREATE INDEX idx_fts_search_content ON pg_search_documents USING gin(to_tsvector('english', content))

instead of

add_index :pg_search_documents, :content, using: "gin"

has fixed it. Index so far isn't doing much, it takes a good 8.1s to search everything, but at least the migration runs now!

EDIT: Missed an important thing. The actual command should be:

execute "CREATE INDEX idx_fts_post_content ON posts USING gin(to_tsvector('english', coalesce(\"posts\".\"content\"::text, '')))"

If you don't have the coalesce(), it won't use the index.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM