简体   繁体   中英

Search a text for multiple keywords

I need to search a text (around 500 words long) for words in an English dictionary (around 275,000 keywords) to detect non-English words and right now the query I am using is not really optimized which takes more than 10 seconds to execute (there's a words table and a texts table):

SELECT word FROM words WHERE 'The quick brown fox jumps over the lazy dog' LIKE CONCAT( '%', word, '%' );

Got the idea from here .

I have already set the word field as an Index and seen some examples of people storing the text in the database or putting it directly in the query.

Other examples showed people using FULLTEXT search although having 300k words I don't think a FULLTEXT will work, I guess it's good to search with logic +brown +lazy -apple but in my case I don't need much logic.

Another example I've seen is to concate words with the IN (...) clause although having 500m keywords the query would just be insanely long.

Any ideas what to do?

Right now the text is saved as a text field and the words as varchar(50) in InnoDB with utf8_unicode_ci encoding, I've heard InnoDB is slow so I could use MyISAM or any other. I am using MySQL 5.5 although I could update to 5.6 if that helped.

LIKE comparisons are basically just wildcard-capable equality tests. They are not a generic keyword search engine.

WHERE foo LIKE '%ab%' would find any records that contain the literal text ab anywhere in the foo field, they don't look for a or b separately, ab is a single monolithic "word" and that word is searched for in its entirety.

If you want to search for multiple "words" using LIKE , you have to do

WHERE foo LIKE '%a%' OR foo LIKE '%b%' OR etc...

which quickly gets ugly, and extremely inefficient - %... search cannot use indexes.

You'd be better off switching to a fulltext search system instead, where you can have the far simpler

WHERE MATCH(foo) AGAINST ('a b')

InnoDB can get pretty slow when you start getting into the millions of records. This is largely due to how it locks the rows when accessing the table.

I would use MyIsam so you can do a FULLTEXT search. Perhaps something like:

    select word from words where match(text) against(word)

I am not sure of the efficiency, but you do not really need to use the logic you were talking about, I don't think.

EDIT:

My code would really need to have more than one pass to work, as the second parameter really needs to be all of the words. You could use a FOR loop inside SQL to populate it, I suppose, but I will have to think about the code to do that. Probably a cursor, or stored procedure would do the trick.

I DO agree with the other answer, though, that you need to use a FULLTEXT search.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM