
solr sunspot exact search for words

I have an archive of ~50M tweets and want to see whether users mention each other. There is a problem, though: there is an account called facebook (www.twitter.com/facebook), and I want to find only the tweets that mention this account (@facebook), not those that simply contain the word facebook.

So my syntax using sunspot is:

search = FeedEntry.search do
  without(:person_id, person.id)              # no self-references
  fulltext "@#{person.username}"              # find the feeds that mention this person
  paginate :page => 1, :per_page => 1000000   # page size large enough that results are not paginated
end

Solr seems to ignore the @ sign completely, and wrapping the username in "" or '' makes no difference. A search for the bare word already returns far too many hits:

search = FeedEntry.search{fulltext "facebook -RT"}
=> <Sunspot::Search:{:start=>0, :defType=>"dismax", :fq=>["type:FeedEntry"], :rows=>30, :q=>"facebook -RT", :fl=>"* score", :qf=>"retweeters_text text_text"}>
>> search.total
=> 299525

What can I do? Right now I have to go through those results and use Ruby's include?("@facebook") to sort out the false positives, which is time-consuming.

I have the suspicion that it has to do with the tokenizer factories I am using: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.KeywordTokenizerFactory

My config in the schema.xml is:

<fieldType name="text" class="solr.TextField" omitNorms="false">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StandardFilterFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

I think changing the StandardTokenizerFactory to WhitespaceTokenizerFactory would help in my case. By the way, is there a way to see which tokens these factories produce on my corpus?

My final question: do I need to re-index after changing the tokenizer? My assumption is yes.

Cheers Thomas

If you can parse hash tags, re-tweets, @name etc as you index these tweets and use separate fields in solr, then you will have more powerful search (IMHO).
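As a rough sketch of that idea, the mentions could be extracted in plain Ruby before indexing. The helper name extract_mentions and the regular expression are assumptions made for illustration, not part of Sunspot or Solr. The lookbehind is there so that the "@" inside an email address is not treated as a mention.

```ruby
# Hypothetical helper (not part of Sunspot): pull @mentions out of a tweet body.
# The lookbehind skips an "@" that directly follows a word character,
# so "me@example.com" does not count as a mention of "example".
MENTION_PATTERN = /(?<![A-Za-z0-9_])@([A-Za-z0-9_]+)/

def extract_mentions(text)
  # scan returns one array per match (one capture group each), hence flatten.
  text.scan(MENTION_PATTERN).flatten.map(&:downcase).uniq
end

puts extract_mentions("RT @Facebook: thanks @thomas, see facebook.com").inspect
```

In Sunspot, the result could then be indexed as a multi-valued string field (for example string :mentions, :multiple => true inside the searchable block, where mentions is an assumed method name on FeedEntry) and queried with with(:mentions, person.username.downcase) instead of fulltext. String fields are matched exactly, so the plain word facebook would no longer produce hits.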

Changing to the whitespace tokenizer should help, as you noted, and you will need to reindex. You will also need to use the same tokenizer and analyzer at query time as at index time.

The StandardTokenizerFactory throws out punctuation, with the exception of a period not followed by whitespace. In particular, it throws out "@", so your @name search is doomed (as would be searches for complete email addresses). While ClassicTokenizerFactory preserves email addresses, I believe it still throws out the "@" from @name.

The WhitespaceTokenizerFactory will preserve @name, but it keeps trailing punctuation as part of the token, so "@name" and "@name," are different tokens and it may still not be the right thing for you. You might end up wanting to use the PatternTokenizerFactory, where you specify exactly how you want to parse via regular expressions.
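To make the difference concrete, here is a small Ruby sketch that imitates the two strategies. These functions are rough stand-ins, not Solr's actual tokenizers, and TOKEN_PATTERN is an assumed regular expression you would need to adapt to your data.

```ruby
# Rough stand-in for whitespace tokenization: split on runs of whitespace only.
def whitespace_tokens(text)
  text.split(/\s+/).reject(&:empty?)
end

# Assumed pattern: a word, optionally prefixed by @ or # when that prefix
# does not directly follow another word character (so emails are not mentions).
TOKEN_PATTERN = /(?<!\w)[@#]\w+|\w+/

# Rough stand-in for pattern tokenization where each match becomes a token.
def pattern_tokens(text)
  text.scan(TOKEN_PATTERN)
end

tweet = "thanks @facebook, and facebook too"
puts whitespace_tokens(tweet).inspect
puts pattern_tokens(tweet).inspect
```

With whitespace splitting the mention comes out as "@facebook," (comma included), while the pattern version yields "@facebook" and still keeps it distinct from the plain word "facebook". PatternTokenizerFactory takes pattern and group attributes in schema.xml; treat the exact regular expression above as a starting point to check against your own tweets rather than a finished configuration.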


 