I have an archive of ~50 M tweets and want to find the ones in which users mention each other. There is a problem, though: there is an account called facebook (www.twitter.com/facebook), and I want to find the tweets that mention this account, not the ones that simply contain the word facebook.
My search using Sunspot is:
search = FeedEntry.search do
  without(:person_id, person.id)              # no self-referencing
  fulltext "@#{person.username}"              # find the feed entries that mention this person
  paginate :page => 1, :per_page => 1000000   # make sure we don't paginate
end
Solr seems to ignore the @ sign completely; even putting the username in "" or '' in the search makes no difference.
search = FeedEntry.search{fulltext "facebook -RT"}
=> <Sunspot::Search:{:start=>0, :defType=>"dismax", :fq=>["type:FeedEntry"], :rows=>30, :q=>"facebook -RT", :fl=>"* score", :qf=>"retweeters_text text_text"}>
>> search.total
=> 299525
What can I do? Right now I have to go through those results and use Ruby's include?("@facebook") to sort out the false positives, which is time-consuming.
I suspect it has to do with the tokenizer factories I am using: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.KeywordTokenizerFactory
My config in the schema.xml is:
<fieldType name="text" class="solr.TextField" omitNorms="false">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
I think changing StandardTokenizerFactory to WhitespaceTokenizerFactory would help in my case. By the way, is there a way to see which tokens these factories produce on my corpus?
My final question: do I need to re-index after changing the tokenizer? My assumption is yes.
Cheers Thomas
If you can parse hashtags, retweets, @name etc. as you index these tweets and store them in separate fields in Solr, you will have a more powerful search (IMHO).
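To illustrate the separate-fields idea, here is a minimal Ruby sketch of pulling mentions and hashtags out of a tweet before it is indexed. The helper names (extract_mentions, extract_hashtags) and the regular expressions are assumptions made for this example, not part of Sunspot or Solr, and the username pattern (letters, digits, underscore) is a simplification of Twitter's actual rules.

```ruby
# Hypothetical helpers for index-time parsing (names are illustrative only).
# The negative lookbehind skips an "@" or "#" that directly follows a word
# character, so e-mail addresses such as a@b.com are not counted as mentions.
MENTION_RE = /(?<![A-Za-z0-9_])@([A-Za-z0-9_]+)/
HASHTAG_RE = /(?<![A-Za-z0-9_])#([A-Za-z0-9_]+)/

# Returns the lowercased, de-duplicated usernames mentioned in the text.
def extract_mentions(text)
  text.scan(MENTION_RE).flatten.map(&:downcase).uniq
end

# Returns the lowercased, de-duplicated hashtags used in the text.
def extract_hashtags(text)
  text.scan(HASHTAG_RE).flatten.map(&:downcase).uniq
end

tweet = "RT @Facebook: hi @bob, mail a@b.com @bob"
puts extract_mentions(tweet).inspect
puts extract_mentions("I like facebook").inspect
```

In Sunspot the result could feed a multi-valued string field (for example `string :mentions, :multiple => true` with a block that calls the helper) and be queried with `with(:mentions, person.username.downcase)` instead of a fulltext search. Since string fields are not tokenized, the "@" problem would not arise there.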
Changing to the whitespace tokenizer should help, as you noted, and you will need to reindex. You will also need to use the same tokenizer and analyzer at search time.
The StandardTokenizerFactory throws out punctuation, with the exception of a period not followed by whitespace. In particular, it throws out "@", so your @name search is doomed (as would be searches for complete email addresses). While ClassicTokenizerFactory preserves email addresses, I believe it still throws out the "@" from @name.
The WhitespaceTokenizerFactory will preserve @name, but it will treat it differently if it is followed by a comma (@name is not the same as @name,), so it may still not be the right thing for you. You might end up wanting to use the PatternTokenizerFactory, where you specify exactly how you want to parse via regular expressions.
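The comma problem is easy to see with a rough Ruby imitation of the two approaches. This is only a sketch: the two functions are hypothetical stand-ins, not Solr code, and the regular expression is just one assumed pattern that a pattern-based tokenizer could be configured with.

```ruby
# Rough stand-in for whitespace tokenization plus lowercasing:
# tokens are split on whitespace only, so trailing punctuation stays attached.
def whitespace_tokens(text)
  text.downcase.split
end

# Rough stand-in for pattern-based tokenization (assumed pattern):
# a token is a run of word characters, optionally led by "@" or "#".
def pattern_tokens(text)
  text.downcase.scan(/[@#]?\w+/)
end

tweet = "Thanks @facebook, I like facebook"
puts whitespace_tokens(tweet).inspect  # "@facebook," keeps its comma
puts pattern_tokens(tweet).inspect     # "@facebook" and "facebook" are distinct tokens
```

With the pattern version a query for @facebook matches the mention but not the bare word. The same pattern would split an address like a@b.com into a, @b and com, so it would need tightening if email addresses matter.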