
Indexing HTML with solr

I am crawling our large website(s) with Nutch and then indexing with Solr, and the results are pretty good. However, there are several menu structures across the site that get indexed and spoil the results of a query.

Each of these menus is clearly delimited by a DIV, for example <div id="RHBOX"> ... </div> or <div id="calendar"> ... </div>, and there are several others.

I need to, at some point, delete the content in these DIVs.

I am guessing that the right place is during indexing by Solr, but I cannot work out how.

A pattern would look something like (<div id="calendar">).*?(<\\/div>) but I cannot get that to work in <tokenizer class="solr.PatternTokenizerFactory" pattern="(<div id="calendar">).*?(<\\/div>)" />, and I am not really sure where to put it in schema.xml.

When I do put that pattern in, schema.xml does not parse.


Have you looked at the different HTML tokenizers available within Solr?

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripWhitespaceTokenizerFactory

They should help you resolve this issue. You should not index the HTML tags themselves. However, if you need to uniquely identify certain tags, then you will need to create individual fields and store the contents of those special tags in those fields.
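As a rough illustration of where such a pattern could live in schema.xml, here is a minimal sketch that uses a charFilter rather than a tokenizer. It assumes the raw HTML actually reaches Solr; the field type name text_nomenus and the choice of tokenizer and lowercase filter are placeholders, not something taken from the question. The likely reason the original line failed to parse is that the quotes and angle brackets inside the pattern attribute were not escaped, so they are written here as &quot;, &lt; and &gt;.

```xml
<!-- Sketch only: "text_nomenus" is a hypothetical field type name. -->
<fieldType name="text_nomenus" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Remove the calendar menu before tokenizing. Inside an XML attribute the
         characters < and " have to be escaped; (?s) lets . match line breaks. -->
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="(?s)&lt;div id=&quot;calendar&quot;&gt;.*?&lt;/div&gt;"
                replacement=""/>
    <!-- Then strip whatever HTML markup is left. -->
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Two caveats apply. The non-greedy .*? stops at the first closing </div>, so a menu that contains nested DIVs would only be partly removed. And if Nutch has already reduced the page to plain text before sending it to Solr, the pattern has nothing to match and the filtering would have to happen on the Nutch side instead. A similar charFilter line would be needed for each menu id, such as RHBOX.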
