
Indexing HTML with solr

Original 2011-04-08 12:43:06 2 1 solr/ design-patterns/ nutch

I am crawling our large website(s) with Nutch and then indexing with Solr, and the results are pretty good. However, several menu structures across the site get indexed and spoil the results of a query.

Each of these menus is clearly delimited by a DIV, for example <div id="RHBOX"> ... </div> or <div id="calendar"> ... </div>, and there are several others.

I need to, at some point, delete the content in these DIVs.

I am guessing that the right place is during indexing by Solr, but I cannot work out how.

A pattern would look something like (<div id="calendar">).*?(<\\/div>), but I cannot get that to work in <tokenizer class="solr.PatternTokenizerFactory" pattern="(<div id="calendar">).*?(<\\/div>)" />, and I am not really sure where to put it in schema.xml.

When I do put that pattern in, schema.xml does not parse.


1 answer

Have you looked at the different HTML tokenizers available within Solr?

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripWhitespaceTokenizerFactory

They should help you resolve this issue. You should not index the HTML tags themselves. However, if you need to uniquely identify certain tags, then you will need to create individual fields and store the contents of those special tags in those fields.
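For illustration, here is a rough sketch of how this might look in schema.xml. It is not taken from the answer above. It assumes the field really contains raw HTML when it reaches Solr, and that the Solr version in use ships solr.PatternReplaceCharFilterFactory and solr.HTMLStripCharFilterFactory. The field type name text_html_stripped is made up for the example. The pattern from the question most likely breaks parsing because it puts literal < and " characters inside a double-quoted XML attribute; they have to be written as &lt; and &quot;.

```xml
<!-- Sketch only: the name "text_html_stripped" is hypothetical. -->
<fieldType name="text_html_stripped" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Char filters see the raw text before the tokenizer does.
         The pattern is the one from the question with < and " escaped
         so that schema.xml stays well-formed; (?s) lets . match newlines. -->
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="(?s)&lt;div id=&quot;calendar&quot;&gt;.*?&lt;/div&gt;"
                replacement=" "/>
    <!-- Strip whatever HTML markup is left. -->
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Two caveats with this approach: the non-greedy .*? stops at the first </div>, so a menu that contains nested DIVs would only be partly removed, and char filters change what gets indexed, not the stored value of the field. A similar charFilter line would be needed for each menu id, such as RHBOX.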



solution1 -1 2011-04-08 17:49:36
