
Using Solr for indexing HTML tags with attributes

I have crawled websites using Nutch and pushed the crawled data to Solr. Now I want to search the content inside a specific tag that has a specific attribute value. For example:

 <h><title> title to search </title></h>
 <div id="abc">
     content to search
 </div>
 <div class="efg">
     other content to search
 </div>

I have seen this question (how to parse html with nutch and index specific tag to solr?), but it is not clear enough.

I want to know whether a plugin is already available or whether I need to write a custom plugin from scratch. If I have to write one, I just need a few pointers on handling HTML tags and attributes.

You could use the HTMLStripCharFilterFactory in your analyzer, ahead of the tokenizer.

This char filter strips HTML markup from the input stream. For more information, see the Solr documentation for this filter.
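To make the effect of stripping concrete, here is a rough Python sketch using only the standard library. It is not Solr's implementation and may differ from it in whitespace handling; `TagStripper` and `strip_tags` are hypothetical names used for illustration. It shows that once the markup is removed, the text of both divs ends up in one stream with no tag or attribute boundaries left.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Keep only character data; drop every tag and attribute."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(html):
    """Return the text of `html` with markup removed and whitespace collapsed."""
    stripper = TagStripper()
    stripper.feed(html)
    stripper.close()
    return " ".join("".join(stripper.parts).split())


# Sample markup modeled on the divs from the question.
page = """
<div id="abc">
    content to search
</div>
<div class="efg">
    other content to search
</div>
"""

stripped = strip_tags(page)
print(stripped)
```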

You can implement a Nutch filter (I like the Jericho HTML Parser) that uses DOM manipulation to extract only the parts of the page you need to index. You can use its TextExtractor class to grab clean text (sans HTML tags) for your index. I usually save that data in custom fields.
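The extraction step can be sketched independently of Nutch and Jericho. The Python below is only an illustration of the idea (select elements by tag and attribute value, keep their clean text, store it under a custom field name); it is not a Nutch plugin. The names `AttributeTextExtractor`, `extract`, `abc_text` and `efg_text` are hypothetical, and the sketch assumes well-nested markup.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag; they must not affect depth tracking.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class AttributeTextExtractor(HTMLParser):
    """Collect the text inside elements matching a tag and attribute value."""

    def __init__(self, tag, attr, value):
        super().__init__(convert_charrefs=True)
        self.tag, self.attr, self.value = tag, attr, value
        self.depth = 0      # nesting depth inside the current match (0 = outside)
        self.chunks = []    # text pieces collected for the current match
        self.results = []   # one cleaned string per matched element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
        else:
            # Split on whitespace so class="efg extra" still matches "efg".
            tokens = (dict(attrs).get(self.attr) or "").split()
            if tag == self.tag and self.value in tokens:
                self.depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:
            # Matched element closed: collapse whitespace and store its text.
            self.results.append(" ".join("".join(self.chunks).split()))
            self.chunks = []

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def extract(html, tag, attr, value):
    """Return a list with the clean text of every matching element."""
    parser = AttributeTextExtractor(tag, attr, value)
    parser.feed(html)
    parser.close()
    return parser.results


# Sample markup modeled on the divs from the question.
page = """
<div id="abc">
    content to search
</div>
<div class="efg">
    other content to search
</div>
"""

# Hypothetical custom field names, one per selector.
fields = {
    "abc_text": extract(page, "div", "id", "abc"),
    "efg_text": extract(page, "div", "class", "efg"),
}
print(fields)
```

In a real setup the same logic would live in the Nutch filter, and each extracted value would be added to the document under a field that is also declared in the Solr schema.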

