
How do I index HTML files into Apache SOLR?

By default, Solr accepts XML files. I want to run searches over millions of crawled URLs (HTML pages).

Usually, as a first step I would recommend rolling your own application using SolrJ or similar to handle the indexing, rather than doing it directly with the DataImportHandler.

Just write your application and have it output the contents of those web pages as a field in a SolrInputDocument. I recommend stripping the HTML in that application, because it gives you greater control. Besides, you probably want to pull some of the data out of that page, such as the <title>, and index it into a separate field. An alternative is to use HTMLStripTransformer on one of your fields to make sure HTML is stripped out of anything you send to that field.
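As a rough illustration of that approach (not part of the original answer), here is a minimal SolrJ sketch that strips the HTML itself before indexing. The core name "pages", the field names "id", "title" and "content", and the use of Jsoup as the HTML parser are all assumptions; it presumes SolrJ 6+ and Jsoup on the classpath.

// Minimal sketch: roll-your-own indexer with SolrJ, stripping HTML in the application.
// Assumed: a Solr core named "pages" with fields "id", "title", "content"; Jsoup as one possible HTML stripper.
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class PageIndexer {
    public static void main(String[] args) throws Exception {
        HttpSolrClient solr =
                new HttpSolrClient.Builder("http://localhost:8983/solr/pages").build();

        String url = "http://www.example.com/";  // crawled URL, used as the unique id
        String rawHtml = "<html><head><title>Example</title></head><body>Hello</body></html>";

        Document page = Jsoup.parse(rawHtml);    // strip the HTML ourselves for full control

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", url);
        doc.addField("title", page.title());           // index <title> into its own field
        doc.addField("content", page.body().text());   // plain text of the page body

        solr.add(doc);
        solr.commit();
        solr.close();
    }
}

Doing the stripping in your own code (rather than in a field analyzer) is what lets you route the title, body text, or any other part of the page into different fields.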

How are you crawling all this data? If you're using something like Apache Nutch it should already take care of most of this for you, allowing you to just plug in the connection details of your Solr server.

Solr Cell can accept HTML and index it for full-text search: http://wiki.apache.org/solr/ExtractingRequestHandler

curl "http://localhost:8983/solr/update/extract?literal.id=doc1&commit=true" -F "myfile=@tutorial.html"

You can index downloaded HTML files with Solr just fine.

This was the fastest way for me to do the indexing:

curl "http://localhost:8080/solr/update/extract?stream.file=/home/index.html&literal.id=www.google.com"

Here stream.file is the local path of your HTML file, and literal.id is the URL that index.html was crawled from.
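For many downloaded pages, as in the original question, the same request can simply be repeated per file and committed once at the end. The sketch below is an assumption-heavy illustration: the map from local file path to original URL is hypothetical, the core URL is an example, and it requires Java 9+ for Map.of.

// Sketch: index a batch of downloaded pages via /update/extract, using the original URL as the id.
import java.io.File;
import java.util.Map;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

public class BulkExtractExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical crawl results: local file path -> original URL
        Map<String, String> fileToUrl = Map.of(
                "/home/index.html", "http://www.google.com/");

        HttpSolrClient solr =
                new HttpSolrClient.Builder("http://localhost:8983/solr/collection1").build();

        for (Map.Entry<String, String> e : fileToUrl.entrySet()) {
            ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
            req.addFile(new File(e.getKey()), "text/html"); // local path of the downloaded page
            req.setParam("literal.id", e.getValue());       // original URL becomes the document id
            solr.request(req);
        }

        solr.commit();  // one commit after the whole batch instead of commit=true per request
        solr.close();
    }
}

Committing once after the batch is much faster than committing per document when you are indexing millions of pages.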
