Index raw HTML content using Solr/Lucene

Question

I have some HTML pages that I scraped from the same site at different points in time. The raw data looks like this:

timestamp, htmlcontent(500KB)
..

I have written a parser that extracts a few interesting fields from the HTML, and I am trying to build a search engine based on those parsed fields. To be clear, what I care about is the complete raw HTML content, not just the raw text of the HTML.

Now my data looks like this:

timestamp, htmlcontent, parsedfield1, parsedfield2

I want the user to search by timestamp, parsedfield1, or parsedfield2, and have my search engine return the raw HTML matching the query and render it in the browser, so it feels like a search engine time machine :)

How should I design the index in this case? Which fields should I store and which not? I am following the book "Lucene in Action" and would appreciate help with how to approach this problem.

As I understand it, each field in schema.xml has a few attributes: indexed or not, stored or not. I assume that whatever you want to include in the query result has to be stored. In that case, I have to store the column that contains the raw HTML.

That column is big: one record is usually hundreds of KB, so even a modest number of rows can easily add up to a dataset of almost 1GB. That did not work for me in Solr, and when I tried to index those columns with Lucene directly I ran into a heap size problem.

Here is another idea: maybe I should store parsedfield1, parsedfield2, and a pointer, where the pointer column holds the absolute path of the raw HTML file. In that case I would need to save each HTML page as a separate file, locally or on HDFS. When the user searches on parsedfield1, the index returns the absolute path and I go and retrieve the file.

I have tried to describe the problem as clearly as I can. Could anyone spare a minute to give me some directional guidance?

Much appreciated!

Answer 1

Some guidelines:

1. You need your data in XML, CSV, or JSON format. Here is an example of your data in XML format:

<add>
    <doc>
        <field name="id">01</field>
        <field name="timestamp">somevalue</field>
        <field name="parsedfield1">your data 1</field>
        <field name="parsedfield2">Java data </field>
        <field name="htmlcontent">link to that html file</field>
    </doc>
</add>
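
A document like the one above could be generated from the scraped rows with a short script. The following is only a sketch in Python (standard library only), assuming each row is a dict with the question's column names. The helper name `build_add_xml`, the output directory, and the hash-based id scheme are hypothetical choices for illustration, not anything Solr requires.

```python
import hashlib
import xml.etree.ElementTree as ET
from pathlib import Path


def build_add_xml(records, html_dir):
    """Save each raw HTML snapshot under html_dir and return a Solr <add> XML string.

    `records` is an iterable of dicts with the keys timestamp, htmlcontent,
    parsedfield1 and parsedfield2 (the column names used in the question).
    """
    html_dir = Path(html_dir)
    html_dir.mkdir(parents=True, exist_ok=True)
    add = ET.Element("add")
    for rec in records:
        raw = rec["htmlcontent"].encode("utf-8")
        # Hypothetical id scheme: hash of timestamp + content, so ids are stable and unique.
        doc_id = hashlib.sha1(rec["timestamp"].encode("utf-8") + raw).hexdigest()
        path = html_dir / f"{doc_id}.html"
        path.write_bytes(raw)  # the big HTML lives on disk, not in the index

        doc = ET.SubElement(add, "doc")
        fields = {
            "id": doc_id,
            "timestamp": rec["timestamp"],
            "parsedfield1": rec["parsedfield1"],
            "parsedfield2": rec["parsedfield2"],
            "htmlcontent": str(path.resolve()),  # pointer to the file, not the HTML itself
        }
        for name, value in fields.items():
            ET.SubElement(doc, "field", {"name": name}).text = value
    return ET.tostring(add, encoding="unicode")
```

Because only the absolute path goes into `htmlcontent`, each indexed document stays small no matter how large the original page was.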

2. You need to modify schema.xml:

-- each document should have one unique id
-- for htmlcontent, store only the path to the HTML file, as you need
-- index the other fields for searching only, without storing them

 <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false"/>
 <field name="timestamp" type="text_general" indexed="true" stored="false" />
 <field name="parsedfield1" type="text_general" indexed="true" stored="false"/>
 <field name="parsedfield2" type="text_general" indexed="true" stored="false" />
 <field name="htmlcontent" type="text_general" indexed="true"  stored="true" />

3. You can use post.jar to post all the XML files to Solr, or use the SolrJ API if you need to do it programmatically.
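
If neither post.jar nor SolrJ is convenient, the same XML can also be sent to Solr's update handler over plain HTTP. The sketch below uses only the Python standard library. The base URL and the core name `snapshots` are assumptions for illustration, so adjust them to your installation (older single-core setups may use a different path).

```python
import urllib.request


def make_update_request(add_xml, solr_url="http://localhost:8983/solr/snapshots"):
    """Build (but do not send) a POST request for Solr's XML update handler.

    solr_url is an assumed base URL with a hypothetical core name.
    """
    return urllib.request.Request(
        url=f"{solr_url}/update?commit=true",  # commit=true makes the docs searchable right away
        data=add_xml.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
        method="POST",
    )


def post_to_solr(add_xml, solr_url="http://localhost:8983/solr/snapshots"):
    """Send the <add> document to Solr and return the HTTP status code."""
    with urllib.request.urlopen(make_update_request(add_xml, solr_url)) as resp:
        return resp.status
```

Calling `post_to_solr(xml_string)` requires a running Solr instance. Building the request separately makes it easy to inspect what would be sent before touching the server.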

**Fields to be stored or not**
Fields that you only want to search on do not need to be stored, unless you also want to display them in the results.
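
With only `id` and the `htmlcontent` path stored, showing the page means reading the file that each hit points to. Here is a minimal Python sketch, assuming the query was sent with `wt=json` and that the response has Solr's usual `response.docs` layout. The function name `html_for_hits` is hypothetical.

```python
import json
from pathlib import Path


def html_for_hits(response_json):
    """Yield (id, raw_html) for each document in a Solr JSON select response.

    Assumes every hit carries the stored fields `id` and `htmlcontent`,
    where `htmlcontent` is the path of the saved HTML file.
    """
    docs = json.loads(response_json)["response"]["docs"]
    for doc in docs:
        path = doc["htmlcontent"]
        # A multiValued field comes back as a list; take its first entry.
        if isinstance(path, list):
            path = path[0]
        yield doc["id"], Path(path).read_text(encoding="utf-8")
```

The raw HTML returned for a hit can then be served to the browser as-is, which gives the "time machine" effect described in the question.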
