
How to save fetched html content to database in apache nutch?

I'm using Apache Nutch 1.8. I want to save the crawled HTML content to a PostgreSQL database. To do this, I modified the FetcherThread.java class as shown below.

    case ProtocolStatus.SUCCESS:        // got a page
      pstatus = output(fit.url, fit.datum, content, status,
          CrawlDatum.STATUS_FETCH_SUCCESS, fit.outlinkDepth);
      updateStatus(content.getContent().length);
      /* added my code here */

But I want to use the plug-in system instead of modifying the FetcherThread class directly. Which extension points do I need to use?

You could write a custom plugin that implements the org.apache.nutch.indexer.IndexWriter extension point and sends the documents to Postgres as part of the indexing step. You'll need to index the raw content, which requires NUTCH-2032; this landed in Nutch 1.11, so you will need to upgrade your version of Nutch.
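A minimal sketch of such a plugin might look like the following. This is illustrative only: the class name PostgresIndexWriter, the table nutch_pages, and the postgres.* configuration property names are my own inventions, the exact IndexWriter method signatures vary between Nutch releases (check the interface in your version), and the sketch assumes the PostgreSQL JDBC driver is on the classpath.

```java
package org.example.indexer.postgres;

import java.io.IOException;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.indexer.IndexWriter;
import org.apache.nutch.indexer.NutchDocument;

/** Hypothetical IndexWriter that upserts fetched pages into Postgres. */
public class PostgresIndexWriter implements IndexWriter {

  private Configuration conf;
  private Connection connection;
  private PreparedStatement upsert;

  @Override
  public void open(Configuration conf) throws IOException {
    try {
      // Connection settings would normally be read from nutch-site.xml;
      // the property names here are assumptions, not standard Nutch keys.
      connection = DriverManager.getConnection(
          conf.get("postgres.url", "jdbc:postgresql://localhost:5432/nutch"),
          conf.get("postgres.user", "nutch"),
          conf.get("postgres.password", ""));
      upsert = connection.prepareStatement(
          "INSERT INTO nutch_pages (url, content) VALUES (?, ?) "
          + "ON CONFLICT (url) DO UPDATE SET content = EXCLUDED.content");
    } catch (SQLException e) {
      throw new IOException(e);
    }
  }

  @Override
  public void write(NutchDocument doc) throws IOException {
    try {
      upsert.setString(1, String.valueOf(doc.getFieldValue("url")));
      // The raw-content field is only available once NUTCH-2032 is in place
      // and the indexing job is configured to keep the fetched content.
      upsert.setString(2, String.valueOf(doc.getFieldValue("content")));
      upsert.executeUpdate();
    } catch (SQLException e) {
      throw new IOException(e);
    }
  }

  @Override
  public void update(NutchDocument doc) throws IOException {
    write(doc); // the upsert handles both insert and update
  }

  @Override
  public void delete(String key) throws IOException {
    // Left as an exercise: DELETE FROM nutch_pages WHERE url = key
  }

  @Override
  public void commit() throws IOException {
    // JDBC auto-commit is on by default; nothing extra to do here.
  }

  @Override
  public void close() throws IOException {
    try {
      connection.close();
    } catch (SQLException e) {
      throw new IOException(e);
    }
  }

  @Override
  public void setConf(Configuration conf) { this.conf = conf; }

  @Override
  public Configuration getConf() { return conf; }
}
```

You would also need the usual plugin packaging (a plugin.xml declaring the extension of org.apache.nutch.indexer.IndexWriter, plus an entry in plugin.includes in nutch-site.xml) for Nutch to pick the writer up.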

Alternatively, you could write a custom MapReduce job that takes a segment as input, reads the fetched content, and sends it to your DB in the reduce step.
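The segment approach can be sketched like this. It relies on the fact that a Nutch segment stores the fetched content under its content/ subdirectory as SequenceFiles of <Text, Content> pairs; the job name, class names, and the doInsert helper are hypothetical, and the JDBC details are elided for brevity.

```java
package org.example.export;

import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;
import org.apache.nutch.protocol.Content;

/** Hypothetical job: export a segment's fetched content to a database. */
public class SegmentToPostgres {

  public static class ExportReducer
      extends Reducer<Text, Content, NullWritable, NullWritable> {
    @Override
    protected void reduce(Text url, Iterable<Content> values, Context ctx) {
      for (Content content : values) {
        String html = new String(content.getContent(), StandardCharsets.UTF_8);
        // doInsert is a placeholder for your own JDBC upsert, e.g. the
        // prepared statement shown in the IndexWriter sketch above.
        doInsert(url.toString(), html);
      }
    }

    private void doInsert(String url, String html) {
      // open a connection in setup(), execute the upsert here, close in cleanup()
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "segment-to-postgres");
    job.setJarByClass(SegmentToPostgres.class);
    // args[0] is the segment directory; content records live under content/.
    FileInputFormat.addInputPath(job, new Path(args[0], Content.DIR_NAME));
    job.setInputFormatClass(SequenceFileInputFormat.class);
    job.setMapperClass(Mapper.class); // identity mapper
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(Content.class);
    job.setReducerClass(ExportReducer.class);
    job.setOutputFormatClass(NullOutputFormat.class); // all output goes to the DB
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Doing the export in the reducer (rather than the mapper) groups records by URL, so you open far fewer database connections and can batch the inserts.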

