
How to save fetched html content to database in apache nutch?

I'm using Apache Nutch 1.8. I want to save the crawled HTML content to a PostgreSQL database. To do this, I modified the FetcherThread.java class as shown below.

    case ProtocolStatus.SUCCESS:        // got a page
      pstatus = output(fit.url, fit.datum, content, status,
          CrawlDatum.STATUS_FETCH_SUCCESS, fit.outlinkDepth);
      updateStatus(content.getContent().length);
      /* added my code here */

But I want to use the plug-in system instead of modifying the FetcherThread class directly. Which extension points do I need to use?

You could write a custom plugin that implements the org.apache.nutch.indexer.IndexWriter extension point and sends the documents to Postgres as part of the indexing step. You'll need to index the raw content, which requires NUTCH-2032; this landed in Nutch 1.11, so you will need to upgrade your version of Nutch.
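A minimal sketch of such a plugin might look like the following. This is illustrative only: the class name PostgresIndexWriter, the table nutch_pages, and the postgres.* configuration property names are my own inventions, the exact IndexWriter method signatures vary between Nutch releases (check the interface in your version), and the sketch assumes the PostgreSQL JDBC driver is on the classpath.

```java
package org.example.indexer.postgres;

import java.io.IOException;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.indexer.IndexWriter;
import org.apache.nutch.indexer.NutchDocument;

/** Hypothetical IndexWriter that upserts fetched pages into Postgres. */
public class PostgresIndexWriter implements IndexWriter {

  private Configuration conf;
  private Connection connection;
  private PreparedStatement upsert;

  @Override
  public void open(Configuration conf) throws IOException {
    try {
      // Connection settings would normally be read from nutch-site.xml;
      // the property names here are assumptions, not standard Nutch keys.
      connection = DriverManager.getConnection(
          conf.get("postgres.url", "jdbc:postgresql://localhost:5432/nutch"),
          conf.get("postgres.user", "nutch"),
          conf.get("postgres.password", ""));
      upsert = connection.prepareStatement(
          "INSERT INTO nutch_pages (url, content) VALUES (?, ?) "
          + "ON CONFLICT (url) DO UPDATE SET content = EXCLUDED.content");
    } catch (SQLException e) {
      throw new IOException(e);
    }
  }

  @Override
  public void write(NutchDocument doc) throws IOException {
    try {
      upsert.setString(1, String.valueOf(doc.getFieldValue("url")));
      // The raw-content field is only available once NUTCH-2032 is in place
      // and the indexing job is configured to keep the fetched content.
      upsert.setString(2, String.valueOf(doc.getFieldValue("content")));
      upsert.executeUpdate();
    } catch (SQLException e) {
      throw new IOException(e);
    }
  }

  @Override
  public void update(NutchDocument doc) throws IOException {
    write(doc); // the upsert handles both insert and update
  }

  @Override
  public void delete(String key) throws IOException {
    // Left as an exercise: DELETE FROM nutch_pages WHERE url = key
  }

  @Override
  public void commit() throws IOException {
    // JDBC auto-commit is on by default; nothing extra to do here.
  }

  @Override
  public void close() throws IOException {
    try {
      connection.close();
    } catch (SQLException e) {
      throw new IOException(e);
    }
  }

  @Override
  public void setConf(Configuration conf) { this.conf = conf; }

  @Override
  public Configuration getConf() { return conf; }
}
```

You would also need the usual plugin packaging (a plugin.xml declaring the extension of org.apache.nutch.indexer.IndexWriter, plus an entry in plugin.includes in nutch-site.xml) for Nutch to pick the writer up.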

Alternatively, you could write a custom MapReduce job that takes a segment as input, reads the fetched content, and sends it to your DB in the reduce step.
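The segment approach can be sketched like this. It relies on the fact that a Nutch segment stores the fetched content under its content/ subdirectory as SequenceFiles of <Text, Content> pairs; the job name, class names, and the doInsert helper are hypothetical, and the JDBC details are elided for brevity.

```java
package org.example.export;

import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;
import org.apache.nutch.protocol.Content;

/** Hypothetical job: export a segment's fetched content to a database. */
public class SegmentToPostgres {

  public static class ExportReducer
      extends Reducer<Text, Content, NullWritable, NullWritable> {
    @Override
    protected void reduce(Text url, Iterable<Content> values, Context ctx) {
      for (Content content : values) {
        String html = new String(content.getContent(), StandardCharsets.UTF_8);
        // doInsert is a placeholder for your own JDBC upsert, e.g. the
        // prepared statement shown in the IndexWriter sketch above.
        doInsert(url.toString(), html);
      }
    }

    private void doInsert(String url, String html) {
      // open a connection in setup(), execute the upsert here, close in cleanup()
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "segment-to-postgres");
    job.setJarByClass(SegmentToPostgres.class);
    // args[0] is the segment directory; content records live under content/.
    FileInputFormat.addInputPath(job, new Path(args[0], Content.DIR_NAME));
    job.setInputFormatClass(SequenceFileInputFormat.class);
    job.setMapperClass(Mapper.class); // identity mapper
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(Content.class);
    job.setReducerClass(ExportReducer.class);
    job.setOutputFormatClass(NullOutputFormat.class); // all output goes to the DB
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Doing the export in the reducer (rather than the mapper) groups records by URL, so you open far fewer database connections and can batch the inserts.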

