Nutch: best option for persistent storage of raw crawl data on EMR

I have to crawl around 30k to 50k domains with Nutch 1.x on the AWS EMR service. The crawl will be gradual, i.e., first crawl all pages and later only new or updated pages of these websites. For indexing, I am using Apache Solr. I have a few questions about best practices with EMR:

  1. If I have to re-index or analyze old crawled data, I think the raw data should be stored on S3. Is that the right option?
  2. For the first question, is it better to configure S3 as the back-end storage for HDFS, or should I copy the folder to S3 manually at the end of the EMR job?
  3. In either case, how can I compress the raw data when importing/exporting it between the EMR cluster and S3, to optimize storage in S3?
  4. How can I instruct Nutch to crawl only newly found pages from the given seeds?

  1. Nutch is able to read/write directly from S3, see using-s3-as-nutch-storage-system (a config sketch follows after this list).
  2. Writing segments and CrawlDb directly to S3 makes sense, but keeping them on HDFS and then copying (distcp) to S3 at the end of the job is also possible (see the distcp example below).
  3. See mapreduce.output.fileoutputformat.compress.codec - org.apache.hadoop.io.compress.ZStandardCodec is a good option (settings shown below).
  4. (Better ask this again separately.) Do the crawled domains all provide sitemaps? Otherwise, the challenge is to discover as many new URLs as possible while re-fetching as few already-known pages as possible. If you want all new pages, or want to make sure that removed pages are recognized as such, it is easier to recrawl everything.
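
For point 1, a minimal sketch of how the S3A connector could be configured in conf/nutch-site.xml (or core-site.xml) so that Nutch reads and writes crawl data directly on S3. The bucket name is a placeholder, and on EMR the access/secret key properties can usually be omitted because the instance profile's IAM role provides credentials (EMRFS s3:// URIs are also an option there):

    <!-- Sketch only: example S3A settings; "my-nutch-bucket" is a placeholder. -->
    <property>
      <name>fs.defaultFS</name>
      <value>s3a://my-nutch-bucket</value>
    </property>
    <property>
      <name>fs.s3a.access.key</name>
      <value>YOUR_ACCESS_KEY</value>
    </property>
    <property>
      <name>fs.s3a.secret.key</name>
      <value>YOUR_SECRET_KEY</value>
    </property>

With fs.defaultFS pointing at the bucket, relative paths such as crawl/crawldb and crawl/segments resolve to S3 objects; alternatively, leave fs.defaultFS on HDFS and pass fully qualified s3a:// paths to the individual Nutch jobs.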
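For point 2, a sketch of copying a finished crawl from HDFS to S3 at the end of the EMR step; the paths and bucket are placeholders. On EMR, s3-dist-cp is the S3-optimized variant of plain distcp:

    # copy the crawl directory from HDFS to S3 (placeholder paths)
    hadoop distcp hdfs:///user/hadoop/crawl s3a://my-nutch-bucket/crawl

    # EMR-specific alternative, optimized for S3
    s3-dist-cp --src hdfs:///user/hadoop/crawl --dest s3://my-nutch-bucket/crawl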
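For point 3, a sketch of the compression settings, again for conf/nutch-site.xml. These are standard Hadoop MapReduce output properties, so the crawl output is written compressed whether it goes to HDFS or S3:

    <property>
      <name>mapreduce.output.fileoutputformat.compress</name>
      <value>true</value>
    </property>
    <property>
      <name>mapreduce.output.fileoutputformat.compress.type</name>
      <value>BLOCK</value>
    </property>
    <property>
      <name>mapreduce.output.fileoutputformat.compress.codec</name>
      <value>org.apache.hadoop.io.compress.ZStandardCodec</value>
    </property>

Note that ZStandardCodec requires a Hadoop build with native zstd support; if that is not available on the cluster, GzipCodec is a portable fallback.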
