I have to crawl around 30k to 50k domains with Nutch 1.x on the AWS EMR service. The crawl will be incremental, i.e., first crawl all pages, and later fetch only new or updated pages for these websites. For indexing, I am using Apache Solr. I have a few questions about best practices with EMR.
org.apache.hadoop.io.compress.ZStandardCodec is a good option.
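To make that concrete, here is a minimal sketch of how the codec could be switched on through an EMR configuration classification. It assumes the codec is meant for compressing MapReduce map output and job output, and that the cluster's Hadoop build ships with native zstd support; neither is stated above. The helper name `zstd_mapred_site` is made up for illustration; the property names are the standard Hadoop `mapred-site` compression settings.

```python
import json

# Codec class named in the answer above.
ZSTD_CODEC = "org.apache.hadoop.io.compress.ZStandardCodec"


def zstd_mapred_site(codec: str = ZSTD_CODEC) -> list:
    """Build an EMR-style configuration list that enables compression.

    Hypothetical helper: it only assembles a data structure and does not
    talk to AWS. Assumes the codec is wanted for both intermediate map
    output and final job output.
    """
    return [
        {
            "Classification": "mapred-site",
            "Properties": {
                # Compress intermediate map output (shuffle data).
                "mapreduce.map.output.compress": "true",
                "mapreduce.map.output.compress.codec": codec,
                # Compress final job output written by FileOutputFormat.
                "mapreduce.output.fileoutputformat.compress": "true",
                "mapreduce.output.fileoutputformat.compress.codec": codec,
            },
        }
    ]


if __name__ == "__main__":
    # Print JSON that could be saved to a file and supplied as the
    # cluster's configurations when it is created.
    print(json.dumps(zstd_mapred_site(), indent=2))
```

Whether this is worth doing depends on the Hadoop version in your EMR release, so it is worth checking that the zstd native library is actually loaded on the nodes before relying on it.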