简体   繁体   中英

Java crawl web and store in cassandra

I have a java project for which I'd like to use a pre-built web crawler that gives me enough flexibility to be able to control which urls are crawled and then once the crawler has the output I want to control where to put it (cassandra with my own schema).

The big picture is I want to feed in a list of urls (Google and Bing searches) and then filter the urls that are returned. I want it to then crawl the filtered urls (I may possibly want to change the url query string, but that's not a hard requirement). I want to take the resulting html and parse it using Tika then pull the data out and store it.

I'm looking at Apache Droids, it's a good fit since it seems to do everything I've mentioned but there isn't any real documentation. I'd consider Nutch or Heritrix but the use cases seem to be more a full solution and after skimming I don't see anything that talks about how to do what want.

Does anyone have any experience with this type of thing? I mostly need some recommendations, but if you know of examples doing this sort of thing that'd be nice as well since I'm still pretty new to java.

I wouldn't say Droids is a well established framework yet. If you compare it to Nutch, which has a lot of history behind, I would expect it to be less stable and less documented. I have no experience with Droids, though.

As far as storing data in cassandra, I would recommend either https://github.com/Netflix/astyanax or Hector https://github.com/hector-client/hector

I have used extensively Hector in the last year and have found it to be extremely simple and easy to use. It is faster to develop in Hector than its predecessors: pure Thrift/Pelops, but Hector is flexible enough to allow you to do the nitty gritty things which you expect from Thrift.

Recently I have also been eyeing astyanax as it is developed/supported by a larger team and tested on a larger scale, which is important for my current field of work. However, Hector is usually faster in implementing new features in new cassandra releases, so both libraries have their benefits.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM