
Increase web scraper efficiency

I am creating a Java application to scrape data from a particular XXX website, and I want to store a desired set of data in my MSSQL database. The dataset is around 100,000+ rows in MSSQL.

What I do is scrape the data, process it according to my requirements, and then store it in the database as well as in my Elasticsearch index. The whole process takes around 2 days or more for a single run. I use JSoup for parsing the data.
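With 100,000+ rows, per-row INSERTs are a common bottleneck: each statement is a separate round trip to MSSQL. JDBC batching amortizes that cost. A minimal sketch, assuming a `scraped_items(title, url)` table (the table and column names are illustrative, not from the question):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.util.List;

public class BatchWriter {
    private static final int BATCH_SIZE = 1000; // flush every 1000 rows; tune for your setup

    // rows: already-processed records as {title, url} pairs (an assumed shape)
    static void saveAll(Connection conn, List<String[]> rows) throws Exception {
        conn.setAutoCommit(false); // commit per batch, not per row
        try (PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO scraped_items (title, url) VALUES (?, ?)")) {
            int count = 0;
            for (String[] row : rows) {
                ps.setString(1, row[0]);
                ps.setString(2, row[1]);
                ps.addBatch();           // queue the row locally
                if (++count % BATCH_SIZE == 0) {
                    ps.executeBatch();   // one round trip for the whole batch
                    conn.commit();
                }
            }
            ps.executeBatch(); // flush the remainder
            conn.commit();
        }
    }
}
```

With the Microsoft JDBC driver, adding `sendStringParametersAsUnicode=false` or using bulk-copy APIs can help further, but plain batching alone usually cuts insert time substantially.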

What I want to know is how I can increase the efficiency of my application so that it can scrape and save the data in less time. I already use executor services to run parts of the process in parallel.
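Since scraping is I/O-bound, most of the wall-clock time is spent waiting on the network, so a bounded thread pool fetching pages concurrently is the usual first win. A minimal sketch with `ExecutorService.invokeAll`; the fetch step is pluggable so the real call (e.g. `Jsoup.connect(url).get()`) can be dropped in, and the thread count is an assumption to tune against the target site's rate limits:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

public class ParallelScraper {
    // fetch maps a URL to its page content; keeping it pluggable lets you
    // swap in Jsoup (or a stub in tests) without touching the pool logic
    static List<String> scrapeAll(List<String> urls,
                                  Function<String, String> fetch,
                                  int threads) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<String>> tasks = new ArrayList<>();
            for (String url : urls) tasks.add(() -> fetch.apply(url));
            List<String> results = new ArrayList<>();
            for (Future<String> f : pool.invokeAll(tasks)) { // preserves input order
                try {
                    results.add(f.get());
                } catch (ExecutionException e) {
                    results.add(null); // a failed page; log and retry as needed
                }
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }
}
```

If the pool is already in place and the run still takes days, profile where the time actually goes: per-row database writes and synchronous indexing into Elasticsearch are frequent culprits alongside the fetching itself.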

Instead of hand-crafting such an application, you can rely on distributed web-crawler technology such as StormCrawler. It is even capable of indexing the pages into an Elasticsearch instance.

If you want to store additional information, you can easily implement a custom Bolt for the MSSQL part of your process. However, using this framework requires setting up an Apache Storm cluster environment, which might take some time and computational resources. This will speed up the process you described above drastically.
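A custom Bolt is a small class against the Storm bolt API. A sketch of what the MSSQL writer could look like (not runnable without the `storm-client` dependency; the connection string, table, and tuple field names are assumptions, not StormCrawler's actual schema):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.Map;

import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Tuple;

// Receives parsed pages from upstream bolts and writes them to MSSQL.
public class MssqlIndexerBolt extends BaseRichBolt {
    private OutputCollector collector;
    private transient Connection conn;

    @Override
    public void prepare(Map<String, Object> conf, TopologyContext ctx,
                        OutputCollector collector) {
        this.collector = collector;
        try {
            // illustrative connection string
            conn = DriverManager.getConnection(
                    "jdbc:sqlserver://host;databaseName=scrape", "user", "pass");
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    @Override
    public void execute(Tuple input) {
        try (PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO scraped_items (url, text) VALUES (?, ?)")) {
            ps.setString(1, input.getStringByField("url"));
            ps.setString(2, input.getStringByField("text"));
            ps.executeUpdate();
            collector.ack(input);  // ack so Storm does not replay the tuple
        } catch (Exception e) {
            collector.fail(input); // let Storm retry the tuple on failure
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // terminal bolt: nothing emitted downstream
    }
}
```

The bolt is then wired into the topology after the parsing stage; the ack/fail calls tie the database write into Storm's replay guarantees.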

