
Increase web scraper efficiency

I am creating a Java application to scrape data from a particular XXX website, and I want to store a desired set of data in my MSSQL database. The dataset is around 100,000+ rows in MSSQL.

What I do is scrape the data, process it according to my requirements, and then store it in the database as well as in my Elasticsearch index. The whole process takes around 2 days or more for a single run. I use JSoup for parsing the data.
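With 100,000+ rows, per-row INSERTs are a common bottleneck: each statement is a separate round trip to MSSQL. JDBC batching amortizes that cost. A minimal sketch, assuming a `scraped_items(title, url)` table (the table and column names are illustrative, not from the question):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.util.List;

public class BatchWriter {
    private static final int BATCH_SIZE = 1000; // flush every 1000 rows; tune for your setup

    // rows: already-processed records as {title, url} pairs (an assumed shape)
    static void saveAll(Connection conn, List<String[]> rows) throws Exception {
        conn.setAutoCommit(false); // commit per batch, not per row
        try (PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO scraped_items (title, url) VALUES (?, ?)")) {
            int count = 0;
            for (String[] row : rows) {
                ps.setString(1, row[0]);
                ps.setString(2, row[1]);
                ps.addBatch();           // queue the row locally
                if (++count % BATCH_SIZE == 0) {
                    ps.executeBatch();   // one round trip for the whole batch
                    conn.commit();
                }
            }
            ps.executeBatch(); // flush the remainder
            conn.commit();
        }
    }
}
```

With the Microsoft JDBC driver, adding `sendStringParametersAsUnicode=false` or using bulk-copy APIs can help further, but plain batching alone usually cuts insert time substantially.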

What I want to know is how I can increase the efficiency of my application so that it can scrape and save the data in less time. I already use executor services to run parts of the process in parallel.
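Since scraping is I/O-bound, most of the wall-clock time is spent waiting on the network, so a bounded thread pool fetching pages concurrently is the usual first win. A minimal sketch with `ExecutorService.invokeAll`; the fetch step is pluggable so the real call (e.g. `Jsoup.connect(url).get()`) can be dropped in, and the thread count is an assumption to tune against the target site's rate limits:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

public class ParallelScraper {
    // fetch maps a URL to its page content; keeping it pluggable lets you
    // swap in Jsoup (or a stub in tests) without touching the pool logic
    static List<String> scrapeAll(List<String> urls,
                                  Function<String, String> fetch,
                                  int threads) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<String>> tasks = new ArrayList<>();
            for (String url : urls) tasks.add(() -> fetch.apply(url));
            List<String> results = new ArrayList<>();
            for (Future<String> f : pool.invokeAll(tasks)) { // preserves input order
                try {
                    results.add(f.get());
                } catch (ExecutionException e) {
                    results.add(null); // a failed page; log and retry as needed
                }
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }
}
```

If the pool is already in place and the run still takes days, profile where the time actually goes: per-row database writes and synchronous indexing into Elasticsearch are frequent culprits alongside the fetching itself.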

Instead of hand-crafting such an application, you can rely on distributed web-crawler technology such as StormCrawler. It is even capable of indexing the pages into an Elasticsearch instance.

If you want to store additional information, you can easily implement a custom Bolt for the MSSQL part of your process. However, using this framework requires setting up an Apache Storm cluster environment, which might take some time and computational resources. This will speed up the process you described above drastically.
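A custom Bolt is a small class against the Storm bolt API. A sketch of what the MSSQL writer could look like (not runnable without the `storm-client` dependency; the connection string, table, and tuple field names are assumptions, not StormCrawler's actual schema):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.Map;

import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Tuple;

// Receives parsed pages from upstream bolts and writes them to MSSQL.
public class MssqlIndexerBolt extends BaseRichBolt {
    private OutputCollector collector;
    private transient Connection conn;

    @Override
    public void prepare(Map<String, Object> conf, TopologyContext ctx,
                        OutputCollector collector) {
        this.collector = collector;
        try {
            // illustrative connection string
            conn = DriverManager.getConnection(
                    "jdbc:sqlserver://host;databaseName=scrape", "user", "pass");
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    @Override
    public void execute(Tuple input) {
        try (PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO scraped_items (url, text) VALUES (?, ?)")) {
            ps.setString(1, input.getStringByField("url"));
            ps.setString(2, input.getStringByField("text"));
            ps.executeUpdate();
            collector.ack(input);  // ack so Storm does not replay the tuple
        } catch (Exception e) {
            collector.fail(input); // let Storm retry the tuple on failure
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // terminal bolt: nothing emitted downstream
    }
}
```

The bolt is then wired into the topology after the parsing stage; the ack/fail calls tie the database write into Storm's replay guarantees.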

