
Java multithreaded web scraper that extracts data every second while allowing a consumer to retrieve data

I am working on a multi-threaded web application in Java.

The application has two threads: a web scraper and a thread that performs some computation (similar to a producer and a consumer). The scraper continuously reads data from a third-party API (a world population figure that updates every second). The other thread (the consumer) continuously retrieves data from the scraper and computes the fastest rate of change in each minute.

My problem is that the scraper needs to extract data every second without interruption. When the consumer retrieves data, it has to lock the scraper's variable (e.g. a buffer) where the data is recorded. However, that may prevent the scraper from recording data every second. Is there a way to let the consumer retrieve data without blocking the scraper from extracting data each second?

Have a look at the BlockingQueue Javadoc. Its implementations are thread-safe, so your producer and consumer threads can safely communicate with each other over the queue. If you are worried about "missing a beat" when handing the scraper result over to the consumer, start a new scraper thread every second. Then, if one scraper thread has to wait while handing over a result, it does not affect the scraping done by the other threads. If the scraping result is timestamped, you can deal with possibly out-of-order messages at the consumer level, or at the queue level by using a PriorityQueue. Note that PriorityQueue is not thread-safe, though.
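As a rough illustration of the queue-based hand-off, here is a minimal sketch. It assumes an unbounded LinkedBlockingQueue (so `offer` never makes the scraper wait) and a single ScheduledExecutorService that fires once per period, rather than one new thread per second. The names `ScraperQueueSketch`, `Sample`, `startScraper` and `fastestRatePerSecond` are hypothetical, and the `LongSupplier` stands in for the real third-party API call, which is not shown in the question. If you need timestamp ordering at the queue level with several scraper threads, `java.util.concurrent.PriorityBlockingQueue` is a thread-safe alternative to PriorityQueue.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.LongSupplier;

class ScraperQueueSketch {

    /** One timestamped reading (hypothetical type, not from the original post). */
    static final class Sample {
        final long timestampMillis;
        final long population;

        Sample(long timestampMillis, long population) {
            this.timestampMillis = timestampMillis;
            this.population = population;
        }
    }

    /** Largest absolute change between consecutive samples, in units per second. */
    static double fastestRatePerSecond(List<Sample> samples) {
        double max = 0.0;
        for (int i = 1; i < samples.size(); i++) {
            Sample prev = samples.get(i - 1);
            Sample cur = samples.get(i);
            long dtMillis = cur.timestampMillis - prev.timestampMillis;
            if (dtMillis <= 0) {
                continue; // skip duplicate or out-of-order timestamps
            }
            double rate = Math.abs(cur.population - prev.population) * 1000.0 / dtMillis;
            max = Math.max(max, rate);
        }
        return max;
    }

    /**
     * Starts a scraper that reads from `source` once per period and enqueues the result.
     * `source` is a placeholder for the real API call. Note that if the task throws,
     * scheduleAtFixedRate suppresses later runs, so real code should catch exceptions.
     */
    static ScheduledExecutorService startScraper(
            LongSupplier source, BlockingQueue<Sample> queue, long periodMillis) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(
                // offer() on an unbounded queue returns immediately, so the scraper never waits
                () -> queue.offer(new Sample(System.currentTimeMillis(), source.getAsLong())),
                0, periodMillis, TimeUnit.MILLISECONDS);
        return scheduler;
    }

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<Sample> queue = new LinkedBlockingQueue<>();
        long[] fakePopulation = {8_000_000_000L}; // made-up starting value
        ScheduledExecutorService scraper =
                startScraper(() -> fakePopulation[0] += 3, queue, 1000);

        List<Sample> window = new ArrayList<>();
        for (int i = 0; i < 5; i++) {
            window.add(queue.take()); // take() blocks only the consumer thread
        }
        scraper.shutdownNow();
        System.out.println("fastest rate per second: " + fastestRatePerSecond(window));
    }
}
```

With this arrangement the consumer never touches the scraper's own state; it only removes items from the queue, so no lock is held on anything the scraper writes to for longer than a single enqueue. A real consumer would collect samples for a full minute before computing the rate, instead of the five samples used here.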
