
How to lock a code block based on a certain condition?

EDIT: I've added a table example (see the Google Sheets link) and a description of what the resulting apple object should look like.

I've programmed a multi-threaded web scraper using Jsoup, which extracts information from a website and saves it into a map. The main thing I can't get to work is making the program NOT connect to the website if it has already scraped the information for a given word.

Information about the program

It extracts information from a table on a website and starts a thread for every word in the table.

So each thread is started with a certain word as a class member. Every thread also shares the same ConcurrentHashMap object. My plan was to check whether the word already exists in the map as a key.
If NOT, the thread should connect to a website to get information about the word, add some data to it, and put it in the map afterwards.
If the map already contains the word, the thread should get the value from the map and only add the data to it.

So the main goal is NOT to connect to the website twice for the same word.

Here are the relevant code snippets:

Main class
Starting a thread for every word in the table. "element" contains the word and a URL for more information about the word.

for (Element element : allRelevantTableElements) {
    executorService.execute(new Worker(element, data, concurrentMap));
}

Worker class
1. Check if the word is already in the map.
2a. If it is in the map, just add data to it.
2b. If it is not in the map, scrape information from the website and then add data to it.

public class Worker implements Runnable {

    MyWebScraper scraper;
    Element element;
    String data;
    ConcurrentMap<String, Fruit> concurrentMap;

    public Worker(Element element, String data, ConcurrentMap<String, Fruit> concurrentMap) {
        this.element = element;
        this.data = data;
        this.concurrentMap = concurrentMap;
    }

    @Override
    public void run() {

        Fruit fruit;

        // 1. Check if the word is already in the map.
        if (concurrentMap.containsKey(element.text())) {
            // 2a. Already known: only add this row's data.
            fruit = concurrentMap.get(element.text());
            fruit.addData(data);
        } else {
            // 2b. Not known yet: scrape the website, then add this row's data.
            scraper = new MyWebScraper("http://fruitinformation.com" + element.attr("href"));
            scraper.connect();
            fruit = scraper.getInformation();
            fruit.addData(data);
        }

        concurrentMap.put(element.text(), fruit);
    }
}

Example
Let's say the table looks like this:

https://docs.google.com/spreadsheets/d/1JF8sh8Sp9y0SV3Xb5mlISgcJp5s_DhaSp3KbnQLa248/edit?usp=sharing

The main class will start 3 threads:
Thread 1: Element contains "Apple" and the sub-URL "/apple",
Data contains "1,20€"
Thread 2: Element contains "Orange" and the sub-URL "/orange",
Data contains "2,40€"
Thread 3: Element contains "Apple" and the sub-URL "/apple",
Data contains "1,50€"

The problem is that all threads run almost simultaneously, so threads 1 and 3 will both check whether "Apple" is already in the map and BOTH will get false as the result. So they BOTH connect to fruitinformation.com/apple and fetch the basic information about apples, which I only need once. Then BOTH add their data to the returned object and put it in the map: thread 1 does that first with "1,20€", and then thread 3 overwrites the "1,20€" apple with its "1,50€" apple as the value.

However, the goal is that only ONE apple thread connects to the website and adds its data (for example 1,20€), and then the other one realizes that an apple object already exists in the map and only adds its data (1,50€) to the existing apple. The fruit objects have Lists for that.
So the resulting map entry should look like this:
Key=Apple , Value= Fruit["Apple", basicInformationFromWebsite, List["1,20€"; "1,50€"]]

The other thread (orange) should run totally unaffected by all this. So all different fruits should run simultaneously, but elements with the same fruit have to respect each other somehow. Is there a type of synchronization which only blocks instances with the same fruit name, but doesn't block any other instances?


I've read a lot about synchronization, locks, etc., but can't find a solution for my problem.
It would be nice if someone could help me, thanks in advance!

XY problem. Synchronisation won't fix this. Even assuming you could implement it, the second thread would just be blocked by the first and then proceed to do the unwanted crawl.

You could add a Set of words that have begun to be processed, or add a dummy element into the map that shows the word is already being processed although not yet complete.
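
The dummy-element idea could be sketched roughly as follows. This is only an illustration under assumptions: the Fruit class here is a hypothetical stand-in with a thread-safe data list, the scrape is simulated by a counter, and PlaceholderDemo and process are made-up names. The key point is that putIfAbsent is atomic, so exactly one thread per word sees null and knows it has to do the scrape.

```java
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class PlaceholderDemo {
    /** Hypothetical stand-in for the question's Fruit class. */
    static class Fruit {
        final String name;
        volatile String basicInfo;  // stays null until the claiming thread has scraped
        final List<String> data = new CopyOnWriteArrayList<>();
        Fruit(String name) { this.name = name; }
    }

    final ConcurrentMap<String, Fruit> map = new ConcurrentHashMap<>();
    final AtomicInteger connections = new AtomicInteger();  // counts simulated site visits

    void process(String word, String data) {
        Fruit placeholder = new Fruit(word);
        // Atomic: returns null only for the one thread that inserted the placeholder.
        Fruit existing = map.putIfAbsent(word, placeholder);
        Fruit fruit = (existing != null) ? existing : placeholder;
        fruit.data.add(data);
        if (existing == null) {
            // Only the claiming thread "connects"; the assignment stands in for the real scrape.
            connections.incrementAndGet();
            fruit.basicInfo = "basic info for " + word;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        PlaceholderDemo demo = new PlaceholderDemo();
        ExecutorService pool = Executors.newFixedThreadPool(3);
        pool.execute(() -> demo.process("Apple", "1,20 EUR"));
        pool.execute(() -> demo.process("Orange", "2,40 EUR"));
        pool.execute(() -> demo.process("Apple", "1,50 EUR"));
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        System.out.println("connections = " + demo.connections.get());
        System.out.println("Apple data = " + demo.map.get("Apple").data);
    }
}
```

One caveat of this variant: a thread that finds the placeholder may read basicInfo while it is still null, because the claiming thread has not finished scraping yet. If other threads need the scraped information immediately, they have to wait for it in some way.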

If you get the total list of words first, then just pre-populate the map with placeholder values. Then you only need to start one thread for each of the keys in your map.
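
A minimal sketch of that idea, assuming the table rows are available up front as word/data pairs (the String[][] layout and the names GroupFirstDemo and groupByWord are made up for illustration): group the data by word first, then start exactly one task per distinct word, so no two tasks ever compete for the same key.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class GroupFirstDemo {
    /** Collects all data values per word, keeping the words in first-seen order. */
    static Map<String, List<String>> groupByWord(String[][] rows) {
        Map<String, List<String>> grouped = new LinkedHashMap<>();
        for (String[] row : rows) {
            // row[0] is the word, row[1] is the data for that row
            grouped.computeIfAbsent(row[0], k -> new ArrayList<>()).add(row[1]);
        }
        return grouped;
    }

    public static void main(String[] args) throws InterruptedException {
        String[][] rows = {
            {"Apple", "1,20 EUR"}, {"Orange", "2,40 EUR"}, {"Apple", "1,50 EUR"}
        };
        Map<String, List<String>> grouped = groupByWord(rows);

        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (Map.Entry<String, List<String>> e : grouped.entrySet()) {
            // One task per distinct word; the println stands in for the real scrape.
            pool.execute(() ->
                System.out.println("scrape " + e.getKey() + " once, then add " + e.getValue()));
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
    }
}
```

The grouping happens on a single thread before any worker starts, so a plain LinkedHashMap is enough here and no synchronization between workers is needed.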

Not sure my answer is in line with how you've structured your app, but what follows is the "correct" way of handling your type of problem, which is quite common in parallel applications.

It is certainly doable to get what you want and avoid "double" computation. I suggest you read Java Concurrency in Practice, more specifically chapter 5 (I think), where they memoize the results of expensive computations and also have to avoid two threads calculating the same number.

Some tricks you can apply: use putIfAbsent (a method that only puts an item into a map if the key is not already present). More to the point, however, I suggest you store Futures in your map instead. https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/Future.html They represent the result of a computation. That way the computation can be in progress, you can be certain it will not be computed twice, and both threads still get the result, since each just calls future.get(), which blocks until a result is available. I won't go into much detail, as it is shown very nicely in the Java concurrency book.

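So something like this (a sketch; scrape(word) is a hypothetical method that does the actual website lookup):

Future<Fruit> f = map.get(word);
if (f == null) {
    // Not seen yet: wrap the scrape in a FutureTask (not started at this point).
    FutureTask<Fruit> task = new FutureTask<>(() -> scrape(word));
    f = map.putIfAbsent(word, task);   // atomic; returns null if we inserted
    if (f == null) {
        f = task;
        task.run();                    // only the inserting thread does the scrape
    }
}
Fruit fruit = f.get();                 // all other threads block here until the result is ready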
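
To make the Future idea concrete for the fruit example, here is a self-contained sketch that can be compiled and run. It rests on assumptions: Fruit is a hypothetical stand-in with a thread-safe data list, scrape only simulates the website visit and counts how often it is called, and the names FutureScrapeDemo, lookup and runAll are made up. The structure follows the memoizer pattern the book describes, as far as I recall it.

```java
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.FutureTask;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class FutureScrapeDemo {
    /** Hypothetical stand-in for the question's Fruit; the data list is thread-safe. */
    static class Fruit {
        final String name;
        final String basicInfo;
        final List<String> data = new CopyOnWriteArrayList<>();
        Fruit(String name, String basicInfo) { this.name = name; this.basicInfo = basicInfo; }
    }

    final ConcurrentMap<String, Future<Fruit>> map = new ConcurrentHashMap<>();
    final AtomicInteger connections = new AtomicInteger();  // counts simulated site visits

    /** Stands in for scraper.connect() followed by scraper.getInformation(). */
    Fruit scrape(String word) {
        connections.incrementAndGet();
        return new Fruit(word, "basic info for " + word);
    }

    /** Returns the shared Fruit for a word, scraping at most once per word. */
    Fruit lookup(String word) throws InterruptedException, ExecutionException {
        Future<Fruit> f = map.get(word);
        if (f == null) {
            FutureTask<Fruit> task = new FutureTask<>(() -> scrape(word));
            f = map.putIfAbsent(word, task);  // atomic; null means this thread inserted
            if (f == null) {
                f = task;
                task.run();                   // only the inserting thread scrapes
            }
        }
        return f.get();                       // blocks until the scrape has finished
    }

    /** Runs one worker per table row; each row is {word, data}. */
    void runAll(String[][] rows) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (String[] row : rows) {
            pool.execute(() -> {
                try {
                    lookup(row[0]).data.add(row[1]);
                } catch (InterruptedException | ExecutionException e) {
                    throw new IllegalStateException(e);
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
    }

    public static void main(String[] args) throws Exception {
        FutureScrapeDemo demo = new FutureScrapeDemo();
        demo.runAll(new String[][] {
            {"Apple", "1,20 EUR"}, {"Orange", "2,40 EUR"}, {"Apple", "1,50 EUR"}
        });
        System.out.println("connections = " + demo.connections.get());  // 2: one per distinct word
        System.out.println("Apple data = " + demo.lookup("Apple").data);
    }
}
```

Threads working on different words never wait for each other; only a second thread for the same word blocks in f.get() until the first one has finished scraping. The order of the entries in the Apple data list depends on thread scheduling.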
