How to make this piece of code thread safe?
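This code is part of a method. It uses two for loops to iterate over two lists. I would like to see whether multithreading can speed up the two loops. My concern is how to make it thread safe.
Edit: more complete code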
    static class Similarity {
        double similarity;
        String seedWord;
        String candidateWord;

        public Similarity(double similarity, String seedWord, String candidateWord) {
            this.similarity = similarity;
            this.seedWord = seedWord;
            this.candidateWord = candidateWord;
        }

        public double getSimilarity() {
            return similarity;
        }

        public String getSeedWord() {
            return seedWord;
        }

        public String getCandidateWord() {
            return candidateWord;
        }
    }

    static class SimilarityTask implements Callable<Similarity> {
        Word2Vec vectors;
        String seedWord;
        String candidateWord;
        Collection<String> label1;
        Collection<String> label2;

        public SimilarityTask(Word2Vec vectors, String seedWord, String candidateWord, Collection<String> label1, Collection<String> label2) {
            this.vectors = vectors;
            this.seedWord = seedWord;
            this.candidateWord = candidateWord;
            this.label1 = label1;
            this.label2 = label2;
        }

        @Override
        public Similarity call() {
            double similarity = cosineSimForSentence(vectors, label1, label2);
            return new Similarity(similarity, seedWord, candidateWord);
        }
    }
Now, is this "compute" thread safe? Three variables are involved:
1) vectors;
2) tokenizerFactory;
3) similarities;
    public static void compute() throws Exception {
        File modelFile = new File("sim.bin");
        Word2Vec vectors = WordVectorSerializer.readWord2VecModel(modelFile);
        TokenizerFactory tokenizerFactory = new TokenizerFactory();
        List<String> seedList = loadSeeds();
        List<String> candidateList = loadCandidates();
        log.info("Computing similarity: ");
        ExecutorService POOL = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        List<Future<Similarity>> tasks = new ArrayList<>();
        int totalCount = 0;
        // Submit one task per (seed, candidate) pair.
        for (String seed : seedList) {
            Collection<String> label1 = getTokens(seed.trim(), tokenizerFactory);
            if (label1.isEmpty()) {
                continue;
            }
            for (String candidate : candidateList) {
                Collection<String> label2 = getTokens(candidate.trim(), tokenizerFactory);
                if (label2.isEmpty()) {
                    continue;
                }
                Callable<Similarity> callable = new SimilarityTask(vectors, seed, candidate, label1, label2);
                tasks.add(POOL.submit(callable));
                log.info("TotalCount:" + (++totalCount));
            }
        }
        // No further tasks will be submitted; already submitted tasks still run to completion.
        POOL.shutdown();
        // Collect the results on the calling thread only.
        Map<String, Set<String>> similarities = new HashMap<>();
        int validCount = 0;
        for (Future<Similarity> task : tasks) {
            Similarity simi = task.get();
            Double similarity = simi.getSimilarity();
            String seedWord = simi.getSeedWord();
            String candidateWord = simi.getCandidateWord();
            Set<String> similarityWords = similarities.get(seedWord);
            if (similarity >= 0.85) {
                if (similarityWords == null) {
                    similarityWords = new HashSet<>();
                }
                similarityWords.add(candidateWord);
                log.info(seedWord + " " + similarity + " " + candidateWord);
                log.info("ValidCount: " + (++validCount));
            }
            if (similarityWords != null) {
                similarities.put(seedWord, similarityWords);
            }
        }
    }
Added another related method, which is used by the call() method:
    public static double cosineSimForSentence(Word2Vec vectors, Collection<String> label1, Collection<String> label2) {
        try {
            return Transforms.cosineSim(vectors.getWordVectorsMean(label1), vectors.getWordVectorsMean(label2));
        } catch (Exception e) {
            // Treated as out-of-vocabulary: log it and report zero similarity.
            log.warn("OOV: " + label1.toString() + " " + label2.toString());
            //e.getMessage();
            //e.printStackTrace();
            return 0.0;
        }
    }
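(Answer to the updated question.)
In general, you should profile code before trying to optimize it, especially when the code is complex.
For threading, you need to identify the mutable state that is shared between threads. Ideally, remove as much of it as possible before resorting to locks and concurrent data structures. Mutable state that is confined to a single thread is not a problem. Immutable things are great.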
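As a rough illustration of state confined to one thread, here is a minimal sketch in which each task handles one seed, fills its own local set, and hands it back through a Future, so that only the calling thread ever touches the result map. The class name ConfinedStateSketch and the score method are hypothetical stand-ins, not part of the code in the question, and the pool size of 4 is arbitrary.

```java
import java.util.*;
import java.util.concurrent.*;

class ConfinedStateSketch {
    // Hypothetical stand-in for the real similarity computation.
    static double score(String seed, String candidate) {
        return seed.equalsIgnoreCase(candidate) ? 1.0 : 0.0;
    }

    static Map<String, Set<String>> compute(List<String> seeds, List<String> candidates) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            List<Future<Map.Entry<String, Set<String>>>> futures = new ArrayList<>();
            for (String seed : seeds) {
                Callable<Map.Entry<String, Set<String>>> task = () -> {
                    // Created and mutated by this task only, so no locking is needed.
                    Set<String> matches = new HashSet<>();
                    for (String candidate : candidates) {
                        if (score(seed, candidate) >= 0.85) {
                            matches.add(candidate);
                        }
                    }
                    return Map.entry(seed, matches);
                };
                futures.add(pool.submit(task));
            }
            // Only the calling thread reads and writes this map.
            Map<String, Set<String>> similarities = new HashMap<>();
            for (Future<Map.Entry<String, Set<String>>> future : futures) {
                Map.Entry<String, Set<String>> entry = future.get();
                if (!entry.getValue().isEmpty()) {
                    similarities.put(entry.getKey(), entry.getValue());
                }
            }
            return similarities;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Future.get() provides the necessary visibility guarantee, so the calling thread can read each returned set without further synchronization.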
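I don't think anything passed to your tasks gets modified, though it is hard to tell. Marking the fields final is a good idea. Collections can be placed in unmodifiable wrappers, although that does not stop them from being modified through other references, and the wrapper does not show up in the static type.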
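To make the point about final fields and wrappers concrete, here is a small sketch. ImmutableTaskInput is a hypothetical class, not one from the question. It uses List.copyOf (Java 10+) to take a snapshot instead of a wrapper, and the static helper shows that a Collections.unmodifiableList wrapper is only a view that still reflects later changes made through the original reference.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.List;

final class ImmutableTaskInput {
    private final String seedWord;
    private final List<String> tokens;

    ImmutableTaskInput(String seedWord, Collection<String> tokens) {
        this.seedWord = seedWord;
        // Snapshot: later changes to the caller's collection are not visible here.
        this.tokens = List.copyOf(tokens);
    }

    String getSeedWord() {
        return seedWord;
    }

    List<String> getTokens() {
        return tokens;
    }

    // Shows that an unmodifiable wrapper is a live view of the source list.
    static int wrapperSizeAfterSourceChange() {
        List<String> source = new ArrayList<>(List.of("a"));
        List<String> view = Collections.unmodifiableList(source);
        source.add("b"); // modified through the other reference
        return view.size();
    }
}
```

Both the copy and the wrapper still have the static type List<String>, so nothing in the signature tells a caller that add will throw UnsupportedOperationException.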
Assuming you don't split up the inner loop, the only shared mutable state appears to be similarities and the values it contains.
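You may or may not find that you still end up doing too much work serially, and need to make similarities concurrent:
    ConcurrentMap<String, Set<String>> similarities = new ConcurrentHashMap<>();
The get and put on similarities will then need to be thread safe. I suggest always creating the Set: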
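    Set<String> similarityWords = similarities.getOrDefault(seed, new HashSet<>());
or
    Set<String> similarityWords = similarities.computeIfAbsent(seed, key -> new HashSet<>());
Note that getOrDefault only returns the new set without storing it in the map, whereas computeIfAbsent also stores it, atomically.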
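A minimal sketch of the computeIfAbsent variant under concurrent use follows. The class name, the generated candidate strings, and the pool size are made up for illustration. Because the add happens outside any lock here, the sketch swaps the HashSet for a concurrent set from ConcurrentHashMap.newKeySet(); that substitution is an assumption of this example, not something the answer prescribes.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class ComputeIfAbsentSketch {
    static ConcurrentMap<String, Set<String>> collect(String seed, int candidateCount) {
        ConcurrentMap<String, Set<String>> similarities = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (int i = 0; i < candidateCount; i++) {
            String candidate = "candidate" + i; // hypothetical candidate word
            pool.execute(() ->
                // At most one set is created per seed, even when tasks race.
                similarities
                    .computeIfAbsent(seed, key -> ConcurrentHashMap.newKeySet())
                    .add(candidate));
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("tasks did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return similarities;
    }
}
```

With a plain HashSet as the value, the add call would additionally need a lock.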
You could use a thread-safe Set (for example, via Collections.synchronizedSet), but I would suggest holding the relevant lock for the whole inner loop:
    synchronized (similarityWords) {
        ...
    }
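The following sketch fills in the elided loop body under stated assumptions: one task per seed runs the whole inner loop while holding the lock on that seed's set. The isSimilar method is a hypothetical placeholder for the real threshold check, and the seeds are submitted twice purely so that two tasks contend for each set.

```java
import java.util.*;
import java.util.concurrent.*;

class LockedInnerLoopSketch {
    // Hypothetical placeholder for "similarity >= 0.85".
    static boolean isSimilar(String seed, String candidate) {
        return candidate.startsWith(seed);
    }

    static Map<String, Set<String>> collect(List<String> seeds, List<String> candidates) {
        ConcurrentMap<String, Set<String>> similarities = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            List<Future<?>> futures = new ArrayList<>();
            for (int pass = 0; pass < 2; pass++) { // second pass only creates contention
                for (String seed : seeds) {
                    futures.add(pool.submit(() -> {
                        // Always create the set, as suggested above.
                        Set<String> similarityWords =
                            similarities.computeIfAbsent(seed, key -> new HashSet<>());
                        // Hold the lock once for the whole inner loop.
                        synchronized (similarityWords) {
                            for (String candidate : candidates) {
                                if (isSimilar(seed, candidate)) {
                                    similarityWords.add(candidate);
                                }
                            }
                        }
                    }));
                }
            }
            for (Future<?> future : futures) {
                future.get(); // wait for completion and surface task failures
            }
            return similarities;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Taking the lock once per loop rather than once per add keeps the locking coarse and cheap. The price of always creating the set is that seeds without any match end up mapped to an empty set.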
If you want to create similarityWords lazily, that would be "more fun".
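One possible way to handle the lazy case, offered as a sketch rather than as the approach the answer has in mind, is to do both the creation and the add inside ConcurrentHashMap.compute, which runs the whole remapping function atomically for a given key. The score method, the threshold parameter, and the class name are hypothetical.

```java
import java.util.*;
import java.util.concurrent.*;

class LazySetSketch {
    // Hypothetical stand-in for the real similarity computation.
    static double score(String seed, String candidate) {
        return candidate.startsWith(seed) ? 1.0 : 0.0;
    }

    static Map<String, Set<String>> collect(List<String> seeds, List<String> candidates, double threshold) {
        ConcurrentHashMap<String, Set<String>> similarities = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            List<Future<?>> futures = new ArrayList<>();
            for (String seed : seeds) {
                for (String candidate : candidates) {
                    futures.add(pool.submit(() -> {
                        if (score(seed, candidate) >= threshold) {
                            // Creation and add happen in one atomic step per key,
                            // so the plain HashSet is never mutated concurrently.
                            similarities.compute(seed, (key, words) -> {
                                if (words == null) {
                                    words = new HashSet<>();
                                }
                                words.add(candidate);
                                return words;
                            });
                        }
                    }));
                }
            }
            for (Future<?> future : futures) {
                future.get(); // wait for completion and surface task failures
            }
            return similarities;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

This only stays safe as long as every mutation of a set goes through compute and the sets are read after all tasks have finished. A seed that never reaches the threshold gets no entry at all.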