How to make this piece of code thread safe?

This code is part of a method. It goes through two lists using two nested for loops. I want to see whether multithreading could speed up these two loops. My concern is how to make it thread safe.

EDITED: more complete code

static class Similarity {
        double similarity;
        String seedWord;
        String candidateWord;

        public Similarity(double similarity, String seedWord, String candidateWord) {
            this.similarity = similarity;
            this.seedWord = seedWord;
            this.candidateWord = candidateWord;
        }

        public double getSimilarity() {
            return similarity;
        }

        public String getSeedWord() {
            return seedWord;
        }

        public String getCandidateWord() {
            return candidateWord;
        }
    }

    static class SimilarityTask implements Callable<Similarity> {
        Word2Vec vectors;
        String seedWord;
        String candidateWord;
        Collection<String> label1;
        Collection<String> label2;

        public SimilarityTask(Word2Vec vectors, String seedWord, String candidateWord, Collection<String> label1, Collection<String> label2) {
            this.vectors = vectors;
            this.seedWord = seedWord;
            this.candidateWord = candidateWord;
            this.label1 = label1;
            this.label2 = label2;
        }

        @Override
        public Similarity call() {
            double similarity = cosineSimForSentence(vectors, label1, label2);
            return new Similarity(similarity, seedWord, candidateWord);
        }
    }

Now, is this 'compute' method thread safe? There are 3 variables involved:

1) vectors;
2) tokenizerFactory;
3) similarities;

public static void compute() throws Exception {

        File modelFile = new File("sim.bin");
        Word2Vec vectors = WordVectorSerializer.readWord2VecModel(modelFile);

        TokenizerFactory tokenizerFactory = new TokenizerFactory();

        List<String> seedList = loadSeeds();
        List<String> candidateList = loadCandidates();

        log.info("Computing similarity: ");

        ExecutorService POOL = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        List<Future<Similarity>> tasks = new ArrayList<>();
        int totalCount=0;
        for (String seed : seedList) {
            Collection<String> label1 = getTokens(seed.trim(), tokenizerFactory);
            if (label1.isEmpty()) {
                continue;
            }
            for (String candidate : candidateList) {
                Collection<String> label2 = getTokens(candidate.trim(), tokenizerFactory);
                if (label2.isEmpty()) {
                    continue;
                }
                Callable<Similarity> callable = new SimilarityTask(vectors, seed, candidate, label1, label2);
                tasks.add(POOL.submit(callable));
                log.info("TotalCount:" + (++totalCount));
            }
        }

        Map<String, Set<String>> similarities = new HashMap<>();
        int validCount = 0;
        for (Future<Similarity> task : tasks) {
            Similarity simi = task.get();
            Double similarity = simi.getSimilarity();
            String seedWord = simi.getSeedWord();
            String candidateWord = simi.getCandidateWord();

            Set<String> similarityWords = similarities.get(seedWord);
            if (similarity >= 0.85) {
                if (similarityWords == null) {
                    similarityWords = new HashSet<>();
                }
                similarityWords.add(candidateWord);
                log.info(seedWord + " " + similarity + " " + candidateWord);
                log.info("ValidCount: "  + (++validCount));
            }

            if (similarityWords != null) {
                similarities.put(seedWord, similarityWords);
            }
        }

        // Release the worker threads so the JVM can exit once all results are collected.
        POOL.shutdown();
}

Added one more relevant method, which is used by the call() method:

public static double cosineSimForSentence(Word2Vec vectors, Collection<String> label1, Collection<String> label2) {
        try {
            return Transforms.cosineSim(vectors.getWordVectorsMean(label1), vectors.getWordVectorsMean(label2));
        } catch (Exception e) {
            log.warn("OOV: " + label1.toString() + " " + label2.toString());
            //e.getMessage();
            //e.printStackTrace();
            return 0.0;
        }
    }

(Answer updated for changed question.)

In general you should profile the code before attempting to optimise it, particularly if it is quite complex.

For threading you need to identify which mutable state is shared between threads. Ideally, eliminate as much of that as possible before resorting to locks and concurrent data structures. Mutable state that is confined to one thread isn't a problem as such. Immutables are great.

I assume nothing passed to your task gets modified. It's tricky to tell. Marking fields final is a good idea. Collections can be placed in unmodifiable wrappers, though that doesn't stop them being modified via other references and does not show itself in static types.
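As a rough illustration of the final-fields and unmodifiable-wrapper idea, here is a minimal sketch. TaskInput is a hypothetical stand-in for the inputs of SimilarityTask, not code from the question. Copying each collection before wrapping it is an extra step beyond what is described above; it is one way to close the "modified via other references" gap, at the cost of a copy per task.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.List;

public class ImmutableTaskSketch {

    // Hypothetical holder for the data handed to one task.
    public static final class TaskInput {
        private final String seedWord;
        private final String candidateWord;
        private final List<String> label1;
        private final List<String> label2;

        public TaskInput(String seedWord, String candidateWord,
                         Collection<String> label1, Collection<String> label2) {
            this.seedWord = seedWord;
            this.candidateWord = candidateWord;
            // Copy first, then wrap: nobody else holds a reference to the copy,
            // so later changes to the caller's collection cannot leak in.
            this.label1 = Collections.unmodifiableList(new ArrayList<>(label1));
            this.label2 = Collections.unmodifiableList(new ArrayList<>(label2));
        }

        public String getSeedWord() { return seedWord; }
        public String getCandidateWord() { return candidateWord; }
        public List<String> getLabel1() { return label1; }
        public List<String> getLabel2() { return label2; }
    }

    public static void main(String[] args) {
        List<String> tokens = new ArrayList<>(Arrays.asList("deep", "learning"));
        TaskInput input = new TaskInput("deep learning", "neural nets",
                tokens, Arrays.asList("neural", "nets"));

        tokens.add("mutated later"); // does not affect the task's copy
        System.out.println(input.getLabel1()); // prints [deep, learning]
    }
}
```

Note that the static type is still List<String>, so a caller only finds out at run time (via UnsupportedOperationException) that the list cannot be modified.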

Assuming you don't break up the inner loop, the only shared mutable state appears to be similarities and the values it contains.

You may or may not find you still end up doing too much serially and need to change similarities to become concurrent:

    ConcurrentMap<String, Set<String>> similarities = new ConcurrentHashMap<>();

The get and put of similarities will need to be thread-safe. I suggest always creating the Set.

        Set<String> similarityWords = similarities.getOrDefault(seed, new HashSet<>());

or

        Set<String> similarityWords = similarities.computeIfAbsent(seed, key -> new HashSet<>());
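The two variants behave differently, which matters once several threads are involved: getOrDefault only returns the fallback and leaves the map untouched, so it needs a later put, whereas computeIfAbsent stores the new Set as part of the same call. The sketch below shows the difference. The class name and the addSimilar helper are made up for illustration, and ConcurrentHashMap.newKeySet() is swapped in for HashSet so that the Set itself tolerates concurrent adds.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class ComputeIfAbsentSketch {

    // Hypothetical helper: record that candidate is similar to seed.
    public static void addSimilar(ConcurrentMap<String, Set<String>> similarities,
                                  String seed, String candidate) {
        // On ConcurrentHashMap, lookup and insertion of the Set happen atomically,
        // so two threads can never end up with different Sets for one seed.
        Set<String> words =
                similarities.computeIfAbsent(seed, key -> ConcurrentHashMap.newKeySet());
        words.add(candidate);
    }

    public static void main(String[] args) {
        ConcurrentMap<String, Set<String>> similarities = new ConcurrentHashMap<>();

        // getOrDefault hands back the fallback without storing it in the map.
        Set<String> detached =
                similarities.getOrDefault("apple", ConcurrentHashMap.newKeySet());
        detached.add("pear");
        System.out.println(similarities.containsKey("apple")); // prints false

        addSimilar(similarities, "apple", "pear");
        System.out.println(similarities.get("apple")); // prints [pear]
    }
}
```

With getOrDefault followed by put, two threads that both miss the key can each create a Set, and one of them is overwritten along with whatever was added to it.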

You could use a thread-safe Set (for instance with Collections.synchronizedSet), but I suggest holding a relevant lock for the entire inner loop.

synchronized (similarityWords) {
    ...
}
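To make the locking suggestion concrete, here is a minimal sketch in which several workers add to the same plain HashSet, each holding the Set's monitor for its whole loop. The class name, the fill method, the "seed" key and the generated candidate strings are all placeholders rather than anything from the question.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class SynchronizedSetSketch {

    // Runs `workers` tasks that each add `perWorker` distinct words for one seed
    // and returns how many words ended up in the shared Set.
    public static int fill(int workers, int perWorker) {
        ConcurrentMap<String, Set<String>> similarities = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, workers));
        try {
            for (int w = 0; w < workers; w++) {
                final int worker = w;
                pool.execute(() -> {
                    Set<String> words =
                            similarities.computeIfAbsent("seed", key -> new HashSet<>());
                    // One lock acquisition for the whole loop, not one per add.
                    synchronized (words) {
                        for (int i = 0; i < perWorker; i++) {
                            words.add("candidate-" + worker + "-" + i);
                        }
                    }
                });
            }
        } finally {
            pool.shutdown();
        }
        try {
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("workers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        Set<String> words = similarities.get("seed");
        if (words == null) {
            return 0; // no worker ran
        }
        synchronized (words) {
            return words.size();
        }
    }

    public static void main(String[] args) {
        System.out.println(fill(4, 1000)); // prints 4000
    }
}
```

Holding the lock across the loop serialises the workers that share a seed, so this only pays off if the expensive part (the similarity computation) happens outside the synchronized block.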

If you wanted to create similarityWords lazily then it would be "more fun".

