What is the most efficient way to count how many words from an array match a string?

Question

Let's say I have a string:

String test = "This is a test string and I have some stopwords in here";

and I want to see how many of the words in the array below match against my string.

Pseudocode:

array = a,and,the,them,they,I

So the answer would be "3".

I'm just curious what the most efficient way to do that in Java is.

Answer 1

I'd probably store the words in the input in a HashSet, then iterate over the array and check with .contains whether each word in the array is in the set.

Here it is in code. The input is "Around the World in 80 Days".

import java.io.File;
import java.io.FileNotFoundException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Scanner;
import java.util.Set;

public class Main
{
    public static void main(final String[] argv)
        throws FileNotFoundException
    {
        final File     file;
        final String[] wordsToFind;

        file        = new File(argv[0]);
        wordsToFind = getWordsToFind(file);
        a(file, wordsToFind);
        b(file, wordsToFind);
        c(file, wordsToFind);
        d(file, wordsToFind);
    }

    // reads every distinct word of the file into an array (this also pulls the file into the disk cache)
    private static String[] getWordsToFind(final File file)
        throws FileNotFoundException
    {
        final Scanner     scanner;
        final Set<String> words;

        scanner = new Scanner(file);
        words   = new HashSet<String>();

        while(scanner.hasNext())
        {
            final String word;

            word = scanner.next();
            words.add(word);
        }

        return (words.toArray(new String[words.size()]));
    }

    // bad way: read into a list and then iterate over the list until you find a match
    private static void a(final File     file,
                          final String[] wordsToFind)
        throws FileNotFoundException
    {
        final long start;
        final long end;
        final long total;
        final Scanner      scanner;
        final List<String> words;
        int                matches;

        scanner = new Scanner(file);
        words   = new ArrayList<String>();

        while(scanner.hasNext())
        {
            final String word;

            word = scanner.next();
            words.add(word);
        }

        start = System.nanoTime();

        {
            matches = 0;

            for(final String wordToFind : wordsToFind)
            {
                for(final String word : words)
                {
                    if(word.equals(wordToFind))
                    {
                        matches++;
                        break;
                    }
                }
            }

            System.out.println(matches);
        }

        end   = System.nanoTime();
        total = end - start;
        System.out.println("a: " + total);
    }

    // slightly better way: read into a set and then iterate over the set (which reduces the number of things you probably
    // have to read until you find a match), until you find a match
    private static void b(final File     file,
                          final String[] wordsToFind)
        throws FileNotFoundException
    {
        final long start;
        final long end;
        final long total;
        final Scanner     scanner;
        final Set<String> words;
        int               matches;

        scanner = new Scanner(file);
        words   = new HashSet<String>();

        while(scanner.hasNext())
        {
            final String word;

            word = scanner.next();
            words.add(word);
        }

        start = System.nanoTime();

        {
            matches = 0;

            for(final String wordToFind : wordsToFind)
            {
                for(final String word : words)
                {
                    if(word.equals(wordToFind))
                    {
                        matches++;
                        break;
                    }
                }
            }

            System.out.println(matches);
        }

        end   = System.nanoTime();
        total = end - start;
        System.out.println("b: " + total);
    }

    // my way
    private static void c(final File     file,
                          final String[] wordsToFind)
        throws FileNotFoundException
    {
        final long start;
        final long end;
        final long total;
        final Scanner     scanner;
        final Set<String> words;
        int               matches;

        scanner = new Scanner(file);
        words   = new HashSet<String>();

        while(scanner.hasNext())
        {
            final String word;

            word = scanner.next();
            words.add(word);
        }

        start = System.nanoTime();

        {
            matches = 0;

            for(final String wordToFind : wordsToFind)
            {
                if(words.contains(wordToFind))
                {
                    matches++;
                }
            }

            System.out.println(matches);
        }

        end   = System.nanoTime();
        total = end - start;
        System.out.println("c: " + total);
    }

    // Nikita Rybak way
    private static void d(final File     file,
                          final String[] wordsToFind)
        throws FileNotFoundException
    {
        final long start;
        final long end;
        final long total;
        final Scanner     scanner;
        final Set<String> words;
        int               matches;

        scanner = new Scanner(file);
        words   = new HashSet<String>();

        while(scanner.hasNext())
        {
            final String word;

            word = scanner.next();
            words.add(word);
        }

        start = System.nanoTime();

        {
            words.retainAll(new HashSet<String>(Arrays.asList(wordsToFind)));
            matches = words.size();
            System.out.println(matches);
        }

        end   = System.nanoTime();
        total = end - start;
        System.out.println("d: " + total);
    }
}

Results, in nanoseconds (after a few runs; each run is pretty much the same, though):

12596
a: 2440699000
12596
b: 2531635000
12596
c: 4507000
12596
d: 5597000

If you modify it by adding "XXX" to each of the words in getWordsToFind (so no words are found), you get:

0
a: 7415291000
0
b: 4688973000
0
c: 2849000
0
d: 7981000

And, for completeness, I tried it searching for just the word "I", and the results are:

1
a: 235000
1
b: 351000
1
c: 75000
1
d: 10725000
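
Applied to the small input from the question rather than a file, approach c can be sketched as follows. This is only an illustration: the class name ContainsCount and the method countMatches are made up here, and it assumes words are separated by whitespace, matching is case-sensitive, and the array of words to find has no duplicates.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not part of the benchmark above.
final class ContainsCount {

    // Counts how many entries of wordsToFind occur as whole words in text.
    // Assumption: wordsToFind contains no duplicates (a duplicate would be counted twice).
    static int countMatches(final String text, final String[] wordsToFind) {
        // Building the set is O(n); each contains() lookup is O(1) on average.
        final Set<String> words = new HashSet<>(Arrays.asList(text.trim().split("\\s+")));
        int matches = 0;

        for (final String wordToFind : wordsToFind) {
            if (words.contains(wordToFind)) {
                matches++;
            }
        }

        return matches;
    }

    public static void main(final String[] argv) {
        final String test = "This is a test string and I have some stopwords in here";
        final String[] array = {"a", "and", "the", "them", "they", "I"};

        System.out.println(countMatches(test, array)); // prints 3
    }
}
```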

Answer 2

Something like this? Not sure about "most efficient", but it's simple enough.

Set<String> s1 = new HashSet<String>(Arrays.asList("This is a test string and I have some stopwords in here".split("\\s")));
Set<String> s2 = new HashSet<String>(Arrays.asList("a", "and", "the", "them", "they", "I"));
s1.retainAll(s2);
System.out.println(s1.size());

It's just the intersection of two sets of words.

Answer 3

The most efficient thing to do is to sort both 'test' and 'array' and then iterate over both: n log(n) for the sorts plus n for the pass over the sorted lists.

test  -> ['a', 'and', 'have', 'here', 'in', 'is', ..., 'This']
array -> ['a', 'and', 'the', 'them', 'they', 'I']

array    test     matches
'a'      'a'      1
'a'      'and'    1
'and'    'and'    2
'and'    'have'   2
'the'    'here'   2
'the'    'in'     2
'the'    'is'     2
...
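
The sort-and-iterate idea can be sketched in Java roughly as below. The names SortedMergeCount and countMatches are hypothetical, and the sketch assumes whitespace-separated words and case-sensitive matching. It sorts with the natural String ordering, which places uppercase words such as "I" and "This" before lowercase ones, so the order differs from the illustration above; the walk over the two sorted arrays works the same way.

```java
import java.util.Arrays;

// Hypothetical helper illustrating the sort-then-merge approach.
final class SortedMergeCount {

    // Counts the distinct words that appear both in text and in wordsToFind.
    static int countMatches(final String text, final String[] wordsToFind) {
        final String[] textWords = text.trim().split("\\s+");
        final String[] targets = wordsToFind.clone(); // leave the caller's array unsorted

        Arrays.sort(textWords); // O(n log n)
        Arrays.sort(targets);   // O(m log m)

        int i = 0;
        int j = 0;
        int matches = 0;

        // Single pass over both sorted arrays: always advance the side holding the smaller word.
        while (i < textWords.length && j < targets.length) {
            final int cmp = textWords[i].compareTo(targets[j]);

            if (cmp < 0) {
                i++;
            } else if (cmp > 0) {
                j++;
            } else {
                final String matched = targets[j];
                matches++;

                // Skip repeats of the matched word on both sides so it is counted once.
                while (i < textWords.length && textWords[i].equals(matched)) {
                    i++;
                }
                while (j < targets.length && targets[j].equals(matched)) {
                    j++;
                }
            }
        }

        return matches;
    }

    public static void main(final String[] argv) {
        final String test = "This is a test string and I have some stopwords in here";
        final String[] array = {"a", "and", "the", "them", "they", "I"};

        System.out.println(countMatches(test, array)); // prints 3
    }
}
```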

Answer 4

A minor variation on Nikita's answer (+1 for Nikita). If you use a List for s1, you get the number of occurrences (in case a word appears multiple times in the sentence).

List<String> s1 = new ArrayList<String>(Arrays.asList("This is a test string and I have some stopwords in here".split("\\s")));
Set<String> s2 = new HashSet<String>(Arrays.asList("a", "and", "the", "them", "they", "I"));
s1.retainAll(s2);
System.out.println(s1.size());

Answer 5

Store your words in a hash table (a HashMap of String to Integer), then iterate over the text and increment the integer value for each matching word in the hash table. Then iterate over the hash table and sum all the integer values.
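
A minimal sketch of this idea, assuming whitespace-separated words and case-sensitive matching (the names StopwordTally and countOccurrences are invented for the example). Unlike the set-based answers, it counts every occurrence, so a stopword that appears twice in the text contributes 2 to the total.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper illustrating the HashMap tally approach.
final class StopwordTally {

    // Sums, over all words in wordsToFind, how often each one occurs in text.
    static int countOccurrences(final String text, final String[] wordsToFind) {
        final Map<String, Integer> counts = new HashMap<>();

        // Seed the table with every word to find, each starting at zero.
        for (final String wordToFind : wordsToFind) {
            counts.put(wordToFind, 0);
        }

        // Walk the text once and bump the counter of each word that is in the table.
        for (final String word : text.trim().split("\\s+")) {
            final Integer current = counts.get(word);

            if (current != null) {
                counts.put(word, current + 1);
            }
        }

        // Sum all the counters.
        int total = 0;

        for (final int count : counts.values()) {
            total += count;
        }

        return total;
    }

    public static void main(final String[] argv) {
        final String test = "This is a test string and I have some stopwords in here";
        final String[] array = {"a", "and", "the", "them", "they", "I"};

        System.out.println(countOccurrences(test, array)); // prints 3
    }
}
```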

Answer 1
5 2010-07-09 00:13:34

Answer 2
5 2010-07-09 00:16:44

Answer 3
3 2010-07-09 00:24:01

Answer 4
0 2010-07-09 01:31:35

Answer 5
0 2010-07-09 06:52:26
