
Count the occurrence of each word in a list containing sentences

I am having some problems with Java programming involving List. Basically, I am trying to count the occurrences of each word in each sentence from a list containing several sentences. The code that builds the list of sentences is as follows:

List<List<String>> sort = new ArrayList<>();
for (String sentence : complete.split("[.?!]\\s*"))
{
    sort.add(Arrays.asList(sentence.split("[ ,;:]+"))); // put each sentence in the list
}

The output from the list is as follows:

[hurricane, gilbert, head, dominican, coast]
[hurricane, gilbert, sweep, dominican, republic, sunday, civil, defense, alert, heavily, populate, south, coast, prepare, high, wind]
[storm, approach, southeast, sustain, wind, mph, mph]
[there, alarm, civil, defense, director, a, television, alert, shortly]

The desired output should be as follows (only an example). It should output every unique word in the list and count its occurrences per sentence.

Word: hurricane
Sentence 1: 1 times
Sentence 2: 1 times
Sentence 3: 0 times
Sentence 4: 0 times

Word: gilbert
Sentence 1: 0 times
Sentence 2: 2 times
Sentence 3: 1 times
Sentence 4: 0 times 

Word: head
Sentence 1: 3 times
Sentence 2: 2 times
Sentence 3: 0 times
Sentence 4: 0 times 

and so on...

With the example above, the word 'hurricane' occurs once in the first sentence, once in the second sentence, and not at all in the third and fourth sentences. How do I achieve this output? I was thinking of building a 2D matrix for this. Any help will be appreciated. Thanks!

This is a working solution. I did not take care of the printing. The result is a Map from word to array, where the array contains the count of the word in each sentence, indexed from 0. It runs in O(N) time. Try it here: https://repl.it/Bg6D

    List<List<String>> sort = new ArrayList<>();
    Map<String, ArrayList<Integer>> res = new HashMap<>();

    // split by sentence
    for (String sentence : someText.split("[.?!]\\s*")) {
        sort.add(Arrays.asList(sentence.split("[ ,;:]+"))); // put each sentence in the list
    }

    // put every word in the map with a zero count for each sentence
    final int sentenceCount = sort.size();
    sort.stream().forEach(sentence -> sentence.stream().forEach(s -> res.put(s, new ArrayList<Integer>(Collections.nCopies(sentenceCount, 0)))));

    int index = 0;
    // count the occurrences of each word for each sentence.
    for (List<String> sentence: sort) {
        for (String s : sentence) {
            res.get(s).set(index, res.get(s).get(index) + 1);
        }
        index++;
    }
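
The printing step the answer skips can be sketched like this; it is a minimal example assuming the word-to-counts map built above (the sample counts and the helper name `formatWord` are illustrative):

```java
import java.util.*;

public class PrintCounts {
    // Format one word's per-sentence counts in the desired output shape.
    static String formatWord(String word, List<Integer> counts) {
        StringBuilder sb = new StringBuilder("Word: " + word + "\n");
        for (int i = 0; i < counts.size(); i++) {
            sb.append("Sentence ").append(i + 1).append(": ")
              .append(counts.get(i)).append(" times\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // A small stand-in for the res map built above (word -> per-sentence counts).
        Map<String, List<Integer>> res = new LinkedHashMap<>();
        res.put("hurricane", Arrays.asList(1, 1, 0, 0));
        res.put("gilbert", Arrays.asList(0, 2, 1, 0));
        res.forEach((word, counts) -> System.out.println(formatWord(word, counts)));
    }
}
```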

EDIT: In answer to your comment.

  List<Integer> getSentence(int sentence, Map<String, ArrayList<Integer>> map) {
     return map.entrySet().stream().map(e -> e.getValue().get(sentence)).collect(Collectors.toList());
  }

Then you can call:

List<Integer> sentence0List = getSentence(0, res);

However, be aware that this approach is not optimal, since it runs in O(K) time, with K being the number of distinct words (getSentence iterates over every map entry). For small K it is totally fine, but it does not scale. You have to clarify what you will do with the result. If you need to call getSentence many times, this is not the correct approach; in that case you will need the data structured differently. Something like:

Sentences = [
         {'word1': N, 'word2': N}, ... // sentence 1
         {'word1': N, 'word2': N}, ... // sentence 2
]

This way you can easily access the word count for each sentence.
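
A minimal sketch of that alternative layout in Java, one count map per sentence (the method name `countPerSentence` and the sample data are illustrative):

```java
import java.util.*;

public class PerSentenceCounts {
    // Build one word -> count map per sentence.
    static List<Map<String, Integer>> countPerSentence(List<List<String>> sentences) {
        List<Map<String, Integer>> result = new ArrayList<>();
        for (List<String> sentence : sentences) {
            Map<String, Integer> counts = new HashMap<>();
            for (String word : sentence) {
                counts.merge(word, 1, Integer::sum); // increment, starting at 1
            }
            result.add(counts);
        }
        return result;
    }

    public static void main(String[] args) {
        List<List<String>> sentences = Arrays.asList(
                Arrays.asList("hurricane", "gilbert", "head"),
                Arrays.asList("storm", "wind", "wind"));
        List<Map<String, Integer>> perSentence = countPerSentence(sentences);
        System.out.println(perSentence.get(1).get("wind")); // prints 2
    }
}
```

With this layout, looking up a word's count for one sentence is a single map access instead of a scan over all words.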

EDIT 2: Call this method:

  Map<String, Float> getFrequency(Map<String, ArrayList<Integer>> stringMap) {
    Map<String, Float> res = new HashMap<>();
    stringMap.entrySet().stream().forEach(e -> res.put(e.getKey()
                , e.getValue().stream().mapToInt(Integer::intValue).sum() / (float)e.getValue().size()));
    return res;
  }

It will return something like this:

{standard=0.25, but=0.25, industry's=0.25, been=0.25, 1500s=0.25, software=0.25, release=0.25, type=0.5, when=0.25, dummy=0.5, Aldus=0.25, only=0.25, passages=0.25, text=0.5, has=0.5, 1960s=0.25, Ipsum=1.0, five=0.25, publishing=0.25, took=0.25, centuries=0.25, including=0.25, in=0.25, like=0.25, containing=0.25, printer=0.25, is=0.25, t

You could solve your problem by first creating an index of each word. You could use a HashMap and put all the distinct words from your text on it (so there would be no need to check for double occurrences).

Then you can iterate over the HashMap and check for every word in every sentence. You can count occurrences by using the indexOf method of your list: as long as it returns a value greater than -1, you can count up another occurrence in the sentence. This method only returns the first occurrence, so you have to keep searching the remainder of the list until no more matches are found.

Some pseudocode would be like:

Array sentences = text.split(sentence delimiter)

for each word in text
    put word on hashmap

for each entry in hashmap
   for each sentence
       int count = 0
       while subList(count, sentence.length) indexOf(entry) > -1
          count for entry ++
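
A rough Java translation of the pseudocode above, kept deliberately close to the indexOf/subList idea (so, as noted below, still not performance oriented; the class and method names are illustrative):

```java
import java.util.*;

public class IndexOfCount {
    // Count occurrences of word in sentence using repeated indexOf on sub-lists.
    static int countIn(List<String> sentence, String word) {
        int count = 0;
        int from = 0;
        // indexOf only finds the first match, so keep searching the rest of the list
        while (true) {
            int idx = sentence.subList(from, sentence.size()).indexOf(word);
            if (idx < 0) break;
            count++;
            from += idx + 1;
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "hurricane gilbert head. hurricane sweep coast.";
        List<List<String>> sentences = new ArrayList<>();
        Set<String> words = new LinkedHashSet<>(); // the set removes double occurrences
        for (String s : text.split("[.?!]\\s*")) {
            List<String> tokens = Arrays.asList(s.split("[ ,;:]+"));
            sentences.add(tokens);
            words.addAll(tokens);
        }
        for (String word : words) {
            System.out.println("Word: " + word);
            for (int i = 0; i < sentences.size(); i++) {
                System.out.println("Sentence " + (i + 1) + ": "
                        + countIn(sentences.get(i), word) + " times");
            }
        }
    }
}
```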

Note that this is very greedy and not performance oriented at all. Oh yeah, also note that there are some Java NLP libraries out there which may have already solved your problem in a performance-oriented and reusable way.

First you can segment your sentences and then tokenize them using a text segmenter such as NLTK or the Stanford tokenizer. Splitting the string (containing sentences) around "[.?!]" is not a good idea. What happens to an "etc." or an "e.g." that occurs in the middle of a sentence? Splitting a sentence around "[ ,;:]" is also not a good idea: a sentence can contain plenty of other symbols, such as quotation marks, dashes and so on.

After segmentation and tokenization you can split your sentences around spaces and store them in a List<List<String>>:

List<List<String>> sentenceList = new ArrayList<>();

Then for your index you can create a HashMap<String, List<Integer>>:

HashMap<String, List<Integer>> words = new HashMap<>();

The keys are all the words in all sentences. The values can be updated as follows:

for (int i = 0; i < sentenceList.size(); i++) {
    for (String w : words.keySet()) {
        if (sentenceList.get(i).contains(w)) {
            List<Integer> tmp = words.get(w);
            tmp.set(i, tmp.get(i) + 1); // tmp.get(i)++ is not valid Java
        }
    }
}

This solution has a time complexity of O(number_of_sentences * number_of_words), which is equivalent to O(n^2). An optimized solution is:

for (int i = 0; i < sentenceList.size(); i++) {
    for (String w : sentenceList.get(i)) {
        List<Integer> tmp = words.get(w);
        tmp.set(i, tmp.get(i) + 1);
    }
}

This has a time complexity of O(number_of_sentences * average_length_of_sentences). Since average_length_of_sentences is usually small, this is equivalent to O(n).
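
Putting the pieces together, here is a complete, runnable sketch of this optimized approach; note that each word's count list has to be initialized with zeros before the counting pass, a step the snippets above leave implicit (the class and method names are illustrative):

```java
import java.util.*;

public class WordIndex {
    // word -> per-sentence counts, touching each token a constant number of times.
    static Map<String, List<Integer>> buildIndex(List<List<String>> sentenceList) {
        int n = sentenceList.size();
        Map<String, List<Integer>> words = new HashMap<>();
        // initialize every word with a zero count per sentence
        for (List<String> sentence : sentenceList) {
            for (String w : sentence) {
                words.putIfAbsent(w, new ArrayList<>(Collections.nCopies(n, 0)));
            }
        }
        // single counting pass over the tokens
        for (int i = 0; i < n; i++) {
            for (String w : sentenceList.get(i)) {
                List<Integer> counts = words.get(w);
                counts.set(i, counts.get(i) + 1);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> index = buildIndex(Arrays.asList(
                Arrays.asList("hurricane", "gilbert", "head"),
                Arrays.asList("hurricane", "sweep", "coast")));
        System.out.println(index.get("hurricane")); // prints [1, 1]
    }
}
```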
