
Word co-occurrence in sentences

I have a large set of sentences (10,000) in a file, one sentence per line. Across the entire set, I want to find out which words occur together in a sentence, and how often.

Sample sentences:

"Proposal 201 has been accepted by the Chief today.",
"Proposal 214 and 221 are accepted, as per recent Chief decision",
"This proposal has been accepted by the Chief.",
"Both proposal 3 MazerNo and patch 4 have been accepted by the Chief.",
"Proposal 214, ValueMania, has been accepted by the Chief."

I would like to produce the following output. I should be able to pass three starting words to the program as parameters, for example "Chief, accepted, Proposal":

Chief accepted Proposal            5
Chief accepted Proposal has        3
Chief accepted Proposal has been   3

... 
...
for all combinations.

I understand that the number of combinations might be huge.

I have searched online but could not find anything. I have written some code but can't get my head around the problem. Maybe someone who knows the domain can help.

ReadFileLinesIntoArray rf = new ReadFileLinesIntoArray();

try {
    String[] tmp = rf.readFromFile("c:/scripts/SelectedSentences.txt");
    for (String t : tmp) {
        String[] keys = t.split(" ");
        String[] uniqueKeys;
        int count = 0;
        System.out.println(t);
        uniqueKeys = getUniqueKeys(keys);
        for (String key : uniqueKeys) {
            // Unused trailing slots of uniqueKeys are null, so stop at the first one.
            if (null == key) {
                break;
            }
            for (String s : keys) {
                if (key.equals(s)) {
                    count++;
                }
            }
            System.out.println("Count of [" + key + "] is : " + count);
            count = 0;
        }
    }
} catch (IOException e) {
    // TODO Auto-generated catch block
    e.printStackTrace();
}

private static String[] getUniqueKeys(String[] keys) {
    // Holds the distinct keys in first-seen order; unused trailing slots stay null.
    String[] uniqueKeys = new String[keys.length];

    uniqueKeys[0] = keys[0];
    int uniqueKeyIndex = 1;

    for (int i = 1; i < keys.length; i++) {
        boolean keyAlreadyExists = false;
        // Compare only against the slots filled so far.
        for (int j = 0; j < uniqueKeyIndex; j++) {
            if (keys[i].equals(uniqueKeys[j])) {
                keyAlreadyExists = true;
                break;
            }
        }

        if (!keyAlreadyExists) {
            uniqueKeys[uniqueKeyIndex] = keys[i];
            uniqueKeyIndex++;
        }
    }
    return uniqueKeys;
}

Could someone help me code this, please?

You can apply standard information retrieval data structures, particularly an inverted index. Here is how to do it.

Consider your original sentences. Number them with an integer identifier, like so:

  1. "Proposal 201 has been accepted by the Chief today."
  2. "Proposal 214 and 221 are accepted, as per recent Chief decision"
  3. "This proposal has been accepted by the Chief."
  4. "Both proposal 3 MazerNo and patch 4 have been accepted by the Chief."
  5. "Proposal 214, ValueMania, has been accepted by the Chief."

For every pair of distinct words that you encounter in a sentence, add an entry to an inverted index that maps the pair to a set (a group of unique items) of sentence identifiers. A sentence with N distinct words yields N-choose-2 pairs, that is, N(N-1)/2.

The appropriate Java data structure is Map<String, Map<String, Set<Integer>>>. Order each pair alphabetically (ignoring case) so that the pair "has" and "Proposal" occurs only as ("has", "Proposal") and never as ("Proposal", "has").

This map will contain the following:

"has", "Proposal" --> Set(1, 5)
"accepted", "Proposal" --> Set(1, 2, 5)
"accepted", "has" --> Set(1, 3, 5)
etc.

For example, the word pair "has" and "Proposal" maps to the set (1, 5), meaning that both words were found in sentences 1 and 5.
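One way the index construction described above might look in Java is sketched below. The class name PairIndexSketch, the helpers tokenize, build and lookup, and the tokenization rule (split on anything that is not a letter or digit, keep case as-is) are all assumptions made for illustration; the answer itself does not prescribe them.

```java
import java.util.*;

class PairIndexSketch {
    // Case-insensitive alphabetical order with an exact-match tie-break, so that
    // "has" sorts before "Proposal" while "proposal" and "Proposal" stay distinct.
    static final Comparator<String> ORDER =
            String.CASE_INSENSITIVE_ORDER.thenComparing(Comparator.<String>naturalOrder());

    static final List<String> SENTENCES = Arrays.asList(
            "Proposal 201 has been accepted by the Chief today.",
            "Proposal 214 and 221 are accepted, as per recent Chief decision",
            "This proposal has been accepted by the Chief.",
            "Both proposal 3 MazerNo and patch 4 have been accepted by the Chief.",
            "Proposal 214, ValueMania, has been accepted by the Chief.");

    // Distinct words of one sentence, sorted by ORDER.
    // Assumption: punctuation is dropped and case is preserved.
    static List<String> tokenize(String sentence) {
        Set<String> words = new TreeSet<>(ORDER);
        for (String w : sentence.split("[^A-Za-z0-9]+")) {
            if (!w.isEmpty()) {
                words.add(w);
            }
        }
        return new ArrayList<>(words);
    }

    // Maps each ordered word pair to the ids (1-based) of the sentences containing both.
    static Map<String, Map<String, Set<Integer>>> build(List<String> sentences) {
        Map<String, Map<String, Set<Integer>>> index = new HashMap<>();
        for (int id = 1; id <= sentences.size(); id++) {
            List<String> words = tokenize(sentences.get(id - 1));
            for (int i = 0; i < words.size(); i++) {
                for (int j = i + 1; j < words.size(); j++) {
                    index.computeIfAbsent(words.get(i), k -> new HashMap<>())
                         .computeIfAbsent(words.get(j), k -> new TreeSet<>())
                         .add(id);
                }
            }
        }
        return index;
    }

    // Looks up a pair regardless of the order in which the two words are given.
    static Set<Integer> lookup(Map<String, Map<String, Set<Integer>>> index, String a, String b) {
        boolean inOrder = ORDER.compare(a, b) <= 0;
        return index.getOrDefault(inOrder ? a : b, Collections.emptyMap())
                    .getOrDefault(inOrder ? b : a, Collections.emptySet());
    }

    public static void main(String[] args) {
        Map<String, Map<String, Set<Integer>>> index = build(SENTENCES);
        System.out.println(lookup(index, "has", "Proposal"));      // [1, 5]
        System.out.println(lookup(index, "accepted", "Proposal")); // [1, 2, 5]
        System.out.println(lookup(index, "accepted", "has"));      // [1, 3, 5]
    }
}
```

Because case is preserved in this sketch, "proposal" in sentences 3 and 4 does not count as "Proposal", which is what the sets listed above imply.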

Now suppose you want to look up the number of co-occurrences of the words "accepted", "has", and "Proposal". Generate all pairs from this list and intersect their respective sets (using Java's Set.retainAll()). The result here is the set (1, 5). Its size is 2, meaning there are two sentences that contain "accepted", "has", and "Proposal".
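The lookup step can be sketched roughly as follows. The names CooccurrenceQuerySketch and sentencesContaining are hypothetical, and the tokenization (split on non-alphanumeric characters, case preserved) is an assumption. Note that Set.retainAll() modifies the set it is called on, so the sketch copies the first set instead of intersecting directly into the index.

```java
import java.util.*;

class CooccurrenceQuerySketch {
    // Case-insensitive order with an exact-match tie-break (assumed pair ordering).
    static final Comparator<String> ORDER =
            String.CASE_INSENSITIVE_ORDER.thenComparing(Comparator.<String>naturalOrder());

    static final List<String> SENTENCES = Arrays.asList(
            "Proposal 201 has been accepted by the Chief today.",
            "Proposal 214 and 221 are accepted, as per recent Chief decision",
            "This proposal has been accepted by the Chief.",
            "Both proposal 3 MazerNo and patch 4 have been accepted by the Chief.",
            "Proposal 214, ValueMania, has been accepted by the Chief.");

    // Pair index: (first word, second word) -> ids of sentences containing both.
    static Map<String, Map<String, Set<Integer>>> build(List<String> sentences) {
        Map<String, Map<String, Set<Integer>>> index = new HashMap<>();
        for (int id = 1; id <= sentences.size(); id++) {
            Set<String> distinct = new TreeSet<>(ORDER);
            for (String w : sentences.get(id - 1).split("[^A-Za-z0-9]+")) {
                if (!w.isEmpty()) {
                    distinct.add(w);
                }
            }
            List<String> words = new ArrayList<>(distinct);
            for (int i = 0; i < words.size(); i++) {
                for (int j = i + 1; j < words.size(); j++) {
                    index.computeIfAbsent(words.get(i), k -> new HashMap<>())
                         .computeIfAbsent(words.get(j), k -> new TreeSet<>())
                         .add(id);
                }
            }
        }
        return index;
    }

    // Intersects the id sets of every pair drawn from the query words.
    // Assumption: the query holds at least two distinct words.
    static Set<Integer> sentencesContaining(Map<String, Map<String, Set<Integer>>> index,
                                            List<String> query) {
        List<String> q = new ArrayList<>(query);
        q.sort(ORDER);
        Set<Integer> result = null;
        for (int i = 0; i < q.size(); i++) {
            for (int j = i + 1; j < q.size(); j++) {
                Set<Integer> ids = index.getOrDefault(q.get(i), Collections.emptyMap())
                                        .getOrDefault(q.get(j), Collections.emptySet());
                if (result == null) {
                    result = new TreeSet<Integer>(ids); // copy, so the index is not mutated
                } else {
                    result.retainAll(ids);
                }
            }
        }
        return result == null ? new TreeSet<Integer>() : result;
    }

    public static void main(String[] args) {
        Map<String, Map<String, Set<Integer>>> index = build(SENTENCES);
        Set<Integer> ids = sentencesContaining(index, Arrays.asList("accepted", "has", "Proposal"));
        System.out.println(ids + " -> " + ids.size()); // [1, 5] -> 2
    }
}
```

With case-sensitive tokens as in this sketch, the query "Chief", "accepted", "Proposal" matches sentences 1, 2 and 5. Getting the count of 5 shown in the question would require lowercasing the tokens and the query words.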

To generate all pairs, simply iterate through your map as needed. To generate all word tuples of size N, you will need to iterate and use recursion as needed.
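A possible shape for the recursive step is sketched below: starting from a seed of words, it repeatedly appends one more vocabulary word, narrows the sentence-id set by intersection, and records the count for each tuple that still matches at least one sentence. The names TupleEnumerationSketch, enumerate and extend, the maxSize cut-off, and the tokenization are assumptions for illustration only, and the seed is assumed to hold at least two words.

```java
import java.util.*;

class TupleEnumerationSketch {
    static final Comparator<String> ORDER =
            String.CASE_INSENSITIVE_ORDER.thenComparing(Comparator.<String>naturalOrder());

    static final List<String> SENTENCES = Arrays.asList(
            "Proposal 201 has been accepted by the Chief today.",
            "Proposal 214 and 221 are accepted, as per recent Chief decision",
            "This proposal has been accepted by the Chief.",
            "Both proposal 3 MazerNo and patch 4 have been accepted by the Chief.",
            "Proposal 214, ValueMania, has been accepted by the Chief.");

    // Pair index: (first word, second word) -> ids of sentences containing both.
    static Map<String, Map<String, Set<Integer>>> build(List<String> sentences) {
        Map<String, Map<String, Set<Integer>>> index = new HashMap<>();
        for (int id = 1; id <= sentences.size(); id++) {
            Set<String> distinct = new TreeSet<>(ORDER);
            for (String w : sentences.get(id - 1).split("[^A-Za-z0-9]+")) {
                if (!w.isEmpty()) distinct.add(w);
            }
            List<String> words = new ArrayList<>(distinct);
            for (int i = 0; i < words.size(); i++) {
                for (int j = i + 1; j < words.size(); j++) {
                    index.computeIfAbsent(words.get(i), k -> new HashMap<>())
                         .computeIfAbsent(words.get(j), k -> new TreeSet<>())
                         .add(id);
                }
            }
        }
        return index;
    }

    static Set<Integer> pair(Map<String, Map<String, Set<Integer>>> index, String a, String b) {
        boolean inOrder = ORDER.compare(a, b) <= 0;
        return index.getOrDefault(inOrder ? a : b, Collections.emptyMap())
                    .getOrDefault(inOrder ? b : a, Collections.emptySet());
    }

    // Records the current tuple, then tries to append each later vocabulary word.
    // `ids` holds the sentences containing every word already in `tuple`.
    static void extend(Map<String, Map<String, Set<Integer>>> index, List<String> vocab,
                       List<String> tuple, Set<Integer> ids, int start, int maxSize,
                       Map<List<String>, Integer> out) {
        out.put(new ArrayList<>(tuple), ids.size());
        if (tuple.size() >= maxSize) return;
        for (int i = start; i < vocab.size(); i++) {
            String w = vocab.get(i);
            if (tuple.contains(w)) continue;
            Set<Integer> next = new TreeSet<Integer>(ids);
            next.retainAll(pair(index, tuple.get(0), w)); // sentences that also contain w
            if (next.isEmpty()) continue;                 // prune tuples that never co-occur
            tuple.add(w);
            extend(index, vocab, tuple, next, i + 1, maxSize, out);
            tuple.remove(tuple.size() - 1);
        }
    }

    // Counts for the seed and for every extension of it up to maxSize words.
    static Map<List<String>, Integer> enumerate(List<String> sentences, List<String> seed, int maxSize) {
        Map<String, Map<String, Set<Integer>>> index = build(sentences);
        Set<String> vocab = new TreeSet<>(ORDER);
        for (Map.Entry<String, Map<String, Set<Integer>>> e : index.entrySet()) {
            vocab.add(e.getKey());
            vocab.addAll(e.getValue().keySet());
        }
        Set<Integer> ids = null; // intersection over all pairs of seed words
        for (int i = 0; i < seed.size(); i++) {
            for (int j = i + 1; j < seed.size(); j++) {
                Set<Integer> p = pair(index, seed.get(i), seed.get(j));
                if (ids == null) ids = new TreeSet<Integer>(p); else ids.retainAll(p);
            }
        }
        if (ids == null) ids = new TreeSet<Integer>();
        Map<List<String>, Integer> out = new LinkedHashMap<>();
        extend(index, new ArrayList<>(vocab), new ArrayList<>(seed), ids, 0, maxSize, out);
        return out;
    }

    public static void main(String[] args) {
        enumerate(SENTENCES, Arrays.asList("accepted", "has", "Proposal"), 5)
                .forEach((tuple, count) -> System.out.println(String.join(" ", tuple) + "  " + count));
    }
}
```

Pruning on an empty intersection keeps the recursion from exploring tuples that no sentence contains, which matters because the number of combinations grows quickly with the tuple size.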
