
Counting the number of times a unique data double appears in double list python 3

Say I have a double list in Python, [[],[]]:

doublelist = [["all", "the", "big", "dogs", "eat", "chicken", "all", "the", "small", "kids", "eat", "paste"], 
              ["the", "big", "dogs", "eat", "chicken", "all", "the", "small", "kids", "eat", "paste", "lumps"]]

I want to count how many times the pair doublelist[0][i] & doublelist[1][i] = all, the appears in the double list, with the second [] being the index.

For example, you see one occurrence of it at doublelist[0][0] doublelist[1][0] and another at doublelist[0][6] doublelist[1][6].

What code would I use in Python 3 to iterate through doublelist[i][i], grab each value set, e.g. [["all"],["the"]], and also get an integer value for how many times that value set exists in the list?

Ideally I'd like to output it to a triple list triplelist[[i],[i],[i]] that contains the [i][i] value pair and the integer count in the third [i].

Example code:

for i in range(len(triplelist[0])):
    print(triplelist[0][i])
    print(triplelist[1][i])
    print(triplelist[2][i])

Output:

>"all"
>"the"
>2
>"the"
>"big"
>1
>"big"
>"dogs"
>1

etc...

Also, it would preferably skip duplicates, so there wouldn't be 2 indexes in the list where [i][i][i] = [[all],[the],[2]], since there are 2 instances of that pair in the original list ([0][0] [1][0] & [0][6] [1][6]). I just want all unique dual sets of words and the number of times they appear in the original text.
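A minimal sketch of building that triple-list structure from pair counts, assuming the doublelist shown above (the name pair_counts is mine, not from the question's code):

```python
from collections import Counter

doublelist = [["all", "the", "big", "dogs", "eat", "chicken",
               "all", "the", "small", "kids", "eat", "paste"],
              ["the", "big", "dogs", "eat", "chicken", "all",
               "the", "small", "kids", "eat", "paste", "lumps"]]

# Count each (first word, following word) column pair once;
# Counter keys are unique, so duplicates collapse automatically.
pair_counts = Counter(zip(doublelist[0], doublelist[1]))

# Rebuild the requested triple list: unique pairs plus their counts.
triplelist = [[], [], []]
for (word1, word2), count in pair_counts.items():
    triplelist[0].append(word1)
    triplelist[1].append(word2)
    triplelist[2].append(count)
```

Because the Counter deduplicates the pairs, ("all", "the") appears only once in the result, with a count of 2.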

The purpose of the code is to see how often one word follows another word in a given text. It's essentially for building a smart Markov Chain Generator that weights word values. For this purpose, I already have the code to break the text into a dual list that contains each word in the first list and the following word in the second list.
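As a side note, once those pair counts exist, the weighted next-word choice for such a generator can be sketched like this (a minimal illustration, not the asker's code; random.choices weights the candidates by their observed counts):

```python
import random
from collections import Counter

words = ["all", "the", "big", "dogs", "eat", "chicken",
         "all", "the", "small", "kids", "eat", "paste", "lumps"]

# Count how often each word is followed by each other word.
pair_counts = Counter(zip(words, words[1:]))

def next_word(current):
    """Pick a follower of `current`, weighted by observed frequency."""
    candidates = [(w2, n) for (w1, w2), n in pair_counts.items() if w1 == current]
    followers, weights = zip(*candidates)
    return random.choices(followers, weights=weights)[0]

print(next_word("eat"))  # "chicken" or "paste", each observed once
```

Here "all" is always followed by "the" in the sample, so next_word("all") is deterministic, while next_word("eat") picks between its two followers with equal weight.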

Here is my current code for reference (the problem is that after I initialize wordlisttriple, I don't know how to make it do what I described above):

#import
import re #for regex expression below

#main
with open("text.txt") as rawdata:    #open the text file and create a datastream
    rawtext = rawdata.read()    #read through the stream and create a string containing the text
#the with statement closes the file automatically, so no explicit close() call is needed
rawtext = rawtext.replace('\n', ' ')    #remove newline characters from text
rawtext = rawtext.replace('\r', ' ')    #remove newline characters from text
rawtext = rawtext.replace('--', ' -- ')    #break up blah--blah words so it can read 2 separate words blah -- blah
pat = re.compile(r'([A-Z][^\.!?]*[\.!?])', re.M)    #regex pattern for grabbing everything before a sentence ending punctuation
sentencelist = []    #initialize list for sentences in text
sentencelist = pat.findall(rawtext)    #apply regex pattern to string to create a list of all the sentences in the text
firstwordlist = []    #initialize the list for the first word in each sentence
for index, firstword in enumerate(sentencelist):    #enumerate through the sentence list
    sentenceindex = int(index)    #get the index for below operation
    firstword = sentencelist[sentenceindex].split(' ')[0]    #use split to only grab the first word in each sentence
    firstwordlist.append(firstword)    #append each sentence starting word to first word list
rawtext = rawtext.replace(', ', ' , ')    #break up punctuation so they are not considered part of words
rawtext = rawtext.replace('. ', ' . ')    #break up punctuation so they are not considered part of words
rawtext = rawtext.replace('"', ' " ')    #break up punctuation so they are not considered part of words
sentencelistforwords = []    #initialize sentence list for parsing words
sentencelistforwords = pat.findall(rawtext)    #run the regex pattern again this time with the punctuation broken up by spaces
wordsinsentencelist = []    #initialize list for all of the words that appear in each sentence
for index, words in enumerate(sentencelist):    #enumerate through sentence list
    sentenceindex = int(index)    #grab the index for below operation
    words = sentencelist[sentenceindex].split(' ')    #split up the words in each sentence so we have a nested lists that contain each word in each sentence
    wordsinsentencelist.append(words)    #append above described to the list
wordlist = []    #initialize list of all words
wordlist = rawtext.split(' ')    #create list of all words by splitting the entire text by spaces
wordlist = list(filter(None, wordlist))    #use filter to get rid of empty strings in the list
wordlistdouble = [[], []]    #initialize the word list double to contain words and the words that follow them in sentences
for index, word in enumerate(wordlist):    #enumerate through word list
    if(int(index) < int(len(wordlist))-1):    #only go to 1 before the end of list so we don't get an index out of bounds error
        wordlistindex1 = int(index)    #grab index for first word
        wordlistindex2 = int(index)+1    #grab index for following word
        wordlistdouble[0].append(wordlist[wordlistindex1])    #append first word to first list of word list double
        wordlistdouble[1].append(wordlist[wordlistindex2])    #append following word to second list of word list double
wordlisttriple = [[], [], []]    #initialize word list triple
for index, unit in enumerate(wordlistdouble[0]):    #enumerate through word list double
    word1 = wordlistdouble[0][index]    #grab word at first list of word list double at the current index
    word2 = wordlistdouble[1][index]    #grab word at second list of word list double at the current index
    count = 0    #initialize word double data set counter
    wordlisttriple[0].append(word1)    #these appends need to be wrapped in some kind of loop/if so duplicates are skipped
    wordlisttriple[1].append(word2)
    wordlisttriple[2].append(count)
    #for index2, unit1 in enumerate(wordlistdouble[0]):
        #if wordlistdouble[0][index2] == word1 and wordlistdouble[1][index2] == word2:
            #count += 1

#sentencelist = list of all sentences
#firstwordlist = list of words that start sentencelist
#sentencelistforwords = list of all sentences mutated for ease of extracting words
#wordsinsentencelist = list of lists containing all of the words in each sentence
#wordlist = list of all words
#wordlistdouble = dual list of all words plus the words that follow them

Any advice would be greatly appreciated. If I'm going about this the wrong way and there is an easier method to accomplish the same thing, that would also be amazing. Thank you!

Assuming you already have the text parsed to a list of words, you can just create an iterator that starts from the second word, zip it with the words, and run it through Counter:

from collections import Counter

words = ["all", "the", "big", "dogs", "eat", "chicken", "all", "the", "small", "kids", "eat", "paste", "lumps"]
nxt = iter(words)
next(nxt, None)

print(*Counter(zip(words, nxt)).items(), sep='\n')

Output:

(('big', 'dogs'), 1)
(('kids', 'eat'), 1)
(('small', 'kids'), 1)
(('the', 'big'), 1)
(('dogs', 'eat'), 1)
(('eat', 'paste'), 1)
(('all', 'the'), 2)
(('chicken', 'all'), 1)
(('paste', 'lumps'), 1)
(('eat', 'chicken'), 1)
(('the', 'small'), 1)

In the above, nxt is an iterator that iterates over the word list. Since we want it to start from the second word, we pull one word out with next before using it:

>>> nxt = iter(words)
>>> next(nxt)
'all'
>>> list(nxt)
['the', 'big', 'dogs', 'eat', 'chicken', 'all', 'the', 'small', 'kids', 'eat', 'paste', 'lumps']

Then we pass the original list and the iterator to zip, which returns an iterable of tuples where each tuple has one item from each:

>>> list(zip(words, nxt))
[('all', 'the'), ('the', 'big'), ('big', 'dogs'), ('dogs', 'eat'), ('eat', 'chicken'), ('chicken', 'all'), ('all', 'the'), ('the', 'small'), ('small', 'kids'), ('kids', 'eat'), ('eat', 'paste'), ('paste', 'lumps')]

Finally, the output from zip is passed to Counter, which counts each pair and returns a dict-like object where keys are pairs and values are counts:

>>> Counter(zip(words, nxt))
Counter({('all', 'the'): 2, ('eat', 'chicken'): 1, ('big', 'dogs'): 1, ('small', 'kids'): 1, ('kids', 'eat'): 1, ('paste', 'lumps'): 1, ('chicken', 'all'): 1, ('dogs', 'eat'): 1, ('the', 'big'): 1, ('the', 'small'): 1, ('eat', 'paste'): 1})
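As a small addition (not part of the original answer): if you want the pairs sorted by frequency, Counter.most_common does that directly:

```python
from collections import Counter

words = ["all", "the", "big", "dogs", "eat", "chicken",
         "all", "the", "small", "kids", "eat", "paste", "lumps"]
counts = Counter(zip(words, words[1:]))

# most_common() sorts pairs by descending count;
# the top pair here is ('all', 'the') with count 2, ties follow in arbitrary order.
for pair, count in counts.most_common(3):
    print(pair, count)
```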

So, originally I was going to go with a straightforward approach to generating ngrams:

>>> from collections import Counter
>>> from itertools import chain, islice
>>> from pprint import pprint
>>> def ngram_generator(token_sequence, order):
...     for i in range(len(token_sequence) + 1 - order):
...         yield tuple(token_sequence[i: i + order])
...
>>> counts = Counter(chain.from_iterable(ngram_generator(sub, 2) for sub in doublelist))
>>> pprint(counts)
Counter({('all', 'the'): 3,
         ('the', 'big'): 2,
         ('chicken', 'all'): 2,
         ('eat', 'paste'): 2,
         ('the', 'small'): 2,
         ('kids', 'eat'): 2,
         ('dogs', 'eat'): 2,
         ('eat', 'chicken'): 2,
         ('small', 'kids'): 2,
         ('big', 'dogs'): 2,
         ('paste', 'lumps'): 1})

But I got inspired by niemmi to write what seems like a more efficient approach, which is, again, generalizable to higher-order ngrams:

>>> def efficient_ngrams(tokens_sequence, n):
...     iterators = []
...     for i in range(n):
...         it = iter(tokens_sequence)
...         tuple(islice(it, 0, i))
...         iterators.append(it)
...     yield from zip(*iterators)
...

So, observe:

>>> pprint(list(efficient_ngrams(doublelist[0], 1)))
[('all',),
 ('the',),
 ('big',),
 ('dogs',),
 ('eat',),
 ('chicken',),
 ('all',),
 ('the',),
 ('small',),
 ('kids',),
 ('eat',),
 ('paste',)]
>>> pprint(list(efficient_ngrams(doublelist[0], 2)))
[('all', 'the'),
 ('the', 'big'),
 ('big', 'dogs'),
 ('dogs', 'eat'),
 ('eat', 'chicken'),
 ('chicken', 'all'),
 ('all', 'the'),
 ('the', 'small'),
 ('small', 'kids'),
 ('kids', 'eat'),
 ('eat', 'paste')]
>>> pprint(list(efficient_ngrams(doublelist[0], 3)))
[('all', 'the', 'big'),
 ('the', 'big', 'dogs'),
 ('big', 'dogs', 'eat'),
 ('dogs', 'eat', 'chicken'),
 ('eat', 'chicken', 'all'),
 ('chicken', 'all', 'the'),
 ('all', 'the', 'small'),
 ('the', 'small', 'kids'),
 ('small', 'kids', 'eat'),
 ('kids', 'eat', 'paste')]
>>>

And of course, it still works for what you want to accomplish:

>>> counts = Counter(chain.from_iterable(efficient_ngrams(sub, 2) for sub in doublelist))
>>> pprint(counts)
Counter({('all', 'the'): 3,
         ('the', 'big'): 2,
         ('chicken', 'all'): 2,
         ('eat', 'paste'): 2,
         ('the', 'small'): 2,
         ('kids', 'eat'): 2,
         ('dogs', 'eat'): 2,
         ('eat', 'chicken'): 2,
         ('small', 'kids'): 2,
         ('big', 'dogs'): 2,
         ('paste', 'lumps'): 1})
>>>

If you are looking for only the words all and the, this could be helpful to you.

Code:

from collections import Counter
doublelist = [["all", "the", "big", "dogs", "eat", "chicken", "all", "the", "small", "kids", "eat", "paste"], ["the", "big", "dogs", "eat", "chicken", "all", "the", "small", "kids", "eat", "paste", "lumps"]]
for i in range(len(doublelist)):
    count = Counter(doublelist[i])
    print("List {} - all = {},the = {}".format(i, count['all'], count['the']))

Output:

List 0 - all = 2,the = 2
List 1 - all = 1,the = 2
