
Counting word pairs from a text file [Python]

Given a text file with the following content:

Lemonade juice whiskey beer soda vodka

In Python, reading that same .txt file, I would like to output word pairs in the following order:

  • juice-lemonade
  • whiskey-juice
  • beer-whiskey
  • soda-beer
  • vodka-soda

I managed to output something like that by using a list instead of opening a file in Python, but with a larger .txt file that is not really a practical solution. Also, a bonus task would be to output the probability of each of those pairs. Any kind of hint would be highly appreciated.

To read large files efficiently, you should read them line-by-line, or (if you have really long lines, which is what the snippet below assumes) token-by-token.

A clean way to do this while keeping an open handle on the file is to use a generator that yields one word at a time.

You can have another generator that combines two words at a time and yields pairs.

from typing import Iterator

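def memory_efficient_word_generator(text_file: str) -> Iterator[str]:
    """Yield lowercased words one at a time, reading one character at a time."""
    word = ''
    with open(text_file) as text:
        while True:
            character = text.read(1)
            if not character:
                # End of file: emit the last word if there is no trailing whitespace.
                if word:
                    yield word.lower()
                return
            if character.isspace():
                # Skip the empty tokens produced by consecutive whitespace.
                if word:
                    yield word.lower()
                word = ''
            else:
                word += character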


def pair_generator(text_file: str) -> Iterator[str]:
    previous_word = ''
    for word in memory_efficient_word_generator(text_file):
        if previous_word and word:
            yield f'{previous_word}-{word}'
        previous_word = word or previous_word


for pair in pair_generator('filename.txt'):
    print(pair)

Assuming filename.txt contains:

Lemonade juice whiskey beer soda vodka

cola tequila lemonade juice

You should see something like:

lemonade-juice
juice-whiskey
whiskey-beer
beer-soda
soda-vodka
vodka-cola
cola-tequila
tequila-lemonade
lemonade-juice

Of course, there's a lot more you may need to handle depending on your desired behaviour (for example, non-alphabetic characters in your input).

Thank you very much for the feedback. That's pretty much it; I just added encoding='utf-8' here:
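
As one possible way to deal with punctuation, a line-by-line variant could extract words with a regular expression instead of splitting on whitespace. This is only a sketch: the names `words_from_lines` and `pairs_from_words` and the pattern `[a-z']+` are assumptions made for illustration, and the pattern only recognises ASCII letters and apostrophes.

```python
import re
from typing import Iterable, Iterator

# Assumed definition of a "word": a run of ASCII letters or apostrophes.
WORD_PATTERN = re.compile(r"[a-z']+")


def words_from_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield lowercased words from an iterable of lines (e.g. an open file)."""
    for line in lines:
        for match in WORD_PATTERN.finditer(line.lower()):
            yield match.group()


def pairs_from_words(words: Iterable[str]) -> Iterator[str]:
    """Yield 'previous-current' for each pair of consecutive words."""
    previous_word = None
    for word in words:
        if previous_word is not None:
            yield f'{previous_word}-{word}'
        previous_word = word


# In-memory sample standing in for the lines of a real file.
sample_lines = ['Lemonade, juice; whiskey!\n', 'beer (soda)\n']
for pair in pairs_from_words(words_from_lines(sample_lines)):
    print(pair)
```

Iterating over an open file object yields one line at a time, so a file opened with `open('filename.txt', encoding='utf-8')` can be passed to `words_from_lines` in place of `sample_lines`.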

with open(text_file, encoding='utf-8') as text:

since I otherwise get a 'charmap' error.

And just one more thing: I also wanted to output the number of elements (words) in the text file by using:

# Use a context manager so the file is closed after reading.
with open("filename.txt", "rt", encoding="utf8") as file:
    data = file.read()
words = data.split()

print('Number of words :', len(words))

which I did. Now I'm trying to do the same with the word pairs from your answer. Basically, each of those pairs would be one element, for example:

lemonade-juice ---> one element

So if we were to count all of these from a text file:

lemonade-juice
juice-whiskey
whiskey-beer
beer-soda
soda-vodka
vodka-cola
cola-tequila
tequila-lemonade
lemonade-juice

we would get an output of 9 elements, or

Number of word-pairs: 9

I was thinking of trying to do that using the len function and calling it on text_file. Correct me if I'm looking in the wrong direction.

Once again, thank you for your time.
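
On the counting question: `len()` on `text_file` would only give the length of the file name string, and `len()` cannot be applied to a generator at all. One possible approach is to tally the pairs with `collections.Counter`, which also helps with the probability bonus task. In the sketch below, the helper name `count_pairs` is hypothetical, and "probability" is assumed to mean the relative frequency of a pair (its count divided by the total number of pairs); other definitions are possible.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def count_pairs(pairs: Iterable[str]) -> Tuple[int, Dict[str, float]]:
    """Return the total number of pairs and each pair's relative frequency."""
    counts = Counter(pairs)  # consumes the iterable once
    total = sum(counts.values())
    # Assumption: "probability" = count of the pair / total number of pairs.
    probabilities = {pair: count / total for pair, count in counts.items()}
    return total, probabilities


# Stand-in for pair_generator('filename.txt'), using the pairs listed above.
sample_pairs = [
    'lemonade-juice', 'juice-whiskey', 'whiskey-beer', 'beer-soda',
    'soda-vodka', 'vodka-cola', 'cola-tequila', 'tequila-lemonade',
    'lemonade-juice',
]
total, probabilities = count_pairs(sample_pairs)
print('Number of word-pairs:', total)
for pair, probability in probabilities.items():
    print(f'{pair}: {probability:.3f}')
```

Because `Counter` accepts any iterable, the generator from the answer can be passed in directly, so the pairs never have to be stored in a list first. If only the total is needed, `sum(1 for _ in pair_generator('filename.txt'))` would count the pairs without keeping them.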
