
Counting word pairs from a text file [Python]

Given a text file with the following content:

Lemonade juice whiskey beer soda vodka

In Python, reading from that same .txt file, I would like to output word pairs in the following order:

  • juice-lemonade
  • whiskey-juice
  • beer-whiskey
  • soda-beer
  • vodka-soda

I managed to output something like that by using a list instead of opening a file in Python, but for a large .txt file that is not really a practical solution. As a bonus task, I would also like to output the probability of each of those pairs. Any kind of hint would be highly appreciated.

To read large files efficiently, you should read them line-by-line, or (if you have really long lines, which is what the snippet below assumes) token-by-token.

A clean way to do this while keeping an open handle on a file is by using generators that yield a word at a time.

You can have another generator that combines 2 words at a time and yields pairs.

from typing import Iterator

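def memory_efficient_word_generator(text_file: str) -> Iterator[str]:
    """Yield lower-cased words one at a time, reading one character at a time."""
    word = ''
    with open(text_file) as text:
        while True:
            character = text.read(1)
            if not character:
                # End of file: emit the last word if the file
                # does not end with whitespace.
                if word:
                    yield word.lower()
                return
            if character.isspace():
                # Yields '' on consecutive whitespace; pair_generator skips those.
                yield word.lower()
                word = ''
            else:
                word += character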


def pair_generator(text_file: str) -> Iterator[str]:
    previous_word = ''
    for word in memory_efficient_word_generator(text_file):
        if previous_word and word:
            yield f'{previous_word}-{word}'
        previous_word = word or previous_word


for pair in pair_generator('filename.txt'):
    print(pair)

Assuming filename.txt contains:

Lemonade juice whiskey beer soda vodka

cola tequila lemonade juice

You should see something like:

lemonade-juice
juice-whiskey
whiskey-beer
beer-soda
soda-vodka
vodka-cola
cola-tequila
tequila-lemonade
lemonade-juice
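If the lines in your file are of ordinary length, reading line by line is simpler than reading one character at a time. Here is a minimal sketch of that variant. The names words_from_lines, pairs_from_lines and pairs_from_file are my own, and the utf-8 default is an assumption about your file:

```python
from typing import Iterable, Iterator


def words_from_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield lower-cased words from any iterable of lines (an open file works)."""
    for line in lines:
        # split() with no arguments splits on any run of whitespace
        # and never produces empty strings, so blank lines are skipped.
        for word in line.split():
            yield word.lower()


def pairs_from_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield 'previous-current' for each pair of adjacent words."""
    previous_word = ''
    for word in words_from_lines(lines):
        if previous_word:
            yield f'{previous_word}-{word}'
        previous_word = word


def pairs_from_file(path: str, encoding: str = 'utf-8') -> Iterator[str]:
    """Stream pairs from a file, holding only one line in memory at a time."""
    with open(path, encoding=encoding) as text:
        yield from pairs_from_lines(text)
```

Looping over pairs_from_file('filename.txt') and printing each item should give the same nine pairs for the sample file.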

Of course, there's a lot more you should handle depending on your desired behaviour (for example, handling non-alphabetic characters in your input).
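As one possible way to deal with non-alphabetic characters, you could extract runs of letters with a regular expression instead of splitting on whitespace. This is only a sketch: WORD_PATTERN and alphabetic_words are names I made up, and treating digits and punctuation as separators is an assumption that may not suit your data (for instance, "don't" becomes "don" and "t").

```python
import re
from typing import Iterator

# [^\W\d_] is "a word character that is neither a digit nor an underscore",
# which in practice means a letter. Everything else acts as a separator.
WORD_PATTERN = re.compile(r"[^\W\d_]+")


def alphabetic_words(text: str) -> Iterator[str]:
    """Yield lower-cased runs of letters found in text."""
    for match in WORD_PATTERN.finditer(text):
        yield match.group().lower()


print(list(alphabetic_words("Lemonade, juice! whiskey-beer")))
# prints ['lemonade', 'juice', 'whiskey', 'beer']
```

You could apply this to each line of the file in place of a plain whitespace split.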

Thank you very much for the feedback. That's pretty much it; I just added encoding='utf-8' here:

with open(text_file, encoding='utf-8') as text:

since I was getting a 'charmap' error without it.

One more thing: I also wanted to output the number of elements (words) in the text file by using:

# Use a with block so the file is closed once it has been read.
with open("filename.txt", "rt", encoding="utf8") as file:
    data = file.read()
words = data.split()

print('Number of words :', len(words))

which works. Now I'm trying to do the same with the word pairs from your answer, where each pair counts as one element, for example:

lemonade-juice ---> one element

So if we were to count all of these from a text file:

lemonade-juice
juice-whiskey
whiskey-beer
beer-soda
soda-vodka
vodka-cola
cola-tequila
tequila-lemonade
lemonade-juice

we would get the output of 9 elements or

Number of word-pairs: 9

I was thinking of trying to do that by using the len function and calling it on text_file. Correct me if I'm looking in the wrong direction.

Once again, thank you for your time.
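Calling len on text_file would only give you the length of the file name, since it is just a string, and len does not work on a generator at all (it raises a TypeError). One way to get the count is to consume the pairs into a collections.Counter and sum the counts, which also gives you a starting point for the probability bonus task. The sketch below uses an in-memory list of words as a stand-in for your file, pairs_from_words is a name I made up, and "probability" is assumed to mean the relative frequency of a pair among all pairs:

```python
from collections import Counter
from typing import Iterable, Iterator


def pairs_from_words(words: Iterable[str]) -> Iterator[str]:
    """Yield 'previous-current' for each pair of adjacent words."""
    previous = None
    for word in words:
        if previous is not None:
            yield f"{previous}-{word}"
        previous = word


# Stand-in for the words streamed from the file (the sample from this thread).
words = "lemonade juice whiskey beer soda vodka cola tequila lemonade juice".split()

pair_counts = Counter(pairs_from_words(words))
total_pairs = sum(pair_counts.values())
print("Number of word-pairs:", total_pairs)  # prints: Number of word-pairs: 9

# Assumption: probability = occurrences of this pair / total number of pairs.
for pair, count in pair_counts.items():
    print(f"{pair}: {count / total_pairs:.3f}")
```

If you only need the count and not the frequencies, sum(1 for _ in pairs) over any pair generator avoids storing anything.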
