Counting word pairs from a text file [Python]

Question

I have a text file with the following content:

Lemonade juice whiskey beer soda vodka

In Python, reading from that same .txt file, I would like to output word pairs in the following order:

juice-lemonade
whiskey-juice
beer-whiskey
soda-beer
vodka-soda

I managed to output something like that by using a list instead of opening a file in Python, but for a large .txt file that is not really a practical solution. As a bonus task, I would also like to output the probability of each of those pairs. Any hint would be highly appreciated.

Answer 1

To read large files efficiently, you should read them line-by-line, or (if you have really long lines, which is what the snippet below assumes) token-by-token.

A clean way to do this while keeping an open handle on a file is by using generators that yield a word at a time.

You can have another generator that combines 2 words at a time and yields pairs.

from typing import Iterator

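def memory_efficient_word_generator(text_file: str) -> Iterator[str]:
    word = ''
    with open(text_file) as text:
        while True:
            # Read one character at a time so memory use stays constant
            character = text.read(1)
            if not character:
                # End of file: emit the last word if no whitespace follows it
                if word:
                    yield word.lower()
                return
            if character.isspace():
                # May yield '' on consecutive whitespace; pair_generator skips those
                yield word.lower()
                word = ''
            else:
                word += character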


def pair_generator(text_file: str) -> Iterator[str]:
    previous_word = ''
    for word in memory_efficient_word_generator(text_file):
        if previous_word and word:
            yield f'{previous_word}-{word}'
        previous_word = word or previous_word


for pair in pair_generator('filename.txt'):
    print(pair)

Assuming filename.txt contains:

Lemonade juice whiskey beer soda vodka

cola tequila lemonade juice

You should see something like:

lemonade-juice
juice-whiskey
whiskey-beer
beer-soda
soda-vodka
vodka-cola
cola-tequila
tequila-lemonade
lemonade-juice

Of course, there's a lot more you should handle depending on your desired behaviour (for example, handling non-alphabetic characters in your input).
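As one possible way to handle non-alphabetic characters, here is a minimal sketch that reads line by line and keeps only runs of letters. The names clean_words and clean_pairs and the pattern are illustrative assumptions, not part of the snippet above, and treating every non-letter as a separator is just one choice (it would split "don't" into "don" and "t").

```python
import re
from typing import Iterable, Iterator

# Assumption: a "word" is any run of Unicode letters; digits, underscores
# and punctuation all act as separators.
WORD_PATTERN = re.compile(r"[^\W\d_]+")


def clean_words(lines: Iterable[str]) -> Iterator[str]:
    # Works on any iterable of lines, including an open file object
    for line in lines:
        for match in WORD_PATTERN.finditer(line):
            yield match.group().lower()


def clean_pairs(lines: Iterable[str]) -> Iterator[str]:
    previous_word = None
    for word in clean_words(lines):
        if previous_word is not None:
            yield f'{previous_word}-{word}'
        previous_word = word


sample = ["Lemonade, juice; whiskey!\n", "\n", "beer (soda) vodka.\n"]
for pair in clean_pairs(sample):
    print(pair)
```

Since an open file is itself an iterable of lines, the same function can be fed a file handle directly, for example clean_pairs(open('filename.txt', encoding='utf-8')) inside a with block.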

Answer 2

Thank you very much for the feedback. That's pretty much it. I just added encoding='utf-8' here:

with open(text_file, encoding='utf-8') as text:

since otherwise I get a 'charmap' error.

One more thing: I also wanted to output the number of elements (words) in the text file, using:

# Use a with block so the file is closed after reading
with open("filename.txt", "rt", encoding="utf8") as file:
    data = file.read()
words = data.split()

print('Number of words :', len(words))

That works. Now I'm trying to do the same with the word pairs you sent, where each pair would count as one element, for example:

lemonade-juice ---> one element

So if we were to count all of these from a text file:

lemonade-juice
juice-whiskey
whiskey-beer
beer-soda
soda-vodka
vodka-cola
cola-tequila
tequila-lemonade
lemonade-juice

we would get an output of 9 elements, or

Number of word-pairs: 9

I was thinking of doing that with the len function on text_file. Correct me if I'm looking in the wrong direction.
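A note on that idea: len does not work on a generator, and text_file is only the file name, so len(text_file) would return the length of that string. One option is to consume the generator and count as you go, e.g. sum(1 for _ in pair_generator('filename.txt')). The sketch below goes one step further and uses collections.Counter to get both the total and a per-pair share. It assumes that "probability" means the relative frequency of a pair among all pairs, which is only one possible reading of the bonus task; word_pairs and pair_statistics are hypothetical helper names, and the input is a list of lines standing in for a file.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, Tuple


def word_pairs(lines: Iterable[str]) -> Iterator[str]:
    # Yield 'previous-current' for every adjacent pair of words,
    # carrying the previous word across line breaks
    previous_word = None
    for line in lines:
        for word in line.lower().split():
            if previous_word is not None:
                yield f'{previous_word}-{word}'
            previous_word = word


def pair_statistics(lines: Iterable[str]) -> Tuple[int, Dict[str, float]]:
    counts = Counter(word_pairs(lines))
    total = sum(counts.values())
    # Assumption: probability = occurrences of the pair / total number of pairs
    frequencies = {pair: n / total for pair, n in counts.items()}
    return total, frequencies


sample = [
    "Lemonade juice whiskey beer soda vodka\n",
    "\n",
    "cola tequila lemonade juice\n",
]
total, frequencies = pair_statistics(sample)
print('Number of word-pairs:', total)  # Number of word-pairs: 9
for pair, probability in frequencies.items():
    print(f'{pair}: {probability:.3f}')
```

With the sample above, lemonade-juice occurs twice out of nine pairs, so its share is 2/9, while every other pair gets 1/9. Only the distinct pairs are kept in memory, not the whole file.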

Once again, thank you for your time.
