I have a text file with the following content:
Lemonade juice whiskey beer soda vodka
In Python, using that same .txt file, I would like to output word-pairs in the following order:
I managed to output something like that by using a list instead of opening the file in Python, but for a larger .txt file that is not really a practical solution. As a bonus task, I would also like to output the probability of each of those pairs. Any kind of hint would be highly appreciated.
To read large files efficiently, you should read them line-by-line, or (if you have really long lines, which is what the snippet below assumes) token-by-token.
A clean way to do this while keeping an open handle on a file is by using generators that yield a word at a time.
You can have another generator that combines 2 words at a time and yields pairs.
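from typing import Iterator


def memory_efficient_word_generator(text_file: str) -> Iterator[str]:
    """Yield lower-cased words one at a time, reading one character at a time."""
    word = ''
    with open(text_file) as text:
        while True:
            character = text.read(1)
            if not character:
                # End of file: emit the last word if the file
                # does not end with whitespace.
                if word:
                    yield word.lower()
                return
            if character.isspace():
                # This may yield '' for consecutive whitespace;
                # pair_generator skips empty words.
                yield word.lower()
                word = ''
            else:
                word += character


def pair_generator(text_file: str) -> Iterator[str]:
    """Yield each pair of consecutive words as 'first-second'."""
    previous_word = ''
    for word in memory_efficient_word_generator(text_file):
        if previous_word and word:
            yield f'{previous_word}-{word}'
        # Keep the last non-empty word when an empty one comes through.
        previous_word = word or previous_word


for pair in pair_generator('filename.txt'):
    print(pair)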
Assuming filename.txt contains:
Lemonade juice whiskey beer soda vodka
cola tequila lemonade juice
You should see something like:
lemonade-juice
juice-whiskey
whiskey-beer
beer-soda
soda-vodka
vodka-cola
cola-tequila
tequila-lemonade
lemonade-juice
Of course, there's a lot more you should handle depending on your desired behaviour (for example, handling non-alphabetic characters in your input).
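As one possible way to handle non-alphabetic characters, here is a minimal sketch that reads line-by-line and extracts words with a regular expression. It assumes a "word" is any run of letters, so punctuation and digits are dropped; words_from_lines and pairs_from_words are hypothetical helper names, not part of the snippet above. An open file object can be passed as lines, since iterating over it yields one line at a time.

```python
import re
from typing import Iterable, Iterator

# Assumption: a "word" is a run of letters (Unicode-aware);
# punctuation, digits and underscores act as separators.
WORD_PATTERN = re.compile(r"[^\W\d_]+")


def words_from_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield lower-cased words from any iterable of lines (e.g. an open file)."""
    for line in lines:
        for match in WORD_PATTERN.finditer(line):
            yield match.group().lower()


def pairs_from_words(words: Iterable[str]) -> Iterator[str]:
    """Yield each pair of consecutive words as 'first-second'."""
    previous_word = ''
    for word in words:
        if previous_word:
            yield f'{previous_word}-{word}'
        previous_word = word


# In-memory sample standing in for a file, so the sketch runs on its own.
sample_lines = ["Lemonade, juice; whiskey!", "beer... soda?"]
sample_pairs = list(pairs_from_words(words_from_lines(sample_lines)))
print(sample_pairs)
```

With a real file you would write something like `with open('filename.txt', encoding='utf-8') as text:` and pass `text` to words_from_lines. Whether digits or apostrophes should count as part of a word depends on your data, so adjust the pattern accordingly.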
Thank you very much for the feedback. That's pretty much it. I just added encoding='utf-8' here:
with open(text_file, encoding='utf-8') as text:
since it raised a 'charmap' error for me.
One more thing: I also wanted to output the number of elements (words) in the text file by using:
# Use a with-block so the file is closed after reading.
with open("filename.txt", "rt", encoding="utf8") as file:
    data = file.read()
words = data.split()
print('Number of words :', len(words))
That works. Now I'm trying to do the same with the word-pairs you sent, where each pair would count as one element, for example:
lemonade-juice ---> one element
So if we were to count all of these from a text file:
lemonade-juice
juice-whiskey
whiskey-beer
beer-soda
soda-vodka
vodka-cola
cola-tequila
tequila-lemonade
lemonade-juice
we would get the output of 9 elements or
Number of word-pairs: 9
I was thinking of trying to do that by using the len function and calling it on text_file. Correct me if I'm looking in the wrong direction.
Once again, thank you for your time.
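A note on the len idea: len(text_file) would only return the number of characters in the file name, and len() raises a TypeError on a generator, because a generator has no length until it has been consumed. A minimal sketch of counting the pairs instead, and of the bonus task, is below. It assumes "probability" means the relative frequency of each pair among all pairs; pair_probabilities is a hypothetical helper name, and the in-memory list stands in for the output of pair_generator('filename.txt').

```python
from collections import Counter
from typing import Dict, Iterable


def pair_probabilities(pairs: Iterable[str]) -> Dict[str, float]:
    """Map each pair to its relative frequency (assumed meaning of 'probability')."""
    counts = Counter(pairs)
    total = sum(counts.values())
    # An empty input yields an empty dict, so no division by zero occurs.
    return {pair: count / total for pair, count in counts.items()}


# Stand-in for pair_generator('filename.txt'), so the sketch runs on its own.
sample_pairs = [
    'lemonade-juice', 'juice-whiskey', 'whiskey-beer',
    'beer-soda', 'soda-vodka', 'vodka-cola',
    'cola-tequila', 'tequila-lemonade', 'lemonade-juice',
]

# Count by iterating; this also works when the source is a generator.
number_of_pairs = sum(1 for _ in sample_pairs)
print('Number of word-pairs:', number_of_pairs)

probabilities = pair_probabilities(sample_pairs)
for pair, probability in probabilities.items():
    print(f'{pair}: {probability:.3f}')
```

A generator can only be consumed once, so to get both the count and the probabilities from a file you would either call the generator twice or take the count from the Counter totals.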