
Read from a txt file and divide words

I would like to create a program in Python that reads a txt file given as input by the user. Then I would like the program to separate the words as in the example below:

At the time of his accession, the Swedish Riksdag held more power than the monarchy but was bitterly divided between rival parties.

  • At the time
  • the time of
  • time of his
  • of his accession
  • his accession the ...

I also want the program to save these in a different file. Any ideas?

You did not say what format you want for the output file. Assuming you want one triplet per line, this would do:

def only_letters(word):
    return ''.join(c for c in word if 'a' <= c <= 'z' or 'A' <= c <= 'Z')

with open('input.txt') as f, open('output.txt', 'w') as w:
    s = f.read()
    words = [only_letters(word) for word in s.split()]
    triplets = [words[i:i + 3] for i in range(len(words) - 2)]
    for triplet in triplets:
        w.write(' '.join(triplet) + '\n')
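
The same idea can be generalized to any window size. The sketch below is only an illustration: the function name `ngrams` and the rule that a word is any run of ASCII letters, digits, or apostrophes are assumptions, not part of the answer above.

```python
import re

def ngrams(text, n=3):
    """Return every run of n consecutive words in text, in order."""
    # Assumption: a "word" is a run of ASCII letters, digits, or apostrophes.
    words = re.findall(r"[A-Za-z0-9']+", text)
    # zip over n staggered views of the word list; zip stops at the
    # shortest view, so fewer than n words simply yields an empty list.
    return [" ".join(group) for group in zip(*(words[i:] for i in range(n)))]

sentence = "At the time of his accession, the Swedish Riksdag held more power"
for gram in ngrams(sentence)[:5]:
    print(gram)
```

Writing the result to a file is then one `w.write(gram + '\n')` per item, as in the loop above.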

You can try this. Note that it needs at least 3 words of input to write anything, and it raises StopIteration if it gets fewer than 2.

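def get_words():
    with open("file.txt", "r") as f:
        # Read every line, not just the first; split() with no argument
        # also discards the trailing newline and repeated spaces.
        for line in f:
            for word in line.split():
                yield word.replace(",", "").replace(".", "")

with open("output.txt", "w") as f:
    it = get_words()
    # Seed the window with a placeholder plus the first two words;
    # the placeholder is dropped on the first loop iteration.
    current = [""] + [next(it) for _ in range(2)]
    for word in it:
        current = current[1:] + [word]
        f.write(" ".join(current) + "\n")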
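
A streaming variant that copes with short input could use `collections.deque` with a fixed `maxlen`. This is a sketch under assumptions: the helper names `iter_words` and `sliding_windows` are made up for illustration, and punctuation is stripped only from the ends of each word.

```python
from collections import deque
import string

def iter_words(lines):
    """Yield words from an iterable of lines, stripping punctuation at the ends."""
    for line in lines:
        for word in line.split():
            word = word.strip(string.punctuation)
            if word:  # skip tokens that were punctuation only
                yield word

def sliding_windows(words, n=3):
    """Yield each run of n consecutive words as a space-joined string."""
    window = deque(maxlen=n)  # the oldest word falls out automatically
    for word in words:
        window.append(word)
        if len(window) == n:
            yield " ".join(window)

# In-memory lines stand in for an open file here.
lines = ["At the time of his\n", "accession, the Swedish Riksdag\n"]
for gram in sliding_windows(iter_words(lines)):
    print(gram)
```

An open file object can be passed in place of `lines`, since iterating over a file yields its lines one at a time. Input shorter than `n` words produces no output rather than an exception.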

My understanding is that you want to generate n-grams, which is a common step in text vectorization for NLP. Here is a simple implementation:

from sklearn.feature_extraction.text import CountVectorizer

string = ["At the time of his accession, the Swedish Riksdag held more power than the monarchy but was bitterly divided between rival parties."]
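# you can change the ngram_range to get any combination of words
# (no stop_words here: stop words such as "at", "the" and "of" would be
# removed before the n-grams are built, which would not match the output below)
vectorizer = CountVectorizer(encoding='utf-8', ngram_range=(3,3))

X = vectorizer.fit_transform(string)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(vectorizer.get_feature_names_out().tolist())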

This gives you a list of n-grams of length 3, but the original order is lost: the n-grams are lowercased and returned in alphabetical order.

['accession the swedish', 'at the time', 'between rival parties', 'bitterly divided between', 'but was bitterly', 'divided between rival', 'held more power', 'his accession the', 'monarchy but was', 'more power than', 'of his accession', 'power than the', 'riksdag held more', 'swedish riksdag held', 'than the monarchy', 'the monarchy but', 'the swedish riksdag', 'the time of', 'time of his', 'was bitterly divided']
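
If the original order matters, the same kind of trigrams can be built without scikit-learn. The sketch below is an approximation, not CountVectorizer itself: the function name `ordered_ngrams` is hypothetical, and it assumes the vectorizer's default settings (lowercasing, and tokens of two or more word characters).

```python
import re

# Assumed to mirror CountVectorizer's default token pattern:
# words of at least two word characters.
TOKEN = re.compile(r"\b\w\w+\b")

def ordered_ngrams(text, n=3):
    """Return unique n-grams in order of first appearance."""
    tokens = TOKEN.findall(text.lower())
    seen = set()
    result = []
    for i in range(len(tokens) - n + 1):
        gram = " ".join(tokens[i:i + n])
        if gram not in seen:  # keep only the first occurrence
            seen.add(gram)
            result.append(gram)
    return result

sentence = ("At the time of his accession, the Swedish Riksdag held more power "
            "than the monarchy but was bitterly divided between rival parties.")
for gram in ordered_ngrams(sentence)[:5]:
    print(gram)
```

Sorting the returned list should reproduce the alphabetical list above for this sentence.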

