
How to split a text file into its words in Python?

I am very new to Python and haven't worked with text before... I have 100 text files, each with around 100 to 150 lines of unstructured text describing a patient's condition. I read one file in Python using:

with open("C:\\...\\...\\...\\record-13.txt") as f:
    content = f.readlines()
    print (content) 

Now I can split each line of this file into its words using, for example:

a = content[0].split()
print (a)

but I don't know how to split the whole file into words. Would loops (while or for) help with that?


Thank you for your help, guys. Your answers helped me write this (in my file, words are separated by spaces, so that's the delimiter, I think!):

with open("C:\\...\\...\\...\\record-13.txt") as f:
    lines = f.readlines()
    for line in lines:
        words = line.split()
        for word in words:
            print(word)

which simply prints the words one per line.

It depends on how you define words, or what you regard as the delimiters.
Note that str.split in Python takes an optional separator parameter, so you could pass it like this:

for chunk in content[0].split():   # first split the line on whitespace
    for word in chunk.split(','):  # then split each chunk on commas
        print(word)

Unfortunately, str.split takes only a single separator, so you may need multi-level splitting like this:

for lines in content[0].split():
    for split0 in lines.split(' '):
        for split1 in split0.split(','):
            for split2 in split1.split('.'):
                for split3 in split2.split('?'):
                    for split4 in split3.split('!'):
                        for word in split4.split(':'): 
                            if word != "":
                                print(word)

Looks ugly, right? Luckily we can use iteration instead:

delimiters = ['\n', ' ', ',', '.', '?', '!', ':', 'and_what_else_you_need']
words = content  # start from the list of lines read earlier
for delimiter in delimiters:
    new_words = []
    for word in words:
        new_words += word.split(delimiter)
    words = new_words
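To see the iterative approach above in action, here is a quick self-contained check using a couple of made-up sample lines instead of the real patient files; note that splitting on a delimiter leaves empty strings behind, so they are filtered out at the end:

```python
delimiters = ['\n', ' ', ',', '.', '?', '!', ':']
words = ["Patient stable.\n", "No fever, no pain!\n"]  # hypothetical sample lines

# repeatedly re-split the word list on each delimiter in turn
for delimiter in delimiters:
    new_words = []
    for word in words:
        new_words += word.split(delimiter)
    words = new_words

# splitting leaves empty strings behind, so drop them
words = [w for w in words if w]
print(words)  # ['Patient', 'stable', 'No', 'fever', 'no', 'pain']
```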

EDITED: Or we could simply use the re (regular expression) module. The delimiters have to be escaped, because characters such as . and ? are special in regular expressions, and re.split needs a single string rather than a list of lines:

import re
delimiters = ['\n', ' ', ',', '.', '?', '!', ':', 'and_what_else_you_need']
pattern = '|'.join(map(re.escape, delimiters))
words = [w for w in re.split(pattern, ''.join(content)) if w]
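A shorter regex route, not part of the original answer, is to match the words directly instead of splitting on delimiters, for example with re.findall:

```python
import re

# Hypothetical sample line standing in for a line of the patient records.
line = "Patient stable. No fever, no pain!"

# \w+ matches runs of letters, digits, and underscores, so punctuation is
# skipped automatically and no empty strings are produced.
words = re.findall(r'\w+', line)
print(words)  # ['Patient', 'stable', 'No', 'fever', 'no', 'pain']
```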

with open("C:\\...\\...\\...\\record-13.txt") as f:
    for line in f:
        for word in line.split():
            print(word)

Or, this gives you a list of words

with open("C:\\...\\...\\...\\record-13.txt") as f:
    words = [word for line in f for word in line.split()]

Or, this gives you a list of lines, but with each line as a list of words.

with open("C:\\...\\...\\...\\record-13.txt") as f:
    words = [line.split() for line in f]
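To make the difference between the two shapes concrete (using made-up lines rather than the real records), here is a small comparison; the nested form can also be flattened back into the flat one with itertools.chain:

```python
from itertools import chain

lines = ["hi there\n", "how are you\n"]  # stand-in for the file's lines

nested = [line.split() for line in lines]                 # list of lists
flat = [word for line in lines for word in line.split()]  # flat list

print(nested)  # [['hi', 'there'], ['how', 'are', 'you']]
print(flat)    # ['hi', 'there', 'how', 'are', 'you']

# flattening the nested form recovers the flat form
assert list(chain.from_iterable(nested)) == flat
```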

Nobody has suggested a generator; I'm surprised. Here's how I would do it:

def words(stringIterable):
    # upcast the argument to an iterator; if it's already an iterator, it stays the same
    lineStream = iter(stringIterable)
    for line in lineStream: #enumerate the lines
        for word in line.split(): #further break them down
            yield word

Now this can be used both on simple lists of sentences that you might have in memory already:

listOfLines = ['hi there', 'how are you']
for word in words(listOfLines):
    print(word)

But it will work just as well on a file, without needing to read the whole file in memory:

with open('words.py', 'r') as myself:
    for word in words(myself):
        print(word)

I would use the Natural Language Toolkit (NLTK), because plain split() does not deal well with punctuation.

import nltk  # pip install nltk; also download the tokenizer data once via nltk.download('punkt')

words = []
with open("C:\\...\\...\\...\\record-13.txt") as f:
    for line in f:
        words += nltk.word_tokenize(line)

The most flexible approach is to use a list comprehension to generate a list of words:

with open("C:\\...\\...\\...\\record-13.txt") as f:
    words = [word
             for line in f
             for word in line.split()]

# Do what you want with the words list

Which you can then iterate over, feed into a collections.Counter, or do anything else you please.
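For instance, counting word frequencies with collections.Counter, shown here on made-up lines rather than the real files:

```python
from collections import Counter

lines = ["patient reports pain\n", "pain in left knee\n"]  # hypothetical stand-in data
words = [word for line in lines for word in line.split()]

counts = Counter(words)
print(counts.most_common(1))  # [('pain', 2)]
```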
