
Reading and writing POS tagged sentences from text files using NLTK and Python

Does anyone know if there is an existing module or easy method for reading and writing part-of-speech tagged sentences to and from text files? I'm using Python and the Natural Language Toolkit (NLTK). For example, this code:

import nltk

sentences = "Call me Ishmael. Some years ago - never mind how long precisely - having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world."

tagged = nltk.sent_tokenize(sentences.strip())
tagged = [nltk.word_tokenize(sent) for sent in tagged]
tagged = [nltk.pos_tag(sent) for sent in tagged]

print(tagged)

Returns this nested list:

[[('Call', 'NNP'), ('me', 'PRP'), ('Ishmael', 'NNP'), ('.', '.')], [('Some', 'DT'), ('years', 'NNS'), ('ago', 'RB'), ('-', ':'), ('never', 'RB'), ('mind', 'VBP'), ('how', 'WRB'), ('long', 'JJ'), ('precisely', 'RB'), ('-', ':'), ('having', 'VBG'), ('little', 'RB'), ('or', 'CC'), ('no', 'DT'), ('money', 'NN'), ('in', 'IN'), ('my', 'PRP$'), ('purse', 'NN'), (',', ','), ('and', 'CC'), ('nothing', 'NN'), ('particular', 'JJ'), ('to', 'TO'), ('interest', 'NN'), ('me', 'PRP'), ('on', 'IN'), ('shore', 'NN'), (',', ','), ('I', 'PRP'), ('thought', 'VBD'), ('I', 'PRP'), ('would', 'MD'), ('sail', 'VB'), ('about', 'IN'), ('a', 'DT'), ('little', 'RB'), ('and', 'CC'), ('see', 'VB'), ('the', 'DT'), ('watery', 'NN'), ('part', 'NN'), ('of', 'IN'), ('the', 'DT'), ('world', 'NN'), ('.', '.')]]

I know I could easily dump this into a pickle, but I really want to export it as a segment of a larger text file. I'd like to be able to export the list to a text file, then return to it later, parse it, and recover the original list structure. Are there any built-in functions in the NLTK for doing this? I've looked but can't find any...

Example output:

<headline>Article headline</headline>
<body>Call me Ishmael...</body>
<pos_tags>[[('Call', 'NNP'), ('me', 'PRP'), ('Ishmael', 'NNP')...</pos_tags>

The NLTK has a standard file format for tagged text. It looks like this:

Call/NNP me/PRP Ishmael/NNP ./.

You should use this format, since it allows you to read your files with the NLTK's TaggedCorpusReader and other similar classes, and get the full range of corpus reader functions. Confusingly, there is no high-level function in the NLTK for writing a tagged corpus in this format, but that's probably because it's pretty trivial:

for sent in tagged:
    print(" ".join(word + "/" + tag for word, tag in sent))

(The NLTK does provide nltk.tag.tuple2str(), but it only handles one word; it's simpler to just write word+"/"+tag.)
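Putting that together, a minimal script that saves the nested list from the question to a file in this format might look like this (the filename file1.txt is just an example):

```python
# the nested [(word, tag), ...] structure from the question (first sentence only)
tagged = [[("Call", "NNP"), ("me", "PRP"), ("Ishmael", "NNP"), (".", ".")]]

# write one sentence per line in the NLTK's word/TAG format
with open("file1.txt", "w") as out:
    for sent in tagged:
        out.write(" ".join(word + "/" + tag for word, tag in sent) + "\n")
```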

If you save your tagged text in one or more files fileN.txt in this format, you can read it back with nltk.corpus.reader.TaggedCorpusReader like this:

mycorpus = nltk.corpus.reader.TaggedCorpusReader("path/to/corpus", r"file.*\.txt")
print(mycorpus.fileids())
print(mycorpus.sents()[0])
for sent in mycorpus.tagged_sents():
    ...  # each sent is a list of (word, tag) tuples

Note that the sents() method gives you the untagged text, albeit a bit oddly spaced. There's no need to include both tagged and untagged versions in the file, as in your example.

The TaggedCorpusReader doesn't support file headers (for the title etc.), but if you really need that you can derive your own class that reads the file metadata and then handles the rest like TaggedCorpusReader .
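As a rough sketch of that approach, assuming the metadata is a single headline on the first line and the rest of the file is one tagged sentence per line in word/TAG format (the helper name here is hypothetical, not part of NLTK):

```python
def read_tagged_with_header(text):
    # Hypothetical helper, not part of NLTK: the first line is the headline,
    # the remaining lines are tagged sentences in word/TAG format.
    lines = text.splitlines()
    headline = lines[0]
    tagged_sents = [
        # rsplit("/", 1) splits on the last slash, much like nltk.tag.str2tuple
        [tuple(tok.rsplit("/", 1)) for tok in line.split()]
        for line in lines[1:] if line.strip()
    ]
    return headline, tagged_sents

headline, sents = read_tagged_with_header(
    "Article headline\nCall/NNP me/PRP Ishmael/NNP ./."
)
```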

It seems like using pickle.dumps and inserting its output into your text file, perhaps with a tag wrapper for automated loading, would satisfy your requirements.
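For instance, a sketch of that idea using base64 so the pickled bytes survive inside a text file (the &lt;pos_pickle&gt; tag name is made up for illustration):

```python
import base64
import pickle

tagged = [[("Call", "NNP"), ("me", "PRP"), ("Ishmael", "NNP"), (".", ".")]]

# pickle.dumps returns bytes, so base64-encode before embedding in a text file
blob = base64.b64encode(pickle.dumps(tagged)).decode("ascii")
wrapped = "<pos_pickle>" + blob + "</pos_pickle>"

# later: strip the wrapper and unpickle to recover the original structure
inner = wrapped[len("<pos_pickle>"):-len("</pos_pickle>")]
restored = pickle.loads(base64.b64decode(inner))
```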

Can you be more specific about what you would like the text output to look like? Are you aiming for something that is more human-readable?

EDIT: adding some code

from xml.dom.minidom import Document, parseString
import nltk

sentences = "Call me Ishmael. Some years ago - never mind how long precisely - having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world."

tagged = nltk.sent_tokenize(sentences.strip())
tagged = [nltk.word_tokenize(sent) for sent in tagged]
tagged = [nltk.pos_tag(sent) for sent in tagged]

# Write to xml string
doc = Document()

base = doc.createElement("Document")
doc.appendChild(base)

headline = doc.createElement("headline")
htext = doc.createTextNode("Article Headline")
headline.appendChild(htext)
base.appendChild(headline)

body = doc.createElement("body")
btext = doc.createTextNode(sentences)
body.appendChild(btext)
base.appendChild(body)

pos_tags = doc.createElement("pos_tags")
tagtext = doc.createTextNode(repr(tagged))
pos_tags.appendChild(tagtext)
base.appendChild(pos_tags)

xmlstring = doc.toxml()

# Read back tagged
import ast

doc2 = parseString(xmlstring)
el = doc2.getElementsByTagName("pos_tags")[0]
text = el.firstChild.nodeValue
tagged2 = ast.literal_eval(text)  # safer than eval for untrusted input

print("Equal?", tagged == tagged2)
