
Extracting specific information from text

I'd like to extract some data from a text file. I've decided to do it using the Natural Language Toolkit (NLTK), but I'm open to suggestions if there is a better way to do this.

Here is an example:

I need a flight from New York NY to San Francisco CA.

From this text, I'd like to get the city and state for both the origin and the destination.

Here is what I have so far:

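from nltk.corpus import PlaintextCorpusReader
from nltk.text import Text

def readfiles():
    # Raw string so the backslashes in the Windows path are taken literally
    corpus_root = r'C:\prototype\emails'
    # Load every file under corpus_root as one plain-text corpus
    w = PlaintextCorpusReader(corpus_root, '.*')
    t = Text(w.words())

    # concordance() prints its matches itself and returns None,
    # so it is called directly instead of being wrapped in print
    print("--- to ----")
    t.concordance("to")

    print("--- from ----")
    t.concordance("from")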

I can read the text from some input (a file in my case) and then use the concordance method to find every usage of a word. I want to extract the city and state information that comes after 'to' and 'from'.

The question is: what is the best way to look at the text that follows each instance of 'to' and 'from'?

Perhaps you're better off reading the file in line by line?
Then something as simple as:

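# Assumes dataAfterTo holds the text after "to" in "City, ST" form, e.g. "San Francisco, CA."
cityState = dataAfterTo.split(",")
city = cityState[0].strip()
# Take the first token after the comma and drop trailing punctuation
state = cityState[1].split()[0].rstrip(".")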

Unless you're dealing with user-generated content, of course.
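
The example sentence in the question has no comma between city and state, so the split(",") approach would not work on it as written. Here is a minimal sketch of the same line-by-line idea using a regular expression instead. It assumes each city is one or more capitalized words followed by a two-letter uppercase state code, with an optional comma in between. The names PLACE, ROUTE and extract_routes are made up for this illustration and are not part of NLTK.

```python
import re

# Assumed shape of a place: capitalized word(s), optional comma, 2-letter state.
PLACE = r"((?:[A-Z][a-z]+\s)*[A-Z][a-z]+),?\s([A-Z]{2})\b"
# Assumed shape of a route: "from <place> to <place>".
ROUTE = re.compile(r"\b[Ff]rom\s+" + PLACE + r"\s+to\s+" + PLACE)

def extract_routes(lines):
    """Yield (origin_city, origin_state, dest_city, dest_state) for each match."""
    for line in lines:
        for match in ROUTE.finditer(line):
            yield match.groups()

sample = ["I need a flight from New York NY to San Francisco CA."]
for route in extract_routes(sample):
    print(route)  # ('New York', 'NY', 'San Francisco', 'CA')
```

Because extract_routes accepts any iterable of strings, an open file object can be passed in directly to process a file line by line. Free-form user input (lowercase city names, spelled-out states, a route split across lines) would not match this pattern and would likely need a gazetteer or named-entity recognition instead.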
