
Extract city names from text using Python

I have a dataset in which the heading of one column is "What is your location and time zone?"

This means we have entries like

  1. Denmark, CET
  2. Location is Devon, England, GMT time zone
  3. Australia. Australian Eastern Standard Time. +10h UTC.

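and even

  1. My location is Eugene, Oregon for most of the year or in Seoul, South Korea depending on school holidays. My primary time zone is the Pacific time zone.
  2. For the entire May I will be in London, United Kingdom (GMT+1). For the entire June I will be in either Norway (GMT+2) or Israel (GMT+3) with limited internet access. For the entire July and August I will be in London, United Kingdom (GMT+1). And then from September, 2015, I will be in Boston, United States (EDT)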

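Is there any way to extract the city, country and time zone from this?

I was thinking of building an array (from an open-source dataset) of all country names, including short forms, together with city names and time zones. If any word in an answer matches a city, country, time zone or short form, it would be written to a new column in the same dataset and counted.

Is this practical?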
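
The lookup idea described above can be sketched roughly as follows. This is only an illustration: the name sets, the `ALIASES` mapping and the helper names `build_pattern` and `extract_places` are hypothetical, and a real run would load the lists from an open dataset rather than hard-code them.

```python
import re
from collections import Counter

# Hypothetical, hand-written lookup tables; a real run would load them
# from an open dataset of countries, cities and time zones.
COUNTRIES = {"Denmark", "England", "Australia", "South Korea",
             "United Kingdom", "Norway", "Israel", "United States"}
CITIES = {"Eugene", "Seoul", "London", "Boston"}
TIMEZONES = {"CET", "GMT", "UTC", "EDT"}
# Short forms mapped to the canonical name used for counting.
ALIASES = {"UK": "United Kingdom", "USA": "United States"}


def build_pattern(names):
    """Compile one regex matching any name as a whole word, longest first."""
    ordered = sorted(names, key=len, reverse=True)
    return re.compile(r"\b(?:" + "|".join(map(re.escape, ordered)) + r")\b")


PATTERNS = {
    "country": build_pattern(COUNTRIES | set(ALIASES)),
    "city": build_pattern(CITIES),
    "timezone": build_pattern(TIMEZONES),
}


def extract_places(text):
    """Return the sorted unique matches per category for one answer."""
    found = {}
    for kind, pattern in PATTERNS.items():
        hits = {ALIASES.get(m, m) for m in pattern.findall(text)}
        found[kind] = sorted(hits)
    return found


answers = [
    "Denmark, CET",
    "Location is Devon, England, GMT time zone",
    "For the entire May I will be in London, UK (GMT+1).",
]
rows = [extract_places(a) for a in answers]
# Count how many answers mention each country.
country_counts = Counter(c for row in rows for c in row["country"])
print(rows)
print(country_counts)
```

Matching is case-sensitive and limited to whole words, so a place missing from the lists (Devon in the second answer) is simply not found. That is the main limitation of a pure lookup approach.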

===========Replication based on the NLTK answer=============

Running the same code as Alecxe, I get

Traceback (most recent call last):
  File "E:\SBTF\ntlk_test.py", line 19, in <module>
    tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]
  File "C:\Python27\ArcGIS10.4\lib\site-packages\nltk\tag\__init__.py", line 110, in pos_tag
    tagger = PerceptronTagger()
  File "C:\Python27\ArcGIS10.4\lib\site-packages\nltk\tag\perceptron.py", line 141, in __init__
    self.load(AP_MODEL_LOC)
  File "C:\Python27\ArcGIS10.4\lib\site-packages\nltk\tag\perceptron.py", line 209, in load
    self.model.weights, self.tagdict, self.classes = load(loc)
  File "C:\Python27\ArcGIS10.4\lib\site-packages\nltk\data.py", line 801, in load
    opened_resource = _open(resource_url)
  File "C:\Python27\ArcGIS10.4\lib\site-packages\nltk\data.py", line 924, in _open
    return urlopen(resource_url)
  File "C:\Python27\ArcGIS10.4\lib\urllib2.py", line 154, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Python27\ArcGIS10.4\lib\urllib2.py", line 431, in open
    response = self._open(req, data)
  File "C:\Python27\ArcGIS10.4\lib\urllib2.py", line 454, in _open
    'unknown_open', req)
  File "C:\Python27\ArcGIS10.4\lib\urllib2.py", line 409, in _call_chain
    result = func(*args)
  File "C:\Python27\ArcGIS10.4\lib\urllib2.py", line 1265, in unknown_open
    raise URLError('unknown url type: %s' % type)
URLError: <urlopen error unknown url type: c>
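
One plausible reading of `unknown url type: c` is that a Windows path beginning with `C:` was handed to a URL opener, which took the drive letter for a URL scheme. Whether NLTK does exactly this internally is an assumption here. The sketch below only demonstrates the parsing behaviour with the Python 3 standard library (the traceback above comes from Python 2.7), and the model path is hypothetical.

```python
from pathlib import PureWindowsPath
from urllib.parse import urlparse

# Hypothetical location of an NLTK model file on a Windows machine.
model_path = r"C:\nltk_data\taggers\averaged_perceptron_tagger.pickle"

# A URL parser treats everything before the first colon as the scheme,
# so the drive letter is reported as the scheme "c".
bare_scheme = urlparse(model_path).scheme

# Converting the path to a file URI makes the scheme explicit.
file_uri = PureWindowsPath(model_path).as_uri()
uri_scheme = urlparse(file_uri).scheme

print(bare_scheme, uri_scheme)
```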

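I would use what natural language processing and nltk have to offer to extract entities.

The example (mostly based on this gist) tokenizes each line from a file, splits it into chunks, and recursively looks for the NE (named entity) label in each chunk. More explanation here: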

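import nltk

# sent_tokenize, pos_tag and ne_chunk_sents rely on NLTK data packages
# that have to be installed first with nltk.download().

def extract_entity_names(t):
    """Recursively collect the text of every subtree labelled 'NE'."""
    entity_names = []

    # Only Tree nodes have a label; leaves are plain (word, tag) tuples.
    if hasattr(t, 'label') and t.label:
        if t.label() == 'NE':
            # Join the words of the entity, dropping their POS tags.
            entity_names.append(' '.join([child[0] for child in t]))
        else:
            for child in t:
                entity_names.extend(extract_entity_names(child))

    return entity_names

with open('sample.txt', 'r') as f:
    for line in f:
        # sentence split -> word tokens -> POS tags -> named-entity chunks
        sentences = nltk.sent_tokenize(line)
        tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
        tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]
        # binary=True marks every entity as 'NE' instead of PERSON, GPE, etc.
        chunked_sentences = nltk.ne_chunk_sents(tagged_sentences, binary=True)

        entities = []
        for tree in chunked_sentences:
            entities.extend(extract_entity_names(tree))

        print(entities)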

For sample.txt containing:

Denmark, CET
Location is Devon, England, GMT time zone
Australia. Australian Eastern Standard Time. +10h UTC.
My location is Eugene, Oregon for most of the year or in Seoul, South Korea depending on school holidays. My primary time zone is the Pacific time zone.
For the entire May I will be in London, United Kingdom (GMT+1). For the entire June I will be in either Norway (GMT+2) or Israel (GMT+3) with limited internet access. For the entire July and August I will be in London, United Kingdom (GMT+1). And then from September, 2015, I will be in Boston, United States (EDT)

it prints:

['Denmark', 'CET']
['Location', 'Devon', 'England', 'GMT']
['Australia', 'Australian Eastern Standard Time']
['Eugene', 'Oregon', 'Seoul', 'South Korea', 'Pacific']
['London', 'United Kingdom', 'Norway', 'Israel', 'London', 'United Kingdom', 'Boston', 'United States', 'EDT']

The output is not ideal, but it might be a good start for you.
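
One gap in the printed output is that numeric offsets such as `GMT+1` or `+10h UTC` do not appear as entities. A small regular-expression pass could complement the NLTK result. This is a minimal sketch: the two patterns and the helper name `extract_utc_offsets` are assumptions based only on the formats in the sample answers, and GMT and UTC are treated as the same reference for simplicity.

```python
import re

# Assumed formats, taken from the sample answers: "GMT+1", "GMT + 2",
# "+10h UTC". Other spellings would need extra patterns.
OFFSET_AFTER = re.compile(r"\b(GMT|UTC)\s*([+-])\s*(\d{1,2})\b")
OFFSET_BEFORE = re.compile(r"([+-])\s*(\d{1,2})\s*h?\s*(GMT|UTC)\b")


def extract_utc_offsets(text):
    """Return signed hour offsets mentioned next to GMT or UTC."""
    offsets = []
    # Offsets written after the zone name, e.g. "GMT+2".
    for _, sign, hours in OFFSET_AFTER.findall(text):
        offsets.append(int(sign + hours))
    # Offsets written before the zone name, e.g. "+10h UTC".
    for sign, hours, _ in OFFSET_BEFORE.findall(text):
        offsets.append(int(sign + hours))
    return offsets


print(extract_utc_offsets("Australia. Australian Eastern Standard Time. +10h UTC."))
print(extract_utc_offsets("either Norway (GMT+2) or Israel (GMT+3)"))
```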
