
NLP - information extraction in Python (spaCy)

I am attempting to extract the following tabular information from paragraphs like the ones in text below:

 women_ran men_ran kids_ran walked
         1       2        1      3
         2       4        3      1
         3       6        5      2

text = ["On Tuesday, one women ran on the street while 2 men ran and 1 child ran on the sidewalk. Also, there were 3 people walking.", "One person was walking yesterday, but there were 2 women running as well as 4 men and 3 kids running.", "The other day, there were three women running and also 6 men and 5 kids running on the sidewalk. Also, there were 2 people walking in the park."]

I am using Python's spaCy as my NLP library. I am new to NLP work and am hoping for some guidance on the best way to extract this tabular information from such sentences.

If it were simply a matter of identifying whether there were individuals running or walking, I would just use sklearn to fit a classification model, but the information that I need to extract is more granular than that (I am trying to retrieve subcategories and a value for each). Any guidance would be greatly appreciated.

You'll want to use the dependency parse for this. You can see a visualisation of your example sentence using the displaCy visualiser.

You could implement the rules you need in a few different ways, much like how there are always multiple ways to write an XPath query, a DOM selector, etc.

Something like this should work:

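import spacy

# Assumes the small English model is installed: python -m spacy download en_core_web_sm
# (the 'en' shortcut from older spaCy releases no longer works in spaCy 3).
nlp = spacy.load('en_core_web_sm')
docs = [nlp(t) for t in text]
for i, doc in enumerate(docs):
    for j, sent in enumerate(doc.sents):
        # Nominal subjects are the candidates for "who performed the action"
        subjects = [w for w in sent if w.dep_ == 'nsubj']
        for subject in subjects:
            # Numeric modifiers to the left of the subject, e.g. "2" in "2 men"
            numbers = [w for w in subject.lefts if w.dep_ == 'nummod']
            if len(numbers) == 1:
                print('document.sentence: {}.{}, subject: {}, action: {}, numbers: {}'.format(i, j, subject.text, subject.head.text, numbers[0].text))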

For your examples in text you should get something like the following (the exact lines depend on the model version, since the parse can differ):

document.sentence: 0.0, subject: men, action: ran, numbers: 2
document.sentence: 0.0, subject: child, action: ran, numbers: 1
document.sentence: 0.1, subject: people, action: walking, numbers: 3
document.sentence: 1.0, subject: person, action: walking, numbers: One
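
To get from those printed lines to the table in the question, one option is to collect the matches as tuples instead of printing them, then normalise and pivot them. The following is only a minimal sketch in plain Python: the lookup tables (NUMBER_WORDS, SUBJECT_GROUPS, ACTION_GROUPS) and the helper names are hypothetical and not part of spaCy, and the records list is typed in by hand from the output above rather than produced by the parser.

```python
from collections import defaultdict

# Hypothetical lookup tables (assumptions for this sketch); extend as needed.
NUMBER_WORDS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
                "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10}
SUBJECT_GROUPS = {"woman": "women", "women": "women",
                  "man": "men", "men": "men",
                  "child": "kids", "children": "kids",
                  "kid": "kids", "kids": "kids"}
ACTION_GROUPS = {"run": "ran", "ran": "ran", "running": "ran",
                 "walk": "walked", "walked": "walked", "walking": "walked"}


def to_int(token_text):
    """Convert '3' or 'three' to an int; return None if unrecognised."""
    lowered = token_text.lower()
    if lowered.isdigit():
        return int(lowered)
    return NUMBER_WORDS.get(lowered)


def column_for(subject, action):
    """Map a (subject, action) pair to a column name such as 'men_ran'."""
    act = ACTION_GROUPS.get(action.lower())
    if act is None:
        return None
    if act == "walked":
        # The target table has a single 'walked' column for everyone.
        return "walked"
    group = SUBJECT_GROUPS.get(subject.lower())
    return "{}_{}".format(group, act) if group else None


def build_table(records):
    """Pivot (doc_index, subject, action, number_text) tuples into rows."""
    rows = defaultdict(dict)
    for doc_index, subject, action, number_text in records:
        column = column_for(subject, action)
        value = to_int(number_text)
        # Skip anything the lookup tables do not cover.
        if column is not None and value is not None:
            rows[doc_index][column] = value
    return dict(rows)


# Hand-copied from the printed output above, one tuple per line.
records = [(0, "men", "ran", "2"), (0, "child", "ran", "1"),
           (0, "people", "walking", "3"), (1, "person", "walking", "One")]
table = build_table(records)
print(table)
```

Each inner dict is one row of the table, keyed by document index. Columns that the dependency rule did not find (for example women_ran here) are simply absent, which makes it easy to see where the rules need extending.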
