
Extracting a person's age from unstructured text in Python

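I have a dataset of administrative filings that include short biographies. I am trying to extract people's ages using Python and some pattern matching. Some examples of sentences are:

  • “Mr Bond, 67, is an engineer in the UK”
  • “Amanda B. Bynes, 34, is an actress”
  • “Peter Parker (45) will be our next administrator”
  • “Mr. Dylan is 46 years old.”
  • “Steve Jones, Age:32,”

These are some of the patterns I have identified in the dataset. There are other patterns as well, but I have not run into them yet and I am not sure how to find them. I wrote the following code, which works well but is quite inefficient, so running it on the whole dataset takes too long.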

#Create a search list of expressions that might come right before an age instance
age_search_list = [" " + last_name.lower().strip() + ", age ",
" " + clean_sec_last_name.lower().strip() + " age ",
last_name.lower().strip() + " age ",
full_name.lower().strip() + ", age ",
full_name.lower().strip() + ", ",
" " + last_name.lower() + ", ",
" " + last_name.lower().strip()  + r" \(",
" " + last_name.lower().strip()  + " is "]

#for each element in our search list
for element in age_search_list:
    print("Searching: ",element)

    # retrieve all the instances where we might have an age
    for age_biography_instance in re.finditer(element,souptext.lower()):

        #extract the next four characters
        #start right after the match; len(element) would overcount escaped patterns such as " \("
        age_instance_start = age_biography_instance.end()
        age_instance_end = age_instance_start + 4
        age_string = souptext[age_instance_start:age_instance_end]

        #extract what should be the age
        potential_age = age_string[:-2]

        #extract the next two characters as a security check (i.e. age should be followed by comma, or dot, etc.)
        age_security_check = age_string[-2:]
        age_security_check_list = [", ",". ",") "," y"]

        if age_security_check in age_security_check_list:
            print("Potential age instance found for ",full_name,": ",potential_age)

            #check that what we extracted is an age, convert it to birth year
            try:
                potential_age = int(potential_age)
                print("Potential age detected: ",potential_age)
                if 18 < int(potential_age) < 100:
                    sec_birth_year = int(filing_year) - int(potential_age)
                    print("Filing year was: ",filing_year)
                    print("Estimated birth year for ",clean_sec_full_name,": ",sec_birth_year)
                    #Now, we save it in the main dataframe
                    new_sec_parser = pd.DataFrame([[clean_sec_full_name,"0","0",sec_birth_year,""]],columns = ['Name','Male','Female','Birth','Suffix'])
                    df_sec_parser = pd.concat([df_sec_parser,new_sec_parser])

            except ValueError:
                print("Problem with extracted age ",potential_age)

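I have a few questions:

  • Is there a more efficient way to extract this information?
  • Should I use a regex instead?
  • My text documents are very long and I have lots of them. Can I do one search for all the items at once?
  • What would be a strategy to detect other patterns in the dataset?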

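Some sentences extracted from the dataset:

  • “Equity awards granted to Mr. Love in 2010 represented 48% of his total compensation”
  • “George F. Rubin(14)(15) Age 68 Trustee since: 1997.”
  • “INDRA K. NOOYI, 56, has been PepsiCos Chief Executive Officer (CEO) since 2006”
  • “Mr. Lovallo, 47, was appointed Treasurer in 2011.”
  • “Mr. Charles Baker, 79, is a business advisor to biotechnology companies.”
  • “Mr. Botein, age 43, has been a member of our Board since our formation.”

import re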

x =["Mr Bond, 67, is an engineer in the UK"
,"Amanda B. Bynes, 34, is an actress"
,"Peter Parker (45) will be our next administrator"
,"Mr. Dylan is 46 years old."
,"Steve Jones, Age:32,"]

[re.findall(r'\d{1,3}', i)[0] for i in x] # ['67', '34', '45', '46', '32']

This works for all the cases you provided: https://repl.it/repls/NotableAncientBackground

import re 

sentences = ["Mr Bond, 67, is an engineer in the UK",
    "Amanda B. Bynes, 34, is an actress",
    "Peter Parker (45) will be our next administrator",
    "Mr. Dylan is 46 years old.",
    "Steve Jones, Age:32,",
    "Equity awards granted to Mr. Love in 2010 represented 48% of his total compensation",
    "George F. Rubin(14)(15) Age 68 Trustee since: 1997.",
    "INDRA K. NOOYI, 56, has been PepsiCos Chief Executive Officer (CEO) since 2006",
    "Mr. Lovallo, 47, was appointed Treasurer in 2011.",
    "Mr. Charles Baker, 79, is a business advisor to biotechnology companies.",
    "Mr. Botein, age 43, has been a member of our Board since our formation."]
for sentence in sentences:
    # "Age:32" or "Age 68" (case-sensitive)
    age = re.findall(r'Age[\:\s](\d{1,3})', sentence)
    # a 1-3 digit number after a space, followed by an optional comma and a space
    age.extend(re.findall(r' (\d{1,3}),? ', sentence))
    # fall back to a number in parentheses only if nothing else matched
    if len(age) == 0:
        age = re.findall(r'\((\d{1,3})\)', sentence)
    print(sentence + " --- AGE: " + str(set(age)))

This returns:

Mr Bond, 67, is an engineer in the UK --- AGE: {'67'}
Amanda B. Bynes, 34, is an actress --- AGE: {'34'}
Peter Parker (45) will be our next administrator --- AGE: {'45'}
Mr. Dylan is 46 years old. --- AGE: {'46'}
Steve Jones, Age:32, --- AGE: {'32'}
Equity awards granted to Mr. Love in 2010 represented 48% of his total compensation --- AGE: set()
George F. Rubin(14)(15) Age 68 Trustee since: 1997. --- AGE: {'68'}
INDRA K. NOOYI, 56, has been PepsiCos Chief Executive Officer (CEO) since 2006 --- AGE: {'56'}
Mr. Lovallo, 47, was appointed Treasurer in 2011. --- AGE: {'47'}
Mr. Charles Baker, 79, is a business advisor to biotechnology companies. --- AGE: {'79'}
Mr. Botein, age 43, has been a member of our Board since our formation. --- AGE: {'43'}
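
On the question of searching efficiently across many long documents, one option is to precompile the patterns once and try them in priority order, so that an explicit "Age" keyword wins over a bare number in parentheses. The following is only a sketch under assumptions: the names `AGE_PATTERNS` and `find_age` are hypothetical, the patterns cover just the shapes seen in the example sentences, and the 18 to 100 plausibility range is borrowed from the code in the question.

```python
import re

# Hypothetical patterns, precompiled once and tried from most to least specific.
AGE_PATTERNS = [
    re.compile(r"\baged?\s*:?\s*(\d{1,3})\b", re.IGNORECASE),        # "Age:32", "Age 68", "age 43"
    re.compile(r"\bis\s+(\d{1,3})\s+years?\s+old\b", re.IGNORECASE),  # "is 46 years old"
    re.compile(r",\s*(\d{1,3})\s*,"),                                 # ", 67,"
    re.compile(r"\((\d{1,3})\)"),                                     # "(45)"
]


def find_age(sentence, low=18, high=100):
    """Return the first plausible age found in `sentence`, or None."""
    for pattern in AGE_PATTERNS:
        for match in pattern.finditer(sentence):
            age = int(match.group(1))
            # Same plausibility check as in the question: 18 < age < 100.
            if low < age < high:
                return age
    return None


print(find_age("George F. Rubin(14)(15) Age 68 Trustee since: 1997."))  # 68
print(find_age("Equity awards granted to Mr. Love in 2010 represented 48% of his total compensation"))  # None
```

Trying the patterns one after another, rather than joining them into a single alternation, matters for the "George F. Rubin(14)(15)" sentence: a single alternation would stop at the leftmost match, which is the footnote marker "(14)".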

A simple way to find a person's age in a sentence is to extract a 2-digit number:

import re

sentence = 'Steve Jones, Age: 32,'
print(re.findall(r"\b\d{2}\b", sentence)[0])

# output: 32

If you do not want to match a number that is followed by %, and you want the number to start at a word boundary, you can do the following:

sentence = 'Equity awards granted to Mr. Love in 2010 represented 48% of his total compensation'

match = re.findall(r"\b\d{2}(?!%)[^\d]", sentence)

if match:
    print(match[0][:2])
else:
    print('no match')

# output: no match

This also works for the previous sentence.
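
One limitation of the pattern above is that `[^\d]` consumes a character, so a two-digit number at the very end of the string is missed. A possible variant, sketched here with the hypothetical names `TWO_DIGIT_RE` and `first_two_digit_number`, uses a second word boundary plus the same negative lookahead instead:

```python
import re

# Exactly two digits delimited by word boundaries, and not followed by "%".
TWO_DIGIT_RE = re.compile(r"\b\d{2}\b(?!%)")


def first_two_digit_number(sentence):
    """Return the first standalone 2-digit number as an int, or None."""
    match = TWO_DIGIT_RE.search(sentence)
    return int(match.group()) if match else None


print(first_two_digit_number("Steve Jones, Age: 32"))  # 32
print(first_two_digit_number("Equity awards granted to Mr. Love in 2010 represented 48% of his total compensation"))  # None
```

Like the original, this still returns whichever two-digit number comes first, so footnote markers such as "(14)" would be picked up ahead of the age.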

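Judging from the examples you have given, here is the strategy I propose:

Step 1:

Check whether the sentence contains "Age". Regex: (?i)(Age).*?(\d+)

This takes care of examples like:

- George F. Rubin(14)(15) Age 68 Trustee since: 1997.

- Steve Jones, Age:32,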

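Step 2:

- Check whether the sentence contains a "%" sign; if it does, remove the number attached to the sign.

- If the sentence does not contain "Age", write a regex to remove all 4-digit numbers. Example regex: \b\d{4}\b

- Then see whether any digits are left in the sentence; that will be your age.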

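Examples covered this way:

- "Equity awards granted to Mr. Love in 2010 represented 48% of his total compensation" - no numbers are left

- "INDRA K. NOOYI, 56, has been PepsiCos Chief Executive Officer (CEO) since 2006" - only 56 is left

- "Mr. Lovallo, 47, was appointed Treasurer in 2011." - only 47 is left

This may not be a complete answer, since you may have other patterns as well. But since you asked for a strategy, and given the examples you posted, this works in all of those cases.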
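
The two steps can be sketched in Python roughly as follows. This is an illustration under assumptions rather than a tested solution: the function name `age_by_elimination` and the regex constant names are hypothetical, and the word boundaries around "age" are an addition so that words such as "manager" do not trigger step 1.

```python
import re

# Step 1 pattern: the keyword "age" followed (lazily) by the first run of digits.
AGE_KEYWORD_RE = re.compile(r"\bage\b.*?(\d+)", re.IGNORECASE)
# Step 2 patterns: numbers attached to a percent sign, and 4-digit numbers (years).
PERCENT_RE = re.compile(r"\d+(?:\.\d+)?\s*%")
FOUR_DIGIT_RE = re.compile(r"\b\d{4}\b")
NUMBER_RE = re.compile(r"\d+")


def age_by_elimination(sentence):
    """Return an age candidate as an int, or None if no number survives."""
    # Step 1: an explicit "Age" keyword takes priority.
    keyword_match = AGE_KEYWORD_RE.search(sentence)
    if keyword_match:
        return int(keyword_match.group(1))
    # Step 2: drop percentages and years, then take whatever number is left.
    remaining = PERCENT_RE.sub(" ", sentence)
    remaining = FOUR_DIGIT_RE.sub(" ", remaining)
    numbers = NUMBER_RE.findall(remaining)
    return int(numbers[0]) if numbers else None


print(age_by_elimination("Mr. Lovallo, 47, was appointed Treasurer in 2011."))  # 47
print(age_by_elimination("Equity awards granted to Mr. Love in 2010 represented 48% of his total compensation"))  # None
```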

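Since your text has to be processed, not just pattern matched, the right approach is to use one of the many NLP tools available.

What you are after is named entity recognition (NER), which is usually done with machine learning models. NER tries to identify a predetermined set of entity types in text. Examples include locations, dates, organizations, and person names.

While not 100% accurate, this is much more accurate than simple pattern matching (especially for English), because it relies on information beyond patterns, such as part-of-speech (POS) tags, dependency parsing, and so on.

Take a look at the results I got for the phrases you provided, using the AllenNLP online tool (with the fine-grained NER model):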

  • “Mr Bond, 67, is an engineer in the UK”
  • “Amanda B. Bynes, 34, is an actress”
  • “Peter Parker (45) will be our next administrator”
  • “Mr. Dylan is 46 years old.”
  • “Steve Jones, Age:32,”

[Demo output for each sentence, with the recognized entities highlighted, was shown here.]

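Note that the last one is wrong. As I said, it is not 100% accurate, but it is easy to use.

The biggest advantage of this approach: you do not need to write a special pattern for each of the millions of possibilities.

Best of all, you can integrate it into your Python code: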

pip install allennlp

and:

from allennlp.predictors import Predictor
al = Predictor.from_path("https://s3-us-west-2.amazonaws.com/allennlp/models/fine-grained-ner-model-elmo-2018.12.21.tar.gz")
al.predict("Your sentence with date here")

Then, look through the resulting dict for "Date" entities.
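
How the dict is read depends on the model's output format. The sketch below assumes that the prediction contains parallel "words" and "tags" lists and that the tags follow the BIOUL scheme (for example "U-DATE" for a single-token entity, or "B-DATE", "I-DATE", "L-DATE" for a multi-token one); check this against what your installed version actually returns. The helper name `entities_from_tags` is hypothetical, and the sample dict is hand-written for illustration, not real model output.

```python
def entities_from_tags(words, tags):
    """Group BIOUL-tagged tokens into (text, label) pairs."""
    entities = []
    current_words, current_label = [], None
    for word, tag in zip(words, tags):
        if tag == "O":  # outside any entity
            current_words, current_label = [], None
            continue
        prefix, label = tag.split("-", 1)
        if prefix == "U":  # single-token entity
            entities.append((word, label))
            current_words, current_label = [], None
        elif prefix == "B":  # first token of a multi-token entity
            current_words, current_label = [word], label
        elif prefix in ("I", "L") and label == current_label:
            current_words.append(word)
            if prefix == "L":  # last token closes the entity
                entities.append((" ".join(current_words), label))
                current_words, current_label = [], None
    return entities


# Hand-written example in the assumed output shape (not real model output).
prediction = {
    "words": ["Mr.", "Dylan", "is", "46", "years", "old", "."],
    "tags": ["O", "U-PERSON", "O", "B-DATE", "I-DATE", "L-DATE", "O"],
}
dates = [text for text, label in entities_from_tags(prediction["words"], prediction["tags"])
         if label == "DATE"]
print(dates)  # ['46 years old']
```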

The same goes for spaCy:

!python3 -m spacy download en_core_web_lg
import spacy
sp_lg = spacy.load('en_core_web_lg')
{(ent.text.strip(), ent.label_) for ent in sp_lg("Your sentence with date here").ents}

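(However, I have had some bad experiences with wrong predictions there, even though it is considered better.)

For more information, read this interesting article on Medium: https://medium.com/@b.terryjack/nlp-pretrained-named-entity-recognition-7caa5cd28d7b

Instead of regular expressions, you can also use spaCy pattern matching. The patterns below work, but you may need to add some extra rules to make sure you do not pick up percentages and monetary values.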

import spacy
nlp = spacy.load('en_core_web_sm')
from spacy.matcher import Matcher 

age_patterns = [
# e.g Steve Jones, Age: 32,
[{"LOWER": "aged"}, {"IS_PUNCT": True,"OP":"?"},{"LIKE_NUM": True}],
[{"LOWER": "age"}, {"IS_PUNCT": True,"OP":"?"}, {"LIKE_NUM": True}],
# e.g "Peter Parker (45) will be our next administrator" OR "Amanda B. Bynes, 34, is an actress"
[{'POS':'PROPN'},{"IS_PUNCT": True}, {"LIKE_NUM": True}, {"IS_PUNCT": True}],
# e.g "Mr. Dylan is 46 years old."
[{"LIKE_NUM": True},{"IS_PUNCT": True,"OP":"*"},{"LEMMA": "year"}, {"IS_PUNCT": True,"OP":"*"},
 {"LEMMA": "old"},{"IS_ALPHA": True, "OP":"*"},{'POS':'PROPN',"OP":"*"},{'POS':'PROPN',"OP":"*"}  ]
]

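# `text` is the biography to scan; an example sentence from the question is used here
text = "Steve Jones, Age: 32,"
doc = nlp(text)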
matcher = Matcher(nlp.vocab) 
matcher.add("matching", age_patterns) 
matches = matcher(doc)

schemes = []
# each match is a (match_id, start, end) tuple of token indices
for _, start, end in matches:

    if doc[start].pos_=='DET':
        start = start+1

    # matched string
    span = str(doc[start:end])

    if (len(schemes)!=0) and (schemes[-1] in span):
        schemes[-1] = span
    else:
        schemes.append(span)
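
After the loop, `schemes` holds the matched spans as strings, so the number still has to be pulled out of each one. A minimal sketch, assuming the ages are written with digits (`LIKE_NUM` also matches spelled-out numbers such as "ten", which this ignores) and reusing the 18 to 100 plausibility range from the question; the function name `ages_from_spans` and the sample spans are hypothetical:

```python
import re


def ages_from_spans(spans, low=18, high=100):
    """Return the first plausible age found in each span, skipping spans with none."""
    ages = []
    for span in spans:
        # Standalone numbers of 1 to 3 digits; 4-digit years are not matched.
        for number in re.findall(r"\b\d{1,3}\b", span):
            age = int(number)
            if low < age < high:
                ages.append(age)
                break
    return ages


# Hand-written spans of the kind the matcher might return (illustrative only).
print(ages_from_spans(["Age: 32", "Parker (45)", "46 years old"]))  # [32, 45, 46]
```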
