Extracting a person's age from unstructured text in Python
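I have a dataset of administrative filings that include short biographies. I am trying to extract people's ages using Python and some pattern matching. Some examples of sentences are:
These are some of the patterns I have identified in the dataset. I should add that there are other patterns that I have not come across yet, and I am not sure how I would find them. I wrote the following code, which works fairly well but is very inefficient, so running it on the whole dataset would take too much time.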
import re
import pandas as pd

# last_name, full_name, clean_sec_last_name, clean_sec_full_name, souptext,
# filing_year and df_sec_parser are defined earlier in the script.

# Create a search list of expressions that might come right before an age instance
age_search_list = [" " + last_name.lower().strip() + ", age ",
                   " " + clean_sec_last_name.lower().strip() + " age ",
                   last_name.lower().strip() + " age ",
                   full_name.lower().strip() + ", age ",
                   full_name.lower().strip() + ", ",
                   " " + last_name.lower() + ", ",
                   # raw string so that \( reaches the regex engine unchanged
                   " " + last_name.lower().strip() + r" \(",
                   " " + last_name.lower().strip() + " is "]

# For each element in our search list
for element in age_search_list:
    print("Searching: ", element)
    # Retrieve all the instances where we might have an age
    for age_biography_instance in re.finditer(element, souptext.lower()):
        # Extract the next four characters
        age_biography_start = int(age_biography_instance.start())
        age_instance_start = age_biography_start + len(element)
        age_instance_end = age_instance_start + 4
        age_string = souptext[age_instance_start:age_instance_end]
        # Extract what should be the age
        potential_age = age_string[:-2]
        # Extract the next two characters as a security check
        # (i.e. age should be followed by comma, or dot, etc.)
        age_security_check = age_string[-2:]
        age_security_check_list = [", ", ". ", ") ", " y"]
        if age_security_check in age_security_check_list:
            print("Potential age instance found for ", full_name, ": ", potential_age)
            # Check that what we extracted is an age, convert it to birth year
            try:
                potential_age = int(potential_age)
                print("Potential age detected: ", potential_age)
                if 18 < int(potential_age) < 100:
                    sec_birth_year = int(filing_year) - int(potential_age)
                    print("Filing year was: ", filing_year)
                    print("Estimated birth year for ", clean_sec_full_name, ": ", sec_birth_year)
                    # Now, we save it in the main dataframe
                    new_sec_parser = pd.DataFrame([[clean_sec_full_name, "0", "0", sec_birth_year, ""]],
                                                  columns=['Name', 'Male', 'Female', 'Birth', 'Suffix'])
                    df_sec_parser = pd.concat([df_sec_parser, new_sec_parser])
            except ValueError:
                print("Problem with extracted age ", potential_age)
I have a few questions:
Some sentences extracted from the dataset:
import re

x = ["Mr Bond, 67, is an engineer in the UK",
     "Amanda B. Bynes, 34, is an actress",
     "Peter Parker (45) will be our next administrator",
     "Mr. Dylan is 46 years old.",
     "Steve Jones, Age:32,"]

[re.findall(r'\d{1,3}', i)[0] for i in x]  # ['67', '34', '45', '46', '32']
This works for all the cases you provided: https://repl.it/repls/NotableAncientBackground
import re

# Named `sentences` rather than `input` to avoid shadowing the built-in
sentences = ["Mr Bond, 67, is an engineer in the UK",
             "Amanda B. Bynes, 34, is an actress",
             "Peter Parker (45) will be our next administrator",
             "Mr. Dylan is 46 years old.",
             "Steve Jones, Age:32,",
             "Equity awards granted to Mr. Love in 2010 represented 48% of his total compensation",
             "George F. Rubin(14)(15) Age 68 Trustee since: 1997.",
             "INDRA K. NOOYI, 56, has been PepsiCos Chief Executive Officer (CEO) since 2006",
             "Mr. Lovallo, 47, was appointed Treasurer in 2011.",
             "Mr. Charles Baker, 79, is a business advisor to biotechnology companies.",
             "Mr. Botein, age 43, has been a member of our Board since our formation."]

for i in sentences:
    # "Age" followed by a colon or whitespace, then 1-3 digits
    age = re.findall(r'Age[:\s](\d{1,3})', i)
    # 1-3 digits surrounded by spaces, optionally followed by a comma
    age.extend(re.findall(r' (\d{1,3}),? ', i))
    # Fall back to 1-3 digits in parentheses only if nothing else matched
    if len(age) == 0:
        age = re.findall(r'\((\d{1,3})\)', i)
    print(i + " --- AGE: " + str(set(age)))
This returns:
Mr Bond, 67, is an engineer in the UK --- AGE: {'67'}
Amanda B. Bynes, 34, is an actress --- AGE: {'34'}
Peter Parker (45) will be our next administrator --- AGE: {'45'}
Mr. Dylan is 46 years old. --- AGE: {'46'}
Steve Jones, Age:32, --- AGE: {'32'}
Equity awards granted to Mr. Love in 2010 represented 48% of his total compensation --- AGE: set()
George F. Rubin(14)(15) Age 68 Trustee since: 1997. --- AGE: {'68'}
INDRA K. NOOYI, 56, has been PepsiCos Chief Executive Officer (CEO) since 2006 --- AGE: {'56'}
Mr. Lovallo, 47, was appointed Treasurer in 2011. --- AGE: {'47'}
Mr. Charles Baker, 79, is a business advisor to biotechnology companies. --- AGE: {'79'}
Mr. Botein, age 43, has been a member of our Board since our formation. --- AGE: {'43'}
A simple way to find a person's age in a sentence is to extract a 2-digit number:
import re

sentence = 'Steve Jones, Age: 32,'
print(re.findall(r"\b\d{2}\b", sentence)[0])
# output: 32
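If you do not want a % directly after the number, and you also want the number to start at a word boundary (for example after white space), you could do the following: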
sentence = 'Equity awards granted to Mr. Love in 2010 represented 48% of his total compensation'
match = re.findall(r"\b\d{2}(?!%)[^\d]", sentence)
if match:
    # Each match is two digits plus one trailing non-digit; keep the digits
    print(match[0][:2])
else:
    print('no match')
# output: no match
This also works for the previous sentence.
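The two snippets above can be folded into one helper. The following is only a sketch: the function name `find_age` and the `low`/`high` bounds are made up for illustration, and the bounds borrow the `18 < age < 100` check from the question. It also uses `(?!\d)` instead of `[^\d]`, so that a number at the very end of the string still matches, which is a small change from the pattern above.

```python
import re

# Two digits at a word boundary, not followed by "%" or by another digit.
AGE_RE = re.compile(r"\b(\d{2})(?!%)(?!\d)")

def find_age(sentence, low=18, high=100):
    """Return the first two-digit number that looks like an age, else None.

    `low` and `high` are exclusive bounds (an assumption taken from the
    plausibility check in the question).
    """
    for m in AGE_RE.finditer(sentence):
        age = int(m.group(1))
        if low < age < high:
            return age
    return None

print(find_age("Steve Jones, Age: 32,"))
print(find_age("George F. Rubin(14)(15) Age 68 Trustee since: 1997."))
```

The range check is what lets the second sentence skip the footnote markers `(14)` and `(15)` and land on `68`.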
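Judging from the examples you have given, here is the strategy I would propose:
Step 1:
Check whether the sentence contains "Age", using the regex (?i)(Age).*?(\\d+) (backslashes are doubled here as they would be in a non-raw string literal).
The above takes care of examples like these:
- George F. Rubin(14)(15) Age 68 Trustee since: 1997.
- Steve Jones, Age:32,
Step 2:
- Check whether the sentence contains a "%" sign; if it does, remove the number that carries the sign.
- If "Age" is not in the sentence, write a regex to remove all 4-digit numbers. Example regex: \\b\\d{4}\\b
- Then see whether any digits remain in the sentence; those would be your age.
Examples covered:
- "Equity awards granted to Mr. Love in 2010 represented 48% of his total compensation" - no numbers will be left
- "INDRA K. NOOYI, 56, has been PepsiCos Chief Executive Officer (CEO) since 2006" - only 56 will be left
- "Mr. Lovallo, 47, was appointed Treasurer in 2011." - only 47 will be left
This may not be a complete answer, since you may have other patterns as well. But since you asked for a strategy, this one works for all the examples you posted.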
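The two steps can be sketched in Python roughly as follows. This is one possible reading of the strategy, not a definitive implementation: the function name `age_by_strategy` is hypothetical, and the word boundaries around "age" are an addition of this sketch so that words such as "manager" do not trigger step 1.

```python
import re

# Step 1: the word "age" (any case), then the first number after it.
AGE_KEYWORD_RE = re.compile(r"\bage\b.*?(\d+)", re.IGNORECASE)
# Step 2a: a number immediately followed by a percent sign.
PERCENT_RE = re.compile(r"\d+(?:\.\d+)?\s*%")
# Step 2b: standalone four-digit numbers (most likely years).
FOUR_DIGIT_RE = re.compile(r"\b\d{4}\b")
# Whatever standalone number is left over.
NUMBER_RE = re.compile(r"\b\d+\b")

def age_by_strategy(sentence):
    """Return the age found by the two-step strategy, or None."""
    m = AGE_KEYWORD_RE.search(sentence)
    if m:
        return int(m.group(1))
    cleaned = PERCENT_RE.sub(" ", sentence)
    cleaned = FOUR_DIGIT_RE.sub(" ", cleaned)
    remaining = NUMBER_RE.findall(cleaned)
    return int(remaining[0]) if remaining else None

for s in ["George F. Rubin(14)(15) Age 68 Trustee since: 1997.",
          "Equity awards granted to Mr. Love in 2010 represented 48% of his total compensation",
          "Mr. Lovallo, 47, was appointed Treasurer in 2011."]:
    print(s, "->", age_by_strategy(s))
```

If several numbers survive step 2, this sketch simply takes the first one; a plausibility range check would be a natural extra filter.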
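Since your text has to be processed, and not only pattern matched, the correct approach is to use one of the many NLP tools available.
Your aim is to use Named Entity Recognition (NER), which is usually done with machine learning models. NER tries to recognize a predetermined set of entity types in a text. Examples are: locations, dates, organizations and person names.
While not 100% precise, this is much more precise than simple pattern matching (especially for English), since it relies on other information besides patterns, such as part-of-speech (POS) tags, dependency parsing, etc.
Take a look at the results I obtained for the phrases you provided, using the Allen NLP online tool (with the fine-grained NER model):
Notice that the last one is wrong. As I said, it is not 100% precise, but it is easy to use.
The big advantage of this approach: you do not have to write a special pattern for every one of the millions of possibilities.
The best part: you can integrate it into your Python code: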
pip install allennlp
And:
from allennlp.predictors import Predictor
al = Predictor.from_path("https://s3-us-west-2.amazonaws.com/allennlp/models/fine-grained-ner-model-elmo-2018.12.21.tar.gz")
al.predict("Your sentence with date here")
Then, look at the resulting dict for "Date" entities.
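Reading the entities out of that dict might look like the sketch below. It rests on assumptions not stated in this answer: that the result has parallel `"words"` and `"tags"` lists and that the tags use the BILOU scheme (`B-`, `I-`, `L-`, `U-`, `O`) with a `DATE` label. The helper name `entities_with_label` is hypothetical, and the sample dict is written by hand for illustration, not real model output.

```python
def entities_with_label(result, label="DATE"):
    """Collect entity strings carrying `label` from a tagger-style result.

    Assumes result["words"] and result["tags"] are parallel lists and that
    tags follow the BILOU scheme, e.g. "B-DATE", "I-DATE", "L-DATE", "U-DATE".
    """
    entities, current = [], []
    for word, tag in zip(result["words"], result["tags"]):
        if tag == "O" or not tag.endswith("-" + label):
            current = []          # outside the label we care about
            continue
        prefix = tag.split("-", 1)[0]
        if prefix == "U":         # single-token entity
            entities.append(word)
        elif prefix == "B":       # first token of a multi-token entity
            current = [word]
        elif prefix == "I":       # inner token
            current.append(word)
        elif prefix == "L":       # last token: close the entity
            current.append(word)
            entities.append(" ".join(current))
            current = []
    return entities

# Hand-written stand-in for the output of al.predict(...)
sample = {
    "words": ["Mr.", "Dylan", "is", "46", "years", "old", "."],
    "tags":  ["O", "U-PERSON", "O", "B-DATE", "I-DATE", "L-DATE", "O"],
}
print(entities_with_label(sample))
```

The age itself would then still have to be parsed out of a string such as "46 years old".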
The same goes for spaCy:
!python3 -m spacy download en_core_web_lg
import spacy
sp_lg = spacy.load('en_core_web_lg')
{(ent.text.strip(), ent.label_) for ent in sp_lg("Your sentence with date here").ents}
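(However, I have had some bad experiences with wrong predictions there - although it is considered better.)
For more information, read this interesting article on Medium: https://medium.com/@b.terryjack/nlp-pretrained-named-entity-recognition-7caa5cd28d7b
Instead of regular expressions, you can also use spaCy's pattern matching. The patterns below work, but you may need to add some extra rules to make sure you do not pick up percentages and monetary values.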
import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_sm')

age_patterns = [
    # e.g. Steve Jones, Age: 32,
    [{"LOWER": "aged"}, {"IS_PUNCT": True, "OP": "?"}, {"LIKE_NUM": True}],
    [{"LOWER": "age"}, {"IS_PUNCT": True, "OP": "?"}, {"LIKE_NUM": True}],
    # e.g. "Peter Parker (45) will be our next administrator" OR "Amanda B. Bynes, 34, is an actress"
    [{'POS': 'PROPN'}, {"IS_PUNCT": True}, {"LIKE_NUM": True}, {"IS_PUNCT": True}],
    # e.g. "Mr. Dylan is 46 years old."
    [{"LIKE_NUM": True}, {"IS_PUNCT": True, "OP": "*"}, {"LEMMA": "year"}, {"IS_PUNCT": True, "OP": "*"},
     {"LEMMA": "old"}, {"IS_ALPHA": True, "OP": "*"}, {'POS': 'PROPN', "OP": "*"}, {'POS': 'PROPN', "OP": "*"}]
]

# `text` is the biography string to search; it must be defined beforehand
doc = nlp(text)
matcher = Matcher(nlp.vocab)
matcher.add("matching", age_patterns)
matches = matcher(doc)

schemes = []
# Each match is a (match_id, start, end) tuple of token indices
for match_id, start, end in matches:
    # Skip a leading determiner
    if doc[start].pos_ == 'DET':
        start = start + 1
    # Matched string
    span = str(doc[start:end])
    # If this span extends the previous one, keep only the longer span
    if (len(schemes) != 0) and (schemes[-1] in span):
        schemes[-1] = span
    else:
        schemes.append(span)
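The matched spans are still strings such as "Age: 32" or "Bynes, 34,", so a small post-processing step is needed to get integers. The sketch below is one way to do that and to drop spans that contain a percent or currency sign. The helper name `ages_from_spans` and the 18 to 100 bounds are assumptions of this sketch (the bounds come from the question), and the sample list is hand-written rather than real Matcher output. Numbers spelled out as words, which `LIKE_NUM` also accepts, are not handled.

```python
import re

def ages_from_spans(spans, low=18, high=100):
    """Turn matched span strings into integer ages.

    Spans containing "%" or "$" are skipped, and only numbers strictly
    between `low` and `high` are kept.
    """
    ages = []
    for span in spans:
        if "%" in span or "$" in span:
            continue
        for number in re.findall(r"\b\d{1,3}\b", span):
            if low < int(number) < high:
                ages.append(int(number))
    return ages

# Hand-written examples of what `schemes` might contain
print(ages_from_spans(["Age: 32", "Bynes, 34,", "Love, 48%", "46 years old"]))
```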