Extracting meaningful information from a text column using Python
I have a table with two columns. I need to extract the meaningful information from the Notes column,
i.e. I need to extract the date into one column and the information after the date (the status) into another column, alongside the ID.
Notes, ID
Movie Date 05-28-2018 Passed, 1010
MTD loan slip dated 8-10-14 the Issued, 1111
Max over date 10-2-15 and repaired, 11232
Expected output:
Notes                                ID     Date       Status
Movie Date 05-28-2018 Passed         1010   5/28/2018  Passed
loan slip dated 8-10-14 Issued       1111   8/10/2014  Issued
Max over date 10-2-15 and repaired   11232  10/2/2015  repaired
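Here is my code:
import pandas as pd
# engine is assumed to be an existing database connection
df = pd.read_sql('select * from <table>', engine)
searchfor = [' dated', ' date', ' Date', ' Dated']
# keep only the rows whose Notes contain one of the date keywords
df2 = df[df['Notes'].str.contains('|'.join(searchfor), na=False)]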
..................
I appreciate your help on this. Thank you.
I would use some loops for that.
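Example:
import pandas as pd
df = pd.read_csv("data.csv")
searchforstatus = [' Passed', ' Issued', ' repaired']
# check each row's Notes for one of the known status keywords
for idx, row in df.iterrows():
    for c in searchforstatus:
        if c in row['Notes']:
            # strip the leading space that is only there for matching
            df.loc[idx, 'Status'] = c.strip()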
Result:
   Notes                                    ID     Status
0  Movie Date 05-28-2018 Passed             1010   Passed
1  MTD loan slip dated 8-10-14 the Issued   1111   Issued
2  Max over date 10-2-15 and repaired       11232  repaired
The data that I used can be found here: https://files.fm/u/npaceyd6#_
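The same status lookup can probably be done without an explicit loop. Here is a minimal sketch using `Series.str.extract`, assuming the set of status words is known in advance. The inline DataFrame just copies the three rows from the question so the snippet runs on its own; the names `statuses` and `pattern` are only illustrative.

```python
import pandas as pd

# Sample rows copied from the question; real data would come from data.csv.
df = pd.DataFrame({
    "Notes": [
        "Movie Date 05-28-2018 Passed",
        "MTD loan slip dated 8-10-14 the Issued",
        "Max over date 10-2-15 and repaired",
    ],
    "ID": [1010, 1111, 11232],
})

# Assumed list of known status words (illustrative).
statuses = ["Passed", "Issued", "repaired"]

# One alternation pattern with a single capture group:
# \b(Passed|Issued|repaired)\b
pattern = r"\b(" + "|".join(statuses) + r")\b"

# expand=False returns a Series; rows without a match get NaN.
df["Status"] = df["Notes"].str.extract(pattern, expand=False)
print(df)
```

On larger tables this usually runs faster than iterrows(), and rows with no known status word end up as NaN instead of being skipped silently.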
If there are many possible values, you can also apply a regex to each row obtained from iterrows() to extract the information.
import re
s = 'Movie Date 05-28-2018 Passed'
# group 1 captures the date, group 2 the word that follows it
p = re.search(r'Dated?\s(\d+-\d+-\d+)\s([a-zA-Z]+)', s)
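p.group(1) will hold the date value ('05-28-2018') and p.group(2) will hold the value 'Passed'. Note that this pattern is case-sensitive, so it matches 'Date' and 'Dated' but not 'date' or 'dated'; pass re.IGNORECASE to re.search to cover both. Hope this helps.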
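To show how the regex idea might be applied to the whole table, here is a rough sketch that fills both the Date and Status columns. It makes a few assumptions that are not in the original posts: the dates are month-day-year (as the expected output suggests), the keyword is matched case-insensitively, and a filler word such as "the" or "and" may sit between the date and the status. The pattern and the helper `normalize_date` are hypothetical names, not part of any library.

```python
import re
from datetime import datetime

import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame({
    "Notes": [
        "Movie Date 05-28-2018 Passed",
        "MTD loan slip dated 8-10-14 the Issued",
        "Max over date 10-2-15 and repaired",
    ],
    "ID": [1010, 1111, 11232],
})

# Assumed pattern: "date"/"dated", the date itself, an optional
# filler word ("the"/"and"), then the status word.
pattern = (
    r"\bdated?\s+(?P<Date>\d{1,2}-\d{1,2}-\d{2,4})"
    r"\s+(?:(?:the|and)\s+)?(?P<Status>[a-z]+)"
)


def normalize_date(text):
    """Turn 'mm-dd-yyyy' or 'mm-dd-yy' into 'm/d/yyyy'; None if unparseable."""
    if pd.isna(text):
        return None
    # Try the four-digit year first, then fall back to the two-digit year.
    for fmt in ("%m-%d-%Y", "%m-%d-%y"):
        try:
            parsed = datetime.strptime(text, fmt)
        except ValueError:
            continue
        return f"{parsed.month}/{parsed.day}/{parsed.year}"
    return None


# Named groups become the column names of the extracted frame.
extracted = df["Notes"].str.extract(pattern, flags=re.IGNORECASE)
df["Date"] = extracted["Date"].map(normalize_date)
df["Status"] = extracted["Status"]
print(df)
```

Rows whose Notes do not fit the pattern get missing values in both new columns, so they are easy to find and handle separately. If the notes use other filler words or date layouts, the pattern and the format list would need to be extended.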