
Extracting text from dataset

I am working on a dataset from which I need to extract all the available dates. Dates can be in any of the following formats:

04/20/2009; 04/20/09; 4/20/09; 4/3/09
Mar-20-2009; Mar 20, 2009; March 20, 2009; Mar. 20, 2009; Mar 20 2009;
20 Mar 2009; 20 March 2009; 20 Mar. 2009; 20 March, 2009
Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009
Feb 2009; Sep 2009; Oct 2010
6/2008; 12/2009
2009; 2010

I wrote the code below:

df['dates'] = df['text'].str.extract(r'((?:\d{1,2}[/ ])?(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec[a-z.,]*[- ])?(?:\d{1,2}[a-z-, /]{1,4})?(?:\d{2,4}))')

It gives the correct result except for some text, such as:

TEXT OUTPUT

Lab: B12 969 2007\n 12 969 #should give 2007

for 35 years, sold in 1985\n 35 #should give 1985

x 14 yrs who died i... 14 #should not be considered

I tried changing the extract pattern to:

r'((?:\d{1,2}[/ ])?(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec[a-z.,]*[- ])?(?:\d{1,2}[a-z-, ]{1,4})?(?:[/]\d{2})?(?:\d{4})?)' 

But with this change the entire result got worse.

The problem with your regex is that nearly all of its constituents are optional (only the final \d{2,4} is required), so it matches numbers that have nothing to do with dates. You need to build a regex with obligatory parts to avoid matching arbitrary numbers.
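To make the diagnosis concrete, here is a small sketch that runs the pattern from the question against the three problem strings using the standard `re` module. The names `asker_pattern` and `first_match` are made up for this illustration; `re.search` is used here as a stand-in for the first-match behaviour of `.str.extract`.

```python
import re

# The pattern from the question, copied unchanged.
asker_pattern = (
    r'((?:\d{1,2}[/ ])?'
    r'(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec[a-z.,]*[- ])?'
    r'(?:\d{1,2}[a-z-, /]{1,4})?'
    r'(?:\d{2,4}))'
)

def first_match(text):
    """Return the first captured substring, or None (hypothetical helper)."""
    m = re.search(asker_pattern, text)
    return m.group(1) if m else None

# Every group except the trailing \d{2,4} may be skipped, so a bare
# two-digit number is already a complete match.
bare_number_ok = re.fullmatch(asker_pattern, '35') is not None

for text in ['Lab: B12 969 2007', 'for 35 years, sold in 1985', 'x 14 yrs who died']:
    print(repr(text), '->', repr(first_match(text)))
```

Because the search stops at the leftmost match, the stray numbers win over the real years that come later in the string.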

And this is tricky: there are different types of dates in your sample input. For those inputs, I'd recommend:

(?<!\d)((?<!\d[ \t])(?:A(?:pr(?:il)?|ug(?:ust)?)|Dec(?:ember)?|Feb(?:ruary)?|J(?:an(?:uary)?|u(?:ly|ne|[ln]))|Ma(?:rch|[ry])|Nov(?:ember)?|Oct(?:ober)?|Sep(?:tember)?)(?:-\d{1,2}-\d{4}|(?:\.?\s*\d{1,2}(?:st|[rn]d|th)?,?)?\s*\d{4})|\d{1,2}\s+(?:A(?:pr(?:il)?|ug(?:ust)?)|Dec(?:ember)?|Feb(?:ruary)?|J(?:an(?:uary)?|u(?:ly|ne|[ln]))|Ma(?:rch|[ry])|Nov(?:ember)?|Oct(?:ober)?|Sep(?:tember)?)\.?,?\s*\d{4}|(?:\d{1,2}/)?\d{1,2}/\d{2}(?:\d{2})?|(?:19|20)\d{2})(?!\d)

See the regex demo. It matches:

  • (?<!\d) - a negative lookbehind: no digit is allowed immediately to the left of the current location
  • ( - start of the outer capturing group (necessary for .str.extract)
    • (?<!\d[ \t]) - no digit followed by a space or tab is allowed immediately to the left of the current location
    • (?:A(?:pr(?:il)?|ug(?:ust)?)|Dec(?:ember)?|Feb(?:ruary)?|J(?:an(?:uary)?|u(?:ly|ne|[ln]))|Ma(?:rch|[ry])|Nov(?:ember)?|Oct(?:ober)?|Sep(?:tember)?) - names of months with their abbreviations
    • (?:-\d{1,2}-\d{4}|(?:\.?\s*\d{1,2}(?:st|[rn]d|th)?,?)?\s*\d{4}) - either of the two alternatives:
      • -\d{1,2}-\d{4} - a -, 1 or 2 digits, a - and then 4 digits
      • | - or
      • (?:\.?\s*\d{1,2}(?:st|[rn]d|th)?,?)? - an optional non-capturing group that matches 1 or 0 occurrences of:
        • \.? - an optional .
        • \s* - 0+ whitespaces
        • \d{1,2} - 1 or 2 digits
        • (?:st|[rn]d|th)? - an optional sequence of chars: st, r or n followed by d, or th
        • ,? - an optional comma
      • \s*\d{4} - 0+ whitespaces and then 4 digits
  • | - or
    • \d{1,2}\s+ - 1 or 2 digits and then 1+ whitespaces
    • (?:A(?:pr(?:il)?|ug(?:ust)?)|Dec(?:ember)?|Feb(?:ruary)?|J(?:an(?:uary)?|u(?:ly|ne|[ln]))|Ma(?:rch|[ry])|Nov(?:ember)?|Oct(?:ober)?|Sep(?:tember)?) - names of months with their abbreviations (same as above)
    • \.? - an optional dot
    • ,? - an optional comma
    • \s* - 0+ whitespaces
    • \d{4} - four digits
  • | - or
    • (?:\d{1,2}/)? - an optional sequence of 1 or 2 digits and then /
    • \d{1,2} - 1 or 2 digits
    • / - a /
    • \d{2}(?:\d{2})? - 2 digits and an optional sequence of 2 more digits (it allows 2 or 4 digits but not 3)
  • | - or
    • (?:19|20) - 19 or 20
    • \d{2} - two digits
  • ) - end of the outer capturing group
  • (?!\d) - a negative lookahead: no digit is allowed immediately to the right of the current location.

In Python, you may define blocks for the pattern and build it dynamically:

months = r'(?:A(?:pr(?:il)?|ug(?:ust)?)|Dec(?:ember)?|Feb(?:ruary)?|J(?:an(?:uary)?|u(?:ly|ne|[ln]))|Ma(?:rch|[ry])|Nov(?:ember)?|Oct(?:ober)?|Sep(?:tember)?)'
pattern = rf'(?<!\d)((?<!\d[ \t]){months}(?:-\d{{1,2}}-\d{{4}}|(?:\.?\s*\d{{1,2}}(?:st|[rn]d|th)?,?)?\s*\d{{4}})|\d{{1,2}}\s+{months}\.?,?\s*\d{{4}}|(?:\d{{1,2}}/)?\d{{1,2}}/\d{{2}}(?:\d{{2}})?|(?:19|20)\d{{2}})(?!\d)'
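As a quick check of the pattern built this way, the sketch below rebuilds it (split across lines only for readability, one alternative per line) and runs it with the standard `re` module on a few strings, including the three problem cases from the question. The helper name `first_date` and the sample sentences are assumptions made for this illustration.

```python
import re

# Month names and abbreviations, same expression as in the answer.
months = (r'(?:A(?:pr(?:il)?|ug(?:ust)?)|Dec(?:ember)?|Feb(?:ruary)?'
          r'|J(?:an(?:uary)?|u(?:ly|ne|[ln]))|Ma(?:rch|[ry])'
          r'|Nov(?:ember)?|Oct(?:ober)?|Sep(?:tember)?)')

# Same pattern as in the answer, one alternative per line.
pattern = (
    r'(?<!\d)('
    rf'(?<!\d[ \t]){months}(?:-\d{{1,2}}-\d{{4}}|(?:\.?\s*\d{{1,2}}(?:st|[rn]d|th)?,?)?\s*\d{{4}})'
    rf'|\d{{1,2}}\s+{months}\.?,?\s*\d{{4}}'
    r'|(?:\d{1,2}/)?\d{1,2}/\d{2}(?:\d{2})?'
    r'|(?:19|20)\d{2}'
    r')(?!\d)'
)

def first_date(text):
    """Return the first date-like substring, or None (hypothetical helper)."""
    m = re.search(pattern, text)
    return m.group(1) if m else None

samples = [
    'Lab: B12 969 2007',
    'for 35 years, sold in 1985',
    'x 14 yrs who died',
    'seen 04/20/09 in clinic',
    'on Mar 20th, 2009 she',
    'since 20 Mar. 2009',
    'started Feb 2009',
    'around 6/2008',
]
for text in samples:
    print(repr(text), '->', repr(first_date(text)))
```

With pandas, the same `pattern` string can be passed to `df['text'].str.extract(pattern)` as in the question; passing `expand=False` returns a Series instead of a one-column DataFrame.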

Try using pandas.to_datetime(); it converts the most common date formats to datetime objects. Note that it parses strings that are already dates, so it complements the extraction step rather than replacing it.

Try this pattern. My suggestion is to decompose the problem into pieces and match one pattern at a time, because a single regex for this problem gets messy and struggles to cover all the edge cases.

I've included the sub-regexes so you can refine them to validate the edge cases.
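The pattern from this answer is not preserved on this page, so the following is only a minimal sketch of the decomposition idea, not the answerer's code. It assumes one small regex per date shape, tried from most to least specific; `MONTH`, `SUB_PATTERNS`, `find_date` and the shape names are all hypothetical.

```python
import re

# Month names with optional full spelling and an optional trailing dot.
MONTH = (r'(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|June?'
         r'|July?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?'
         r'|Dec(?:ember)?)\.?')

# One sub-pattern per date shape, ordered from most to least specific.
SUB_PATTERNS = {
    'numeric_mdy': r'\b\d{1,2}/\d{1,2}/(?:\d{4}|\d{2})\b',
    'month_day_year': rf'\b{MONTH}[ -]\d{{1,2}}(?:st|nd|rd|th)?,?[ -]\d{{4}}\b',
    'day_month_year': rf'\b\d{{1,2}} {MONTH},? \d{{4}}\b',
    'month_year': rf'\b{MONTH},? \d{{4}}\b',
    'numeric_my': r'\b\d{1,2}/\d{4}\b',
    'year_only': r'\b(?:19|20)\d{2}\b',
}

def find_date(text):
    """Return (shape, matched_text) for the first sub-pattern that matches, else None."""
    for shape, sub_pattern in SUB_PATTERNS.items():
        m = re.search(sub_pattern, text)
        if m:
            return shape, m.group(0)
    return None

for text in ['Mar 20th, 2009', '20 Mar. 2009', 'Feb 2009', '6/2008',
             'for 35 years, sold in 1985', 'x 14 yrs who died']:
    print(repr(text), '->', find_date(text))
```

Each shape can be tested and tightened on its own. One design consequence to be aware of: the loop returns the most specific shape found anywhere in the text, which is not necessarily the leftmost date.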
