
Matching diverse dates in OpenRefine

I am trying to use the value.match command in OpenRefine 2.6 to split the information present in a column into (at least) two columns. The data are, however, quite messy.
Sometimes I have full dates:

May 30, 1949

Sometimes full dates are combined with other dates and attributes:

May 30, 1949, published 1979
May 30, 1949 and 1951, published 1979
May 30, 1949, printed 1980
May 30, 1949, print executed 1988
May 30, 1949, prints executed 1988
published 1940

Sometimes there is a timespan:

1905-05 OR 1905-1906

Sometimes only the year:

1905

Sometimes a year with attributes:

August or September 1908

The data don't seem to follow any specific schema or order.
I would like to extract at least the approximate (ca.) start and end years, in order to have two columns:

-----------------------  
|start_date | end_date|  
|1905       | 1906    |   
-----------------------  

without the rest of the attributes.

I can find the last date using
value.match(/.*(\\d{4}).*?/)[0]
and the first one with
value.match(/.*^(\\d{4}).*?/)[0]
but I have some trouble with the two formulas.
The latter cannot match anything in the case of:
May 30, 1949 and 1951, published 1979
while in the case of:
Paris, winter 1911-12
the latter formula cannot match anything and the former formula matches 1911.

Does anyone know how I can solve this problem?
I need a solution that takes the first date as start_date and the final date as end_date or, better (I don't know if it is possible), the earliest date as start_date and the latest date as end_date. I would also be glad to have some hints on how to extract other information: for example, if published, printed, or executed is present in the text, copy the date to a new column named "execution". It should be something like: create a new column with if(value.match("string1|string2|string3" + (\\d{4}), "perform the operation", do nothing)

value.match() is a very useful but sometimes tricky function. To extract a pattern from a text, I prefer to use Python/Jython's regular expressions:

import re

pattern = re.compile(r"\d{4}")

return pattern.findall(value)

From there, you can create a string with all the years concatenated:

return ",".join(pattern.findall(value))

Or select only the first:

return pattern.findall(value)[0]

Or the last:

return pattern.findall(value)[-1]

etc.
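The question also asks for the earliest year as start_date and the latest as end_date. A minimal sketch of that idea follows, written as a plain Python function so it can be tried outside OpenRefine. The names `YEAR_RE` and `year_range` are made up for this illustration, and the code assumes that a two-digit suffix such as the "12" in "1911-12" is a year in the same century as the year before it (not a month).

```python
import re

# A four-digit year, optionally followed by a hyphen and a 2- or 4-digit end year.
YEAR_RE = re.compile(r"\b(\d{4})(?:\s*-\s*(\d{4}|\d{2}))?\b")


def year_range(value):
    """Return (earliest_year, latest_year) as ints, or (None, None) if no year is found."""
    years = []
    for first, second in YEAR_RE.findall(value):
        start = int(first)
        years.append(start)
        if second:
            if len(second) == 2:
                # Assumption: "1911-12" means 1911 to 1912, so reuse the century.
                end = (start // 100) * 100 + int(second)
            else:
                end = int(second)
            years.append(end)
    if not years:
        return (None, None)
    return (min(years), max(years))
```

In an OpenRefine Jython expression, the same logic would end with something like `return year_range(value)[0]` for the start column and `return year_range(value)[1]` for the end column.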

Same thing for your sub-question:

import re

pattern = re.compile(r"(published|printed|executed)\s+(\d+)")

return pattern.findall(value)[0][1]

Or:

import re

pattern = re.compile(r"(published|printed|executed)\s+(\d+)")

m = re.search(pattern, value)

return m.group(2)
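Both variants raise an error on rows without a keyword: `findall(value)[0]` fails with an IndexError on an empty list, and `m.group(2)` fails with an AttributeError when `m` is None. A possible guard is sketched below. The function name `execution_year` is hypothetical, and the pattern is narrowed to four-digit years on the assumption that only years follow the keywords.

```python
import re

# Keywords taken from the sample values in the question.
EXECUTION_RE = re.compile(r"\b(published|printed|executed)\s+(\d{4})\b")


def execution_year(value):
    """Return the year that follows a keyword, or None when there is none."""
    m = EXECUTION_RE.search(value)
    return m.group(2) if m else None
```

Values such as "print executed 1988" are still covered, because the search only needs "executed" directly before the year.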


Here is a regex which will extract start_date and end_date in named groups.

If there is only one date, it is treated as the start_date:

((?<start_date>\\d{4}).*?)?(?<end_date>\\d{4}|(?<=-)\\d{2})?$
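To try this pattern in Python or Jython, the named groups have to be written as `(?P<name>...)` instead of the Java-style `(?<name>...)`, and in a raw string the backslashes are single. The sketch below is one way to test it. `DATE_RANGE_RE` and `start_and_end` are names chosen for this illustration only.

```python
import re

# Same pattern as above, rewritten with Python's named-group syntax.
DATE_RANGE_RE = re.compile(
    r"((?P<start_date>\d{4}).*?)?(?P<end_date>\d{4}|(?<=-)\d{2})?$"
)


def start_and_end(value):
    """Return (start_date, end_date) as strings; a missing part is None."""
    # The pattern can match the empty string at the end of the value,
    # so search() always returns a match object.
    m = DATE_RANGE_RE.search(value)
    return (m.group("start_date"), m.group("end_date"))
```

Note that because of the trailing `$`, the end date is only captured when it is the last thing in the value, and a two-digit end such as "12" in "1911-12" is returned as written rather than expanded to 1912.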

