
Last year occurrence from string

I have strings like this:

ACB 01900 X1911D 1910 1955-2011 3424 2135 1934 foobar

I'm trying to get the last occurrence of a single year (from 1900 to 2050), so I need to extract only 1934 from that string.

I'm trying with:

 grep -P -o '\s(19|20)[0-9]{2}\s(?!\s(19|20)[0-9]{2}\s)'

or

grep -P -o '((19|20)[0-9]{2})(?!\s\1\s)'

But it matches both 1910 and 1934.

Here are the Regex101 examples:

https://regex101.com/r/UetMl0/3

https://regex101.com/r/UetMl0/4

Plus: how can I extract the year without the surrounding spaces, without running an extra grep to filter them out?

I don't see a way to do this with grep, because it can't output just one of the capture groups, only the whole match.

With perl I'd do something like

perl -lne 'if (/^.*\b(19\d\d|20(?:[0-4]\d|50))\b/) { print $1 }'

Idea: Use ^.* (greedy) to consume as much of the string up front as possible, thus finding the last possible match. Use \b (word boundary) around the matched number to prevent matching 01900 or X1911D. Only print the first capture group ($1).

I tried to implement your requirement of 1900-2050; if that's too complicated, ((?:19|20)\d\d) will do (but will also match e.g. 2099).

The regex to do your task using grep can be as follows:

\b(?:19\d{2}|20[0-4]\d|2050)\b(?!.*\b(?:19\d{2}|20[0-4]\d|2050)\b)

Details:

  • \b - Word boundary.
  • (?: - Start of a non-capturing group, needed as a container for alternatives.
    • 19\d{2}| - The first alternative (1900 - 1999).
    • 20[0-4]\d| - The second alternative (2000 - 2049).
    • 2050 - The third alternative, just 2050.
  • ) - End of the non-capturing group.
  • \b - Word boundary.
  • (?! - Negative lookahead for:
    • .* - A sequence of any chars, meaning in effect "what follows can occur anywhere further on".
    • \b(?:19\d{2}|20[0-4]\d|2050)\b - The same expression as before.
  • ) - End of the negative lookahead.

The word boundary anchors ensure that you will not match numbers that are parts of longer words, e.g. X1911D.

The negative lookahead ensures that you will match just the last occurrence of the required year.
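The lookahead pattern can be tried directly with grep. This is a sketch that assumes GNU grep built with PCRE support (the -P option); the function name last_year_grep is hypothetical.

```shell
# Hypothetical helper: keep only a year that has no other valid year after it.
# -P enables PCRE syntax (lookahead, \b, \d); -o prints only the matched text.
last_year_grep() {
  grep -Po '\b(?:19\d{2}|20[0-4]\d|2050)\b(?!.*\b(?:19\d{2}|20[0-4]\d|2050)\b)'
}

printf '%s\n' 'ACB 01900 X1911D 1910 1955-2011 3424 2135 1934 foobar' | last_year_grep
# => 1934
```

Because \b treats the hyphen as a boundary, 1955 and 2011 in 1955-2011 also count as candidate years here; they are rejected only because 1934 follows them.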

If you can use a tool other than grep, one that supports calling a previously numbered group with (?n), where n is the number of a capturing group, the regex can be a bit simpler:

(\b(?:19\d{2}|20[0-4]\d|2050)\b)(?!.*(?1))

Details:

  • (\b(?:19\d{2}|20[0-4]\d|2050)\b) - The regex like before, but enclosed within a capturing group (it will be "called" later).
  • (?!.*(?1)) - Negative lookahead for the pattern of capturing group No. 1, located anywhere further on.

This way you avoid writing the same expression again.

For a working example in regex101 see https://regex101.com/r/fvVnZl/1

You may use a PCRE regex without any groups to return only the last occurrence of the pattern you need if you prepend the pattern with ^.*\K or, in your case, since you expect a whitespace boundary, ^(?:.*\s)?\K:

grep -Po '^(?:.*\s)?\K(?:19\d{2}|20(?:[0-4]\d|50))(?!\S)' file

See the regex demo.

Details

  • ^ - start of line
  • (?:.*\s)? - an optional non-capturing group matching 1 or 0 occurrences of
    • .* - any 0+ chars other than line break chars, as many as possible
    • \s - a whitespace char
  • \K - match reset operator discarding the text matched so far
  • (?:19\d{2}|20(?:[0-4]\d|50)) - 19 and any two digits, or 20 followed by either a digit from 0 to 4 and then any digit (00 to 49) or 50.
  • (?!\S) - a negative lookahead that requires a whitespace or the end of string immediately to the right.

See an online demo:

s="ACB 01900 X1911D 1910 1955-2011 3424 2135 1934 foobar"
grep -Po '^(?:.*\s)?\K(?:19\d{2}|20(?:[0-4]\d|50))(?!\S)' <<< "$s"
# => 1934
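The choice between the whitespace boundary and a plain word boundary changes the result when a year is glued to other characters, as in 1955-2011. The sketch below compares the two on a made-up input line; the function names last_ws and last_wb are hypothetical, and GNU grep with -P support is assumed.

```shell
# Whitespace-delimited: 1955-2011 is a single token, so neither half counts.
last_ws() {
  grep -Po '^(?:.*\s)?\K(?:19\d{2}|20(?:[0-4]\d|50))(?!\S)'
}

# Word boundaries: the hyphen is a boundary, so 2011 is accepted as a year.
last_wb() {
  grep -Po '^.*\K\b(?:19\d{2}|20(?:[0-4]\d|50))\b'
}

line='foo 1910 1955-2011 bar'
printf '%s\n' "$line" | last_ws
# => 1910
printf '%s\n' "$line" | last_wb
# => 2011
```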

Have you ever heard this saying:

Some people, when confronted with a problem, think
“I know, I'll use regular expressions.”   Now they have two problems. 

Keep it simple: you're interested in finding a number between 2 numbers, so just use a numeric comparison, not a regexp:

$ awk -v min=1900 -v max=2050 '{yr=""; for (i=1;i<=NF;i++) if ( ($i ~ /^[0-9]{4}$/) && ($i >= min) && ($i <= max) ) yr=$i; print yr}' file
1934

You didn't say what to do if no date within your range is present, so the above outputs a blank line if that happens, but it is easily tweaked to do anything else.

Changing the above script to find the first instead of the last date is trivial (move the print inside the if and stop at the first hit), using different start or end dates in your range is trivial (change the min and/or max values), and so on, which is a strong indication that this is the right approach. Try changing any of those requirements with a regexp-based solution.
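The tweaks described above can be sketched as one parameterised function. The name pick_year and its argument order are hypothetical, the four-digit test is written as [0-9][0-9][0-9][0-9] only so that awk versions without {4} interval support also accept it, and the + 0 forces numeric comparison.

```shell
# Hypothetical helper: pick_year MIN MAX [first|last]   (default: last)
# Prints the first or last whitespace-separated 4-digit field within MIN..MAX,
# or an empty line when the input line has none.
pick_year() {
  awk -v min="$1" -v max="$2" -v mode="${3:-last}" '{
    yr = ""
    for (i = 1; i <= NF; i++)
      if ($i ~ /^[0-9][0-9][0-9][0-9]$/ && $i + 0 >= min + 0 && $i + 0 <= max + 0) {
        yr = $i
        if (mode == "first") break   # stop at the first hit
      }
    print yr
  }'
}

s='ACB 01900 X1911D 1910 1955-2011 3424 2135 1934 foobar'
printf '%s\n' "$s" | pick_year 1900 2050
# => 1934
printf '%s\n' "$s" | pick_year 1900 2050 first
# => 1910
```

Changing the range is then only a matter of passing different MIN and MAX values.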

