Using ^ to match beginning of line in Python regex

I'm trying to extract publication years from ISI-style data exported from the Thomson-Reuters Web of Science. The line for "Publication Year" looks like this (at the very beginning of a line):

PY 2015

For the script I'm writing I have defined the following regex function:

import re

# Read the whole export into one string
with open('savedrecs.txt') as f:
    wosrecords = f.read()

def findyears():
    result = re.findall(r'PY (\d\d\d\d)', wosrecords)
    print(result)

findyears()

This, however, gives false positive results because the pattern may appear elsewhere in the data.

So, I want to match the pattern only at the beginning of a line. Normally I would use ^ for this purpose, but r'^PY (\d\d\d\d)' fails to match my results. On the other hand, using \n seems to do what I want, but that might lead to further complications for me.

re.findall(r'^PY (\d\d\d\d)', wosrecords, flags=re.MULTILINE)

should work.
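
To illustrate why the flag matters, here is a minimal sketch on a made-up record (the `sample` string is an assumption for demonstration, not real Web of Science data). Without `re.MULTILINE`, `^` anchors only at the start of the whole string; with it, `^` also matches right after each newline.

```python
import re

# Hypothetical sample: "PY 1999" also appears in the middle of a line.
sample = "TI Some title\nAB cites PY 1999 in text\nPY 2015\n"

# Unanchored: picks up the mid-line occurrence as a false positive.
unanchored = re.findall(r'PY (\d\d\d\d)', sample)

# Anchored, no flag: ^ only matches at the very start of the string,
# and the string starts with "TI", so nothing is found.
no_flag = re.findall(r'^PY (\d\d\d\d)', sample)

# Anchored, with re.MULTILINE: ^ matches at the start of every line.
with_flag = re.findall(r'^PY (\d\d\d\d)', sample, flags=re.MULTILINE)

print(unanchored)  # ['1999', '2015']
print(no_flag)     # []
print(with_flag)   # ['2015']
```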

You can simply add the (?m) inline modifier flag to the start of the pattern:

(?m)^PY\s+(\d{4})
^^^^

Do not confuse it with (?s)! (?s) is the DOTALL inline flag, which makes . match any character, including line break characters.
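
The difference between the two inline flags can be checked with a short sketch (the `text` value is an illustrative assumption): (?m) changes where ^ matches, while (?s) only changes what . matches.

```python
import re

text = "PY 2015\nPY 2017"

# (?m): ^ matches at the start of every line.
multiline = re.findall(r'(?m)^PY\s+(\d{4})', text)

# (?s): ^ still anchors at the start of the string only.
dotall = re.findall(r'(?s)^PY\s+(\d{4})', text)

# (?s) lets . match the newline, so this pattern can span two lines.
spans_lines = re.search(r'(?s)2015.PY', text) is not None

# Without (?s), . stops at the newline and the same pattern fails.
stops_at_newline = re.search(r'2015.PY', text) is None

print(multiline)         # ['2015', '2017']
print(dotall)            # ['2015']
print(spans_lines)       # True
print(stops_at_newline)  # True
```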

Alternatively, you can use re.findall with the re.M (or re.MULTILINE) option:

import re
p = re.compile(r'^PY\s+(\d{4})', re.M)
test_str = "PY123\nPY 2015\nPY 2017"
print(re.findall(p, test_str)) 

See an IDEONE demo.

EXPLANATION:

  • ^ - Start of a line (due to re.M)
  • PY - Literal PY
  • \s+ - 1 or more whitespace characters
  • (\d{4}) - Capture group holding 4 digits

In this particular case there is no need to use regular expressions, because the searched string is always 'PY' and is expected to be at the beginning of the line, so one can use str.find for this job. The find method returns the position at which the substring is found in the given string or line, so if it is found at the start of the string the returned value is 0 (-1 if it is not found at all), i.e.:

In [12]: 'PY 2015'.find('PY')
Out[12]: 0

In [13]: ' PY 2015'.find('PY')
Out[13]: 1

Perhaps it could be a good idea to strip the whitespace first, i.e.:

In [14]: '  PY 2015'.find('PY')
Out[14]: 2

In [15]: '  PY 2015'.strip().find('PY')
Out[15]: 0

Next, if only the year is of interest, it can be extracted with split, i.e.:

In [16]: '  PY 2015'.strip().split()[1]
Out[16]: '2015'
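
Putting these steps together, a regex-free version might look like the sketch below. The function name `find_years` and the `sample` string are hypothetical, and the sketch assumes each record line is a tag followed by whitespace and a value. It does not strip leading whitespace, so only lines that really begin with 'PY' count.

```python
def find_years(text):
    """Collect the value from each line that starts with the 'PY' tag."""
    years = []
    for line in text.splitlines():
        # find() returns 0 only when 'PY' sits at the very start of the line.
        if line.find('PY') == 0:
            fields = line.split()
            # Require the tag to be exactly 'PY' and to be followed by a value.
            if len(fields) >= 2 and fields[0] == 'PY':
                years.append(fields[1])
    return years

# Hypothetical sample with a mid-line "PY 1999" that must be ignored.
sample = "TI Some title\nAB cites PY 1999 in text\nPY 2015\nPY 2017\n"
print(find_years(sample))  # ['2015', '2017']
```

Unlike the (\d{4}) pattern, this sketch does not check that the value is a four-digit number; that check would have to be added separately if the data can contain malformed PY lines.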
