How to extract text before a specific keyword in Python?
import re
col4="""May god bless our families studied. CiteSeerX 2009-05-24 2007-11-19 2004"""
b=re.findall(r'\sCiteSeerX',col4)
print(b)
I have to print "May god bless our families studied". I'm using Python regular expressions to extract the file name, but I'm only getting CiteSeerX as output. I'm doing this on a very large dataset, so I only want to use regular expressions; if there is another more efficient and faster way, please point it out.
Also, I want the last year, 2004, as output. I'm new to regular expressions, and I know that my implementation above is wrong, but I can't find a correct one. This is a very naive question. I'm sorry, and thank you in advance.
If the structure of all your data is similar to the sample you provided, this should get you going:
import re
# col4 is the sample string from the question
data = re.findall(r"(.*?) CiteSeerX.*(\d{4})$", col4)
if data:
    # we have a match; unpack the two capturing groups
    title, year = data[0]
    print(title, year)
else:
    print("Unable to parse the string")
# Output: May god bless our families studied. 2004
This snippet extracts everything before CiteSeerX as the title and the last 4 digits as the year (again, assuming that the structure is similar for all the data you have). The parentheses mark the capturing groups for the parts that we are interested in.
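To make the role of the capturing groups more visible, here is a minimal sketch that uses the same pattern with re.search and named groups. The group names title and year are illustrative choices, not part of the original answer, and the sketch assumes the data looks like the sample in the question.

```python
import re

# Same sample string as in the question.
col4 = "May god bless our families studied. CiteSeerX 2009-05-24 2007-11-19 2004"

# (?P<name>...) defines a named capturing group; the names are illustrative.
match = re.search(r"(?P<title>.*?) CiteSeerX.*(?P<year>\d{4})$", col4)
if match:
    title = match.group("title")  # text before " CiteSeerX"
    year = match.group("year")    # last four digits of the string
    print(title, year)
else:
    title = year = None
    print("Unable to parse the string")
```

re.search returns a single match object (or None), which may be easier to work with than the list of tuples that re.findall produces when only one match per string is expected.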
Update: For the case where there is metadata following the year of publishing, use the following regular expression:
import re
YEAR = r"\d{4}"
DATE = r"\d\d\d\d-\d\d-\d\d"
def parse_citation(s):
    # title, then the keyword, two dates, and the captured year
    regex = r"(.*?) CiteSeerX\s+{date} {date} ({year}).*$".format(date=DATE, year=YEAR)
    data = re.findall(regex, s)
    if data:
        # we have a match; return the (title, year) tuple
        return data[0]
    else:
        return None
c1 = """May god bless our families studied. CiteSeerX 2009-05-24 2007-11-19 2004"""
c2 = """May god bless our families studied. CiteSeerX 2009-05-24 2007-11-19 2004 application/pdf text http //citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.1.1483 http //www.biomedcentral.com/content/pdf/1471-2350-5-20.pdf en Metadata may be used without restrictions as long as the oai identifier remains attached to it."""
print(parse_citation(c1))
print(parse_citation(c2))
# Output:
# ('May god bless our families studied.', '2004')
# ('May god bless our families studied.', '2004')
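Since the question mentions a very large dataset, one option is to compile the pattern once and reuse it for every line. The following is only a sketch under that assumption: the names CITATION_RE and parse_lines are hypothetical, and the pattern is a condensed form of the one in the update (keyword, two dates, then the year).

```python
import re

# Compiled once and reused for every line; the name is hypothetical.
CITATION_RE = re.compile(
    r"(.*?) CiteSeerX\s+\d{4}-\d{2}-\d{2} \d{4}-\d{2}-\d{2} (\d{4})"
)

def parse_lines(lines):
    """Yield (title, year) for each line that matches; skip the others."""
    for line in lines:
        m = CITATION_RE.match(line)
        if m:
            yield m.group(1), m.group(2)

sample = [
    "May god bless our families studied. CiteSeerX 2009-05-24 2007-11-19 2004",
    "a line without the keyword",
]
results = list(parse_lines(sample))
print(results)
```

Because parse_lines is a generator, it can also be fed an open file object line by line, so the whole dataset does not have to be held in memory.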
Here is an answer that doesn't use regex.
>>> s = "now is the time for all good men"
>>> s.find("all")
20
>>> s[:20]
'now is the time for '
>>>
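Applied to the string from the question, the same idea might look like the sketch below. It uses str.partition instead of find plus slicing, and it assumes the year is either the last whitespace-separated token or the third token after the keyword, as in the sample data.

```python
col4 = "May god bless our families studied. CiteSeerX 2009-05-24 2007-11-19 2004"

# partition splits on the first occurrence of the separator and returns
# (before, separator, after); separator is "" when it is not found.
before, sep, after = col4.partition(" CiteSeerX")
title = before if sep else None

# Assumption: the year is the last whitespace-separated token.
# This holds for the sample but not when metadata follows the year.
year_last = col4.rsplit(None, 1)[-1]

# Assumption: the keyword is followed by two dates and then the year,
# so the year is the third token after the keyword.
year_third = after.split()[2] if sep else None

print(title, year_last, year_third)
```

The third-token variant also works when extra metadata follows the year, provided the two dates always come first.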