How do I extract the text before a specific keyword in Python?

Question

import re
col4="""May god bless our families studied. CiteSeerX  2009-05-24 2007-11-19 2004"""
b=re.findall(r'\sCiteSeerX',col4)
print(b)  # prints [' CiteSeerX'], not the text before it

I have to print "May god bless our families studied". I'm using Python regular expressions to extract the file name, but I'm only getting CiteSeerX as output. I'm doing this on a very large dataset, so I would like to stick with regular expressions, but if there is a more efficient and faster way, please point it out.
I also want the last year, 2004, as output. I'm new to regular expressions and I know that my implementation above is wrong, but I can't find a correct one. This is a very naive question. I'm sorry, and thank you in advance.

Answer 1

If all of your data has the same structure as the sample you provided, this should get you going:

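import re
# Group 1: everything before " CiteSeerX"; group 2: the 4 digits at the end of the string.
data = re.findall(r"(.*?) CiteSeerX.*(\d{4})$", col4)
if data:
    # We have a match; unpack the two capturing groups of the first match.
    title, year = data[0]
    print(title, year)
else:
    print("Unable to parse the string")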

# Output: May god bless our families studied. 2004

This snippet extracts everything before CiteSeerX as the title and the last 4 digits as the year (again, assuming that the structure is similar for all the data you have). The parentheses mark the capturing groups for the parts that we are interested in.
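The same pattern can also be written with named groups and `re.search`, which returns a single match object instead of a list. This is only a sketch: the group names `title` and `year` and the helper `extract_title_year` are made up for illustration, and it assumes every record ends with the 4-digit year, as in the sample.

```python
import re

def extract_title_year(record):
    """Return (title, year), or None if the record does not fit the assumed layout."""
    # (?P<name>...) is a capturing group that can be read back by name.
    match = re.search(r"(?P<title>.*?) CiteSeerX.*(?P<year>\d{4})$", record)
    if match is None:
        return None
    return match.group("title"), match.group("year")

col4 = "May god bless our families studied. CiteSeerX  2009-05-24 2007-11-19 2004"
print(extract_title_year(col4))
```

Reading groups by name keeps the calling code readable if more fields are added to the pattern later.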

Update: If there is metadata following the year of publication, use the following regular expression:

import re
YEAR = r"\d{4}"
DATE = r"\d\d\d\d-\d\d-\d\d"
def parse_citation(s):
    # Title, keyword, two dates, then the year; anything after the year is ignored.
    regex = r"(.*?) CiteSeerX\s+{date} {date} ({year}).*$".format(date=DATE, year=YEAR)
    data = re.findall(regex, s)
    if data:
        # We have a match; return the (title, year) tuple of the first match.
        return data[0]
    else:
        return None

c1 = """May god bless our families studied. CiteSeerX  2009-05-24 2007-11-19 2004"""
c2 = """May god bless our families studied. CiteSeerX 2009-05-24 2007-11-19 2004 application/pdf text http //citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.1.1483 http //www.biomedcentral.com/content/pdf/1471-2350-5-20.pdf en Metadata may be used without restrictions as long as the oai identifier remains attached to it."""
print(parse_citation(c1))
print(parse_citation(c2))
# Output:
# ('May god bless our families studied.', '2004')
# ('May god bless our families studied.', '2004')
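Since the question mentions a very large dataset, one possible way to apply this is to compile the pattern once and stream the records through a generator. The following is a sketch under assumptions: the names `CITATION`, `parse_citations` and `sample` are hypothetical, and each record is assumed to be one string with the same layout as the samples above.

```python
import re

DATE = r"\d{4}-\d{2}-\d{2}"
YEAR = r"\d{4}"
# Compiled once at import time and reused for every record.
CITATION = re.compile(rf"(.*?) CiteSeerX\s+{DATE} {DATE} ({YEAR})")

def parse_citations(lines):
    """Yield (title, year) for each line that matches; skip lines that do not."""
    for line in lines:
        match = CITATION.match(line)
        if match:
            yield match.groups()

sample = [
    "May god bless our families studied. CiteSeerX  2009-05-24 2007-11-19 2004",
    "a line without the keyword",
    "May god bless our families studied. CiteSeerX 2009-05-24 2007-11-19 2004 application/pdf text",
]
results = list(parse_citations(sample))
print(results)
```

Because `parse_citations` is a generator, it can be fed an open file object line by line without loading the whole dataset into memory.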

Answer 2

Here is an answer that doesn't use regex.

>>> s = "now is the time for all good men"
>>> s.find("all")
20
>>> s[:20]
'now is the time for '
>>>
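Applied to the data in the question, the same regex-free idea might look like the sketch below. It uses `str.partition` rather than `find` plus slicing, and it assumes the year is the third whitespace-separated field after the keyword (two dates, then the year), as in the sample; the function name `title_and_year` is hypothetical.

```python
def title_and_year(record, keyword="CiteSeerX"):
    """Split on the keyword without regex; return (title, year) or None."""
    before, sep, after = record.partition(keyword)
    if not sep:
        return None  # keyword not present in this record
    fields = after.split()
    # Assumed layout after the keyword: date, date, year, optional metadata.
    year = fields[2] if len(fields) >= 3 else None
    return before.strip(), year

col4 = "May god bless our families studied. CiteSeerX  2009-05-24 2007-11-19 2004"
print(title_and_year(col4))
```

Unlike the regex versions, this does not check that the fields actually look like dates or a year, so it is only appropriate when the layout is known to be consistent.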

2 solutions

Solution 1: 0, accepted, 2016-02-24 08:08:44

Solution 2: 0, 2016-03-15 14:22:15
