
How to extract text before a specific keyword in Python?

import re
col4="""May god bless our families studied. CiteSeerX  2009-05-24 2007-11-19 2004"""
b=re.findall(r'\sCiteSeerX',col4)
print(b)  # prints [' CiteSeerX'], not the text before the keyword

I have to print "May god bless our families studied". I'm using Python regular expressions to extract the file name, but I'm only getting CiteSeerX as output. I'm doing this on a very large dataset, so I want to use regular expressions only; if there is a more efficient and faster way, please point it out.
Also, I want the last year, 2004, as output. I'm new to regular expressions and I know that my implementation above is wrong, but I can't find a correct one. This is a very naive question. I'm sorry, and thank you in advance.

If all of your data has the same structure as the sample you provided, this should get you going:

import re
data = re.findall(r"(.*?) CiteSeerX.*(\d{4})$", col4)
if data:
    # we have a match: unpack the two capturing groups of the first match
    title, year = data[0]
    print(title, year)
else:
    print("Unable to parse the string")

# Output: May god bless our families studied. 2004

This snippet extracts everything before CiteSeerX as the title and the last 4 digits as the year (again, assuming that the structure is similar for all the data you have). The parentheses mark the capturing groups for the parts that we are interested in.
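To make the role of each capturing group more explicit, the same pattern can be written with named groups and re.search. This is only a minimal sketch: the group names title and year are chosen here for illustration and are not part of the original answer.

```python
import re

col4 = "May god bless our families studied. CiteSeerX  2009-05-24 2007-11-19 2004"

# Same pattern as in the answer, with named groups (names are illustrative).
pattern = re.compile(r"(?P<title>.*?) CiteSeerX.*(?P<year>\d{4})$")

match = pattern.search(col4)
if match:
    # Each group is looked up by name instead of by position.
    title = match.group("title")
    year = match.group("year")
    print(title, year)
else:
    print("Unable to parse the string")
```

Named groups make no difference to what is matched; they only make the extraction code easier to read when the pattern grows.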

Update: For the case where metadata follows the year of publication, use the following regular expression:

import re
YEAR = r"\d{4}"
DATE = r"\d\d\d\d-\d\d-\d\d"
def parse_citation(s):
    regex = r"(.*?) CiteSeerX\s+{date} {date} ({year}).*$".format(date=DATE, year=YEAR)
    data = re.findall(regex, s)
    if data:
        # we have a match: return the (title, year) tuple of the first match
        return data[0]
    else:
        return None

c1 = """May god bless our families studied. CiteSeerX  2009-05-24 2007-11-19 2004"""
c2 = """May god bless our families studied. CiteSeerX 2009-05-24 2007-11-19 2004 application/pdf text http //citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.1.1483 http //www.biomedcentral.com/content/pdf/1471-2350-5-20.pdf en Metadata may be used without restrictions as long as the oai identifier remains attached to it."""
print(parse_citation(c1))
print(parse_citation(c2))
# Output:
# ('May god bless our families studied.', '2004')
# ('May god bless our families studied.', '2004')
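Since the question mentions a very large dataset, one possible way to apply the same idea to many lines is to compile the pattern once and reuse it. The sketch below uses a close variant of the pattern built in parse_citation; the names CITATION_RE, parse_many and records are hypothetical. Python's re module caches compiled patterns internally, so the speed gain may be modest, and the main benefit is that the pattern is defined in one place.

```python
import re

# Close variant of the pattern built in parse_citation, compiled once up front.
CITATION_RE = re.compile(
    r"(.*?) CiteSeerX\s+\d{4}-\d\d-\d\d \d{4}-\d\d-\d\d (\d{4})"
)

def parse_many(lines):
    """Yield a (title, year) tuple for each line that matches; skip the rest."""
    for line in lines:
        match = CITATION_RE.match(line)
        if match:
            yield match.groups()

# Illustrative input: one line from the question and one that does not match.
records = [
    "May god bless our families studied. CiteSeerX  2009-05-24 2007-11-19 2004",
    "a line without the keyword",
]
results = list(parse_many(records))
print(results)
```

Because parse_many is a generator, it can be fed a file object directly and will process one line at a time without loading the whole dataset into memory.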

Here is an answer that doesn't use regex.

>>> s = "now is the time for all good men"
>>> s.find("all")
20
>>> s[:20]
'now is the time for '
>>> 
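Applied to the string from the question, the same non-regex idea might look like the sketch below. It uses str.partition instead of str.find, because find returns -1 when the keyword is missing and slicing with s[:-1] would then silently return the wrong text. The assumption that the year is the last whitespace-separated token holds for the sample in the question, but not for lines that carry trailing metadata.

```python
col4 = "May god bless our families studied. CiteSeerX  2009-05-24 2007-11-19 2004"

# partition splits at the first occurrence of the separator and returns
# (text_before, separator, text_after); separator is "" if it was not found.
title, keyword, rest = col4.partition("CiteSeerX")

if keyword:
    title = title.strip()
    # Assumption: the year is the last whitespace-separated token on the line.
    tokens = rest.split()
    year = tokens[-1] if tokens else None
    print(title, year)
else:
    print("Keyword not found")
```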

