
How to extract text before a specific keyword in Python?

import re
col4="""May god bless our families studied. CiteSeerX  2009-05-24 2007-11-19 2004"""
b=re.findall(r'\sCiteSeerX',col4)
print(b)

I have to print "May god bless our families studied". I'm using Python regular expressions to extract the file name, but I'm only getting CiteSeerX as output. I'm doing this on a very large dataset, so I only want to use a regular expression if it is efficient; if there is a faster way, please point it out.
I also want the last year, 2004, as output. I'm new to regular expressions and I know that my implementation above is wrong, but I can't find a correct one. This is a very naive question. I'm sorry, and thank you in advance.

If the structure of all your data is similar to the sample you provided, this should get you going:

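import re
col4 = """May god bless our families studied. CiteSeerX  2009-05-24 2007-11-19 2004"""
# raw string so that \d reaches the regex engine unchanged
data = re.findall(r"(.*?) CiteSeerX.*(\d{4})$", col4)
if data:
    # we have a match: unpack both capturing groups of the first match
    title, year = data[0]
    print(title, year)
else:
    print("Unable to parse the string")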

# Output: May god bless our families studied. 2004

This snippet extracts everything before CiteSeerX as the title and the last four digits as the year (again, assuming that all of your data has a similar structure). The parentheses mark the capturing groups for the parts that we are interested in.
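To make the role of each part of the pattern easier to see, here is a sketch of the same expression written with re.VERBOSE, which allows comments inside the pattern. The name PATTERN is only illustrative, and the sketch assumes the same input structure as the sample string.

```python
import re

# Same pattern as above, spelled out with re.VERBOSE so each part can be commented.
# In verbose mode unescaped whitespace is ignored, so the literal space is written as "\ ".
PATTERN = re.compile(r"""
    (.*?)          # group 1: the title, matched lazily up to the keyword
    \ CiteSeerX    # a literal space followed by the keyword
    .*             # anything in between (the dates), matched greedily
    (\d{4})        # group 2: the last four digits on the line
    $              # end of the string
""", re.VERBOSE)

col4 = """May god bless our families studied. CiteSeerX  2009-05-24 2007-11-19 2004"""
match = PATTERN.search(col4)
if match:
    # groups() returns the captured parts as a tuple: (title, year)
    title, year = match.groups()
    print(title, year)
```

Because the first group is lazy and the middle `.*` is greedy, the title stops at the first " CiteSeerX" and the year is taken from the very end of the string.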

Update: if there is metadata following the year of publication, use the following regular expression:

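import re
# raw strings so that backslashes reach the regex engine unchanged
YEAR = r"\d{4}"
DATE = r"\d\d\d\d-\d\d-\d\d"
def parse_citation(s):
    # title, keyword, two dates, then the captured year; anything after the year is ignored
    regex = r"(.*?) CiteSeerX\s+{date} {date} ({year}).*$".format(date=DATE, year=YEAR)
    data = re.findall(regex, s)
    if data:
        # findall returns a list of (title, year) tuples; return the first match
        return data[0]
    else:
        return None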

c1 = """May god bless our families studied. CiteSeerX  2009-05-24 2007-11-19 2004"""
c2 = """May god bless our families studied. CiteSeerX 2009-05-24 2007-11-19 2004 application/pdf text http //citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.1.1483 http //www.biomedcentral.com/content/pdf/1471-2350-5-20.pdf en Metadata may be used without restrictions as long as the oai identifier remains attached to it."""
print(parse_citation(c1))
print(parse_citation(c2))
# Output:
# ('May god bless our families studied.', '2004')
# ('May god bless our families studied.', '2004')
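Since the question mentions a very large dataset, one option is to compile the pattern once and reuse it for every record. The following is a minimal sketch, not a benchmarked solution: the names CITATION_RE and iter_citations are made up for illustration, and it assumes every record has two dates before the year, as in the samples. Note that the re module already caches recently used patterns, so the gain from compiling up front may be small.

```python
import re

DATE = r"\d{4}-\d{2}-\d{2}"
# Compiled a single time, outside the loop.
# Assumption: two dates precede the year, as in the sample records.
CITATION_RE = re.compile(r"(.*?) CiteSeerX\s+" + DATE + " " + DATE + r" (\d{4})")

def iter_citations(lines):
    """Yield (title, year) for each line that matches; skip the rest."""
    for line in lines:
        match = CITATION_RE.match(line)
        if match:
            yield match.groups()

records = [
    "May god bless our families studied. CiteSeerX  2009-05-24 2007-11-19 2004",
    "a line without the keyword",
]
print(list(iter_citations(records)))
```

Passing an open file object instead of a list would process the data line by line without loading it all into memory.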

Here is an answer that doesn't use regex: str.find returns the index of the keyword, and slicing up to that index gives the text before it.

>>> s = "now is the time for all good men"
>>> s.find("all")
20
>>> s[:20]
'now is the time for '
>>> 
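Applied to the string from the question, the same idea might look like the sketch below. It uses str.partition instead of str.find, because find returns -1 when the keyword is missing and slicing with -1 would silently drop the last character. The helper name text_before is hypothetical, and taking the last whitespace-separated token as the year assumes the year ends the line.

```python
def text_before(s, keyword):
    """Return the text preceding keyword, or None if keyword is absent."""
    head, sep, _tail = s.partition(keyword)
    if not sep:
        # partition returns an empty separator when the keyword was not found
        return None
    return head.strip()

col4 = "May god bless our families studied. CiteSeerX  2009-05-24 2007-11-19 2004"
title = text_before(col4, "CiteSeerX")
year = col4.split()[-1]  # last token; assumes the year is the final field
print(title, year)
```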
