
Python - extracting digit from a pandas series containing text

This question follows up on an earlier one ( here ). I am quite new to Python and keep getting stuck on fairly trivial issues. I have a data series as follows:

         Text
0        some texts...qualifications: BE year of passing 2012
1        MCOM from XYZ University in 2007. In 2009 he obtained his MBA 
2        Academics: University / Board: XYZ University   year of passing:2014

The objective is to extract the first year mentioned in each row, i.e. 2012, 2007, 2014. My approach is as follows:

corpus = pd.Series('the above series')
corpus = corpus.str.replace(r'^[A-Za-z0-9]+')
corpus = corpus.str.lower()
if corpus.str.contains('qualifications').any():
    corpus.str.extract('.*qualifications.*?(\d{4})', expand = False)
if corpus.str.contains('university').any():
    corpus.str.extract('.*university. *?(d\{4})', expand=False)
if corpus.str.contains('academics').any():
    corpus.str.extract('.*academics. *?(d\{4})',expand=False)

The above approach is creating a blank series. Kindly help me in solving this.

I think you can simplify that to a single expression:

Code:

# Raw string (r'...') so that \d reaches the regex engine unchanged.
corpus = corpus.str.lower().str.extract(
    r'(university|academics|qualifications).*?(\d{4})', expand=False)
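As a side note on why the patterns in the question come back empty: two of them are written as `(d\{4})`, which asks for the literal characters `d{4}` rather than four digits, and the results of `str.extract` are never assigned to anything. A minimal sketch with the standard `re` module, using one row of the sample data (the variable names `broken` and `fixed` are only for illustration):

```python
import re

text = "academics: university / board: xyz university   year of passing:2014"

# Pattern as typed in the question: `d\{4}` matches the literal text "d{4}".
broken = re.search(r".*university. *?(d\{4})", text)

# Intended pattern: `\d{4}` matches four consecutive digits.
fixed = re.search(r".*?university.*?(\d{4})", text)

print(broken)          # None, because "d{4}" never appears in the text
print(fixed.group(1))  # 2014
```

The same backslash placement applies inside `Series.str.extract`, since pandas hands the pattern to the same regex engine.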

Test Code:

import pandas as pd

# Build the sample series from the question, one row per line.
corpus = pd.Series("""
    some texts...qualifications: BE year of passing 2012
    MCOM from XYZ University in 2007. In 2009 he obtained his MBA
    Academics: University / Board: XYZ University   year of passing:2014
    """.split('\n')[1:-1], name='Text')

# Two capture groups, so the result is a DataFrame: keyword and year.
corpus = corpus.str.lower().str.extract(
    r'(university|academics|qualifications).*?(\d{4})', expand=False)

print(corpus)

Results:

                0     1
0  qualifications  2012
1      university  2007
2       academics  2014
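If only the years are wanted, one option is to make the keyword group non-capturing with `(?:...)`. The pattern then has a single capture group, and `str.extract(..., expand=False)` returns a Series instead of a two-column DataFrame. This is a sketch on the same sample rows, not the only way to do it; the name `years` is just illustrative.

```python
import pandas as pd

corpus = pd.Series([
    "some texts...qualifications: BE year of passing 2012",
    "MCOM from XYZ University in 2007. In 2009 he obtained his MBA",
    "Academics: University / Board: XYZ University   year of passing:2014",
], name="Text")

# (?:...) groups the keywords without capturing them, leaving (\d{4})
# as the only capture group, so the result is a Series of year strings.
years = corpus.str.lower().str.extract(
    r"(?:university|academics|qualifications).*?(\d{4})", expand=False)

print(years.tolist())  # ['2012', '2007', '2014']
```

The extracted values are strings; `pd.to_numeric(years)` would convert them to integers if arithmetic on the years is needed.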

