
Extracting dates using regex in Python

I want to extract the year from my DataFrame column data3['CopyRight'].

CopyRight
2015 Sony Music Entertainment
2015 Ultra Records , LLC under exclusive license
2014 , 2015 Epic Records , a division of Sony Music Entertainment
Compilation ( P ) 2014 Epic Records , a division of Sony Music Entertainment
2014 , 2015 Epic Records , a division of Sony Music Entertainment
2014 , 2015 Epic Records , a division of Sony Music Entertainment

I am using the code below to extract the year:

data3['CopyRight_year'] = data3['CopyRight'].str.extract('([0-9]+)', expand=False).str.strip()

With my code I only get the first occurrence of a year in each row.

CopyRight_year
2015
2015
2014
2014
2014
2014

I want to extract all the years mentioned in the column.

Expected Output

CopyRight_year
    2015
    2015
    2014,2015
    2014
    2014,2015
    2014,2015

Your current regex captures only the first run of digits. To capture comma-separated years as well, extend the regex to this:

[0-9]+(?:\s+,\s+[0-9]+)*

Here [0-9]+ matches a number, and (?:\s+,\s+[0-9]+)* matches one or more whitespace characters, a comma, one or more whitespace characters again, and then another number. That whole group may repeat zero or more times, as often as it occurs in the data.
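To see what each part of the pattern contributes, here is a minimal sketch that applies it to single strings with Python's standard re module instead of the DataFrame column. The sample strings come from the question; the helper name first_year_run is made up for illustration.

```python
import re

# Same pattern as above: a number, then zero or more ", number" groups
# where the comma has whitespace on both sides.
YEAR_RUN = re.compile(r'[0-9]+(?:\s+,\s+[0-9]+)*')

def first_year_run(text):
    """Return the first comma-separated run of numbers, whitespace removed.

    Hypothetical helper for illustration; returns None when nothing matches.
    """
    match = YEAR_RUN.search(text)
    if match is None:
        return None
    return re.sub(r'\s+', '', match.group(0))

print(first_year_run('2015 Sony Music Entertainment'))
print(first_year_run('2014 , 2015 Epic Records , a division of Sony Music Entertainment'))
print(first_year_run('Compilation ( P ) 2014 Epic Records , a division of Sony Music Entertainment'))
```

Note that the pattern requires whitespace on both sides of the comma, so a string written as '2014, 2015' would yield only '2014'.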


Change your pandas DataFrame line to this:

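# Raw strings keep the backslashes intact; regex=True is needed on pandas 2.0+,
# where str.replace treats the pattern as a literal string by default.
data3['CopyRight_year'] = data3['CopyRight'].str.extract(r'([0-9]+(?:\s+,\s+[0-9]+)*)', expand=False).str.replace(r'\s+', '', regex=True)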

This prints:

                                           CopyRight CopyRight_year
0                      2015 Sony Music Entertainment           2015
1   2015 Ultra Records , LLC under exclusive license           2015
2  2014 , 2015 Epic Records , a 1999 division of ...      2014,2015
3  Compilation ( P ) 2014 Epic Records , a divisi...           2014
4  2014 , 2015 Epic Records , a division of Sony ...      2014,2015
5  2014 , 2015 Epic Records , a division of Sony ...      2014,2015

That said, I like jezrael's answer, which uses findall and join; it is a more flexible and cleaner approach.
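One way to see the difference in flexibility is a string with a separate, later year, like the visible part of row 2 in the output above. The sketch below uses the standard re module on that one string; it is an illustration, not the exact code from either answer.

```python
import re

# Visible part of row 2 from the printed output above.
text = '2014 , 2015 Epic Records , a 1999 division'

# First comma-separated run only, which is what the extract-based line keeps.
run = re.search(r'[0-9]+(?:\s+,\s+[0-9]+)*', text)
first_run = re.sub(r'\s+', '', run.group(0))

# Every standalone 4-digit number, wherever it appears in the string.
all_years = ','.join(re.findall(r'\b\d{4}\b', text))

print(first_run)
print(all_years)
```

The extract-style pattern stops after the first run, so 1999 is dropped, while the findall-style pattern keeps it.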

Use findall with a regex to collect all 4-digit integers in each row into a list, then join each list with a separator:

Thank you @Wiktor Stribiżew for the idea of adding word boundaries, r'\b\d{4}\b':

data3['CopyRight_year'] = data3['CopyRight'].str.findall(r'\b\d{4}\b').str.join(',')
print(data3)
                                           CopyRight CopyRight_year
0                      2015 Sony Music Entertainment           2015
1   2015 Ultra Records , LLC under exclusive license           2015
2  2014 , 2015 Epic Records , a division of Sony ...      2014,2015
3  Compilation ( P ) 2014 Epic Records , a divisi...           2014
4  2014 , 2015 Epic Records , a division of Sony ...      2014,2015
5  2014 , 2015 Epic Records , a division of Sony ...      2014,2015
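The word boundaries matter once a row contains a longer number. Here is a small sketch with the standard re module, whose findall is what pandas' str.findall applies to each element. The helper name years and the catalog number in the sample string are made up for illustration.

```python
import re

def years(text, pattern):
    """Join every match of pattern in text with commas (illustrative helper)."""
    return ','.join(re.findall(pattern, text))

# The trailing catalog number is invented to show the failure mode.
sample = '2014 , 2015 Epic Records , catalog 88875012345'

# Without boundaries, 4-digit chunks are cut out of the long number too.
print(years(sample, r'\d{4}'))

# With boundaries, only standalone 4-digit numbers are returned.
print(years(sample, r'\b\d{4}\b'))
```

Even with boundaries, any standalone 4-digit number is treated as a year; a stricter pattern would be needed if the column can contain other 4-digit values.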
