Extracting dates using regex in Python

Question

I want to extract the year from my DataFrame column data3['CopyRight'].

CopyRight
2015 Sony Music Entertainment
2015 Ultra Records , LLC under exclusive license
2014 , 2015 Epic Records , a division of Sony Music Entertainment
Compilation ( P ) 2014 Epic Records , a division of Sony Music Entertainment
2014 , 2015 Epic Records , a division of Sony Music Entertainment
2014 , 2015 Epic Records , a division of Sony Music Entertainment

I am using the code below to extract the year:

data3['CopyRight_year'] = data3['CopyRight'].str.extract('([0-9]+)', expand=False).str.strip()

With my code, I only get the first occurrence of a year in each row.

CopyRight_year
2015
2015
2014
2014
2014
2014

I want to extract all the years mentioned in the column.

Expected Output

CopyRight_year
    2015
    2015
    2014,2015
    2014
    2014,2015
    2014,2015

Answer 1

Your current regex captures only the first run of digits. To capture comma-separated years as well, extend the regex to this:

[0-9]+(?:\s+,\s+[0-9]+)*

Here [0-9]+ matches a number, and (?:\s+,\s+[0-9]+)* matches one or more whitespace characters, a comma, one or more whitespace characters, and then another number. The whole group may repeat zero or more times, depending on the data.
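To see how the pattern behaves outside pandas, here is a minimal sketch using the standard re module on rows from the question. The helper name first_year_run is made up for illustration; it is not part of pandas or of the original answer.

```python
import re

# Pattern from the answer: a number, optionally followed by
# " , number" groups (comma with whitespace on both sides).
YEAR_RUN = re.compile(r'[0-9]+(?:\s+,\s+[0-9]+)*')

def first_year_run(text):
    """Hypothetical helper: first comma-separated run of numbers, whitespace removed."""
    match = YEAR_RUN.search(text)
    if match is None:
        return None
    return re.sub(r'\s+', '', match.group(0))

samples = [
    "2015 Sony Music Entertainment",
    "2014 , 2015 Epic Records , a division of Sony Music Entertainment",
    "Compilation ( P ) 2014 Epic Records , a division of Sony Music Entertainment",
]
for s in samples:
    print(first_year_run(s))
# prints:
# 2015
# 2014,2015
# 2014
```

Note that the pattern requires whitespace on both sides of the comma, which fits this data set; a string such as "2014, 2015" would only yield "2014".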

Change your pandas DataFrame line to this:

data3['CopyRight_year'] = data3['CopyRight'].str.extract(r'([0-9]+(?:\s+,\s+[0-9]+)*)', expand=False).str.replace(r'\s+', '', regex=True)

This prints:

                                           CopyRight CopyRight_year
0                      2015 Sony Music Entertainment           2015
1   2015 Ultra Records , LLC under exclusive license           2015
2  2014 , 2015 Epic Records , a 1999 division of ...      2014,2015
3  Compilation ( P ) 2014 Epic Records , a divisi...           2014
4  2014 , 2015 Epic Records , a division of Sony ...      2014,2015
5  2014 , 2015 Epic Records , a division of Sony ...      2014,2015

That said, I prefer jezrael's answer, which uses findall and join; it is more flexible and cleaner.
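The difference shows up when a year appears later in the text, away from the first run. Here is a small sketch with the standard re module; the row string is a hypothetical one modeled on row 2 of the output above, not the full original value.

```python
import re

# Hypothetical row, modeled on row 2 of the printed output
row = "2014 , 2015 Epic Records , a 1999 division"

# extract-style: only the first contiguous comma-separated run is captured
first_run = re.search(r'[0-9]+(?:\s+,\s+[0-9]+)*', row).group(0)
first_run = re.sub(r'\s+', '', first_run)

# findall-style: every 4-digit number in the row, joined by commas
all_years = ','.join(re.findall(r'\b\d{4}\b', row))

print(first_run)  # 2014,2015
print(all_years)  # 2014,2015,1999
```

Which result is wanted depends on the data: the first form ignores stray numbers later in the string, while the second collects every 4-digit number.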

Answer 2

Use findall with a regex to collect all 4-digit integers into lists, then join each list with a separator:

Thanks to @Wiktor Stribiżew for the idea of adding word boundaries, r'\b\d{4}\b':

data3['CopyRight_year'] = data3['CopyRight'].str.findall(r'\b\d{4}\b').str.join(',')
print(data3)
                                           CopyRight CopyRight_year
0                      2015 Sony Music Entertainment           2015
1   2015 Ultra Records , LLC under exclusive license           2015
2  2014 , 2015 Epic Records , a division of Sony ...      2014,2015
3  Compilation ( P ) 2014 Epic Records , a divisi...           2014
4  2014 , 2015 Epic Records , a division of Sony ...      2014,2015
5  2014 , 2015 Epic Records , a division of Sony ...      2014,2015
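As a rough illustration of why the word boundaries matter, the sketch below compares the pattern with and without \b using the standard re module. The sample text, including the catalog number, is invented for the example.

```python
import re

# Hypothetical text containing a longer digit run and a shorter one
text = "2014 Epic Records , catalog 20150101 , ref 123"

loose = re.findall(r'\d{4}', text)       # also slices 4-digit pieces out of longer numbers
strict = re.findall(r'\b\d{4}\b', text)  # only standalone 4-digit numbers

print(loose)   # ['2014', '2015', '0101']
print(strict)  # ['2014']
```

Series.str.findall applies the same matching to each element, so the same contrast should hold for the DataFrame column.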

2 answers

solution1
1 2019-02-24 08:57:45

solution2
1 ACCEPTED 2019-02-24 08:58:31
