
Extracting the project number with Python regex

Good day everyone. I want to extract the trailing digits after the last slash in the project_name column. I'm working on it but have run into the following issue:

  1. How can I extract the digits after the last slash without getting square brackets in the result? My current code almost works, but every result is wrapped in square brackets.

My code:

def project_name(name):
    return re.findall(r'\d{3}$',name)

data['project_name'] = data['project_name'].apply(project_name)

The data:

project_name    
 ----------
   ASAHI,PT-PRO/PTN/06-2012/192          
   CIMB NIAGA-PRO/PTN/06-2012/174        
   FRAMAS INDONESIA-PRO/PTN/06-2012/210    
   DM STOCK 2015   
   PERBAIKAN OH TM 366 PLANT DAWUAN 
   Ruko-PRO/PTN/03-2012/47

Expected output:

 project_name   
 ----------     
   192            
   174            
   210            
   NaN
   NaN            
   NaN            
    47            

All advice and input are appreciated. Thanks everyone

Use Series.str.extract and add / to the regex:

data['project_name'] = data['project_name'].str.extract(r'/(\d{3}$)')
print (data)
  project_name
0          192
1          174
2          210
3          NaN
4          NaN
5          NaN
6          NaN
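
The expected output in the question also lists 47, which the three-digit pattern above cannot match. The following is a minimal sketch, assuming the project number can have any number of digits and that stray whitespace around the names should be ignored (both are assumptions, not something the question confirms). It rebuilds the sample rows and uses \d+ instead; the name project_number is only illustrative:

```python
import pandas as pd

# Sample rows copied from the question.
data = pd.DataFrame({
    "project_name": [
        "ASAHI,PT-PRO/PTN/06-2012/192",
        "CIMB NIAGA-PRO/PTN/06-2012/174",
        "FRAMAS INDONESIA-PRO/PTN/06-2012/210",
        "DM STOCK 2015",
        "PERBAIKAN OH TM 366 PLANT DAWUAN",
        "Ruko-PRO/PTN/03-2012/47",
    ]
})

# Assumption: the number is "one or more digits after the last slash".
# expand=False returns a Series instead of a one-column DataFrame.
project_number = (
    data["project_name"]
    .str.strip()  # drop leading/trailing whitespace so that $ can match
    .str.extract(r"/(\d+)$", expand=False)
)
print(project_number)
```

Rows without a slash followed by digits at the end stay NaN, while the last row yields 47.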

Solution with Series.str.findall:

data['project_name'] = data['project_name'].str.findall(r'/(\d{3}$)').str[0]
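
The .str[0] step is what removes the square brackets: it takes the first element of each list and gives NaN for an empty list instead of raising an error. A small sketch, with made-up variable names, that shows both stages:

```python
import pandas as pd

names = pd.Series(["ASAHI,PT-PRO/PTN/06-2012/192", "DM STOCK 2015"])

# One list per row; an empty list when nothing matches.
lists = names.str.findall(r"/(\d{3}$)")

# First element of each list; NaN where the list is empty.
first = lists.str[0]
print(first)
```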

Your own solution can be changed to use next and iter, which return a default value (np.nan) when there is no match:

import re
import numpy as np

def project_name(name):
    return next(iter(re.findall(r'/(\d{3})$',name)), np.nan)

data['project_name'] = data['project_name'].apply(project_name)
print (data)
  project_name
0          192
1          174
2          210
3          NaN
4          NaN
5          NaN
6          NaN
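
To see why the next/iter idiom avoids the square brackets, here is a minimal standalone sketch. The helper name first_match and its default argument are hypothetical, chosen only for illustration:

```python
import re

def first_match(pattern, text, default=None):
    # re.findall returns a list; iter() turns it into an iterator and
    # next() yields its first element, or `default` if the list is empty.
    return next(iter(re.findall(pattern, text)), default)

print(first_match(r'/(\d{3})$', 'ASAHI,PT-PRO/PTN/06-2012/192'))  # 192
print(first_match(r'/(\d{3})$', 'DM STOCK 2015'))                  # None
```

Passing np.nan as the default, as in the answer above, makes the missing rows show up as NaN in the DataFrame.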

Instead of

def project_name(name):
    return re.findall(r'\d{3}$',name)

use

def project_name(name):
    return re.findall(r'\d{3}$',name)[0]

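Because a successful match yields a one-element list, index 0 returns the value itself. Note that [0] raises an IndexError on rows with no match, since re.findall then returns an empty list (for example "Ruko-PRO/PTN/03-2012/47", which does not end in three digits), so guard against that case:

import re
import numpy as np

def project_name(name):
    matches = re.findall(r'\d{3}$',name)
    return matches[0] if matches else np.nan  # NaN when nothing matched

data['project_name'] = data['project_name'].apply(project_name)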
