
How to extract a number of a certain length from a string in Python? [duplicate]

I have a dataframe which looks like this:

description     
1906 RES 330 ML
1906 RES 330ML
RES 335 c/6
RES 332 c/12

I want to extract the run of three consecutive digits from each row and save it in a new column 'volume'. My code is like this:

df['volume'] = df['description'].str.extract('([([\d]*[\d]){3,3}?])')

EXPECTED RESULTS SHOULD BE LIKE THIS:

volume
330
330
335
332

However, it gives the results like this:

volume
1906
1906
335
332

Can anyone help me fix this code? Thanks so much!!!

Might be overkill, but if you want to make sure you don't capture three digits that are part of a longer number (such as the 4-digit 1906), you might use this:

df['volume'] = df.description.str.extract(r'(?<!\d)(\d{3})(?!\d)', expand=False)    
print(df)

       description volume
0  1906 RES 330 ML    330
1   1906 RES 330ML    330
2      RES 335 c/6    335
3     RES 332 c/12    332

Specify expand=False so that the matches are returned as a single pd.Series rather than a DataFrame.


The regex:

  • (?<!\d) - negative lookbehind: the character before the set of 3 digits must not be a digit
  • (\d{3}) - matches exactly 3 digits and captures them
  • (?!\d) - negative lookahead: the character after the set of 3 digits must not be a digit
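The three parts above can be tried outside pandas with the standard re module. This is only a minimal sketch: the helper name first_three_digit_run and the list of sample strings are made up for illustration, and the pattern is the same one passed to str.extract above.

```python
import re

# Three digits that are neither preceded nor followed by another digit.
THREE_DIGITS = re.compile(r'(?<!\d)(\d{3})(?!\d)')

# Sample rows copied from the question's 'description' column.
descriptions = [
    "1906 RES 330 ML",
    "1906 RES 330ML",
    "RES 335 c/6",
    "RES 332 c/12",
]

def first_three_digit_run(text):
    """Return the first standalone run of exactly three digits, or None."""
    match = THREE_DIGITS.search(text)
    return match.group(1) if match else None

volumes = [first_three_digit_run(d) for d in descriptions]
print(volumes)  # ['330', '330', '335', '332']
```

The 1906 is skipped because every 3-digit window inside it has a digit on at least one side.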

You need to

  • not match any number of digits, three times, so delete the [\d]*
  • not match 3 digits within anything looking like a "word",
    especially not other digits, so use word boundary \b
  • not allow optional ?
  • not overdo the character set thing []

You do not need to:

  • use two capture groups ()

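This regex will find exactly three digits standing alone as a word (note that \b counts letters as word characters, so the 330 in 330ML is not matched):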

\b(\d{3})\b
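It may help to compare this word-boundary pattern with the lookaround pattern from the answer above on the question's own rows. The sketch below uses the standard re module; the helper name find_volume is hypothetical and only there for the comparison.

```python
import re

WORD_BOUNDARY = re.compile(r'\b(\d{3})\b')
LOOKAROUND = re.compile(r'(?<!\d)(\d{3})(?!\d)')

def find_volume(pattern, text):
    """Return the first captured group for pattern in text, or None."""
    match = pattern.search(text)
    return match.group(1) if match else None

for text in ["1906 RES 330 ML", "1906 RES 330ML"]:
    print(text, find_volume(WORD_BOUNDARY, text), find_volume(LOOKAROUND, text))

# "1906 RES 330 ML": both patterns find '330'.
# "1906 RES 330ML":  \b finds nothing, because 'M' is a word character and
#                    there is no word boundary between '0' and 'M';
#                    the lookaround pattern still finds '330'.
```

So \b is the simpler choice when the number is always separated by spaces or punctuation, while the lookarounds are needed when units such as ML can be attached directly to the digits.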

The regex you are looking for is \b[\d]{3}\b

For more information on \b, see the docs.
