I have a dataframe which looks like this:
description
1906 RES 330 ML
1906 RES 330ML
RES 335 c/6
RES 332 c/12
I want to extract the run of exactly three consecutive digits from each description and save it in a new column 'volume'. My code is like this:
df['volume'] = df['description'].str.extract('([([\d]*[\d]){3,3}?])')
The expected results should look like this:
volume
330
330
335
332
However, it gives results like this:
volume
1906
1906
335
332
Can anyone help me fix this code? Thanks so much!!!
Might be overkill, but if you want to make sure you don't capture digits that are part of a 4-digit number, you might use this:
df['volume'] = df.description.str.extract(r'(?<!\d)(\d{3})(?!\d)', expand=False)
print(df)
       description volume
0  1906 RES 330 ML    330
1   1906 RES 330ML    330
2      RES 335 c/6    335
3     RES 332 c/12    332
Specify expand=False, so that matches are returned as one pd.Series only.
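To make the effect of expand=False concrete, here is a minimal sketch that rebuilds the sample frame from the question and runs the same pattern both ways. The variable names (pattern, as_series, as_frame) are only illustrative, and the frame is reconstructed by hand from the four rows shown in the question.

```python
import pandas as pd

# Sample data reconstructed from the question.
df = pd.DataFrame({
    "description": [
        "1906 RES 330 ML",
        "1906 RES 330ML",
        "RES 335 c/6",
        "RES 332 c/12",
    ]
})

# Exactly three digits, not preceded and not followed by another digit.
pattern = r"(?<!\d)(\d{3})(?!\d)"

# With one capture group and expand=False, the result is a Series.
as_series = df["description"].str.extract(pattern, expand=False)

# With the default expand=True, the result is a one-column DataFrame.
as_frame = df["description"].str.extract(pattern)

df["volume"] = as_series
print(df)
```

Either form can be assigned to a new column; the Series form just avoids carrying around a one-column frame. Note that the extracted values are strings, so a later astype(int) would be needed if numeric volumes are wanted.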
The regex:
(?<!\d) - specifies that anything before a set of 3 digits is something that is not a digit
(\d{3}) - matches 3 digits
(?!\d) - specifies that anything after a set of 3 digits is something that is not a digit

You need to:
[\d]*
\b
?
[]
You do not need to:
()
This regex will find exactly three digits, alone:
\b(\d{3})\b
The regex you are looking for is \b[\d]{3}\b. For more information on \b, see the docs.
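One thing worth checking before picking between the \b patterns and the lookaround pattern: \b only matches between a word character and a non-word character, and digits and letters are both word characters. So there is no boundary between "330" and "ML" in "1906 RES 330ML". The sketch below uses plain re (not pandas) to compare the two patterns on the question's sample strings; the helper first_group is hypothetical and only exists for this illustration.

```python
import re

# Sample strings taken from the question.
descriptions = [
    "1906 RES 330 ML",
    "1906 RES 330ML",
    "RES 335 c/6",
    "RES 332 c/12",
]

lookaround = re.compile(r"(?<!\d)(\d{3})(?!\d)")  # not adjacent to other digits
boundary = re.compile(r"\b(\d{3})\b")             # surrounded by word boundaries


def first_group(pattern, text):
    """Return the first captured group, or None when nothing matches."""
    match = pattern.search(text)
    return match.group(1) if match else None


lookaround_results = [first_group(lookaround, d) for d in descriptions]
boundary_results = [first_group(boundary, d) for d in descriptions]

for text, a, b in zip(descriptions, lookaround_results, boundary_results):
    print(f"{text!r}: lookaround={a}, boundary={b}")
```

On this data the word-boundary version finds nothing in the "330ML" row, while the lookaround version returns 330 for it. If the unit can be glued to the number, the lookaround form is probably the safer choice.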