Regex to extract data from a string in a pandas column
Help me write a regex that processes the strings in rawDim and extracts the height, width, and depth (as float64 values).
Bonus: Is there a single regex that works for all 5 examples?
import pandas as pd
dim_df = pd.read_csv("dim_df_correct.csv")
dim_df
rawDim height width depth
0 19×52cm 19.0 52.0 NaN
1 50 x 66,4 cm 50.0 66.4 NaN
2 168.9 x 274.3 x 3.8 cm (66 1/2 x 108 x 1 1/2 in.) 168.9 274.3 3.8
3 Sheet: 16 1/4 × 12 1/4 in. (41.3 × 31.1 cm) Im... 35.6 25.1 NaN
4 5 by 5in 12.7 12.7 NaN
import re
import pandas as pd
You can use '(?P<height>[\d.]+)\s*(?:[x×]|by)\s*(?P<width>[\d.]+)\s*(?:[x×]\s*(?P<depth>[\d.]+))?' after first replacing decimal commas with dots:
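df[['rawDim']].join(
    # turn decimal commas into dots first (e.g. "66,4" -> "66.4")
    df['rawDim'].str.replace(r'(\d+),', r'\1.', regex=True)
      # named groups become the columns height, width and depth
      .str.extract(r'(?P<height>[\d.]+)\s*(?:[x×]|by)\s*(?P<width>[\d.]+)\s*(?:[x×]\s*(?P<depth>[\d.]+))?')
      .astype(float)
)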
Output:
rawDim height width depth
0 19×52cm 19.0 52.0 NaN
1 50 x 66,4 cm 50.0 66.4 NaN
2 168.9 x 274.3 x 3.8 cm (66 1/2 x 108 x 1 1/2 in.) 168.9 274.3 3.8
3 Sheet: 16 1/4 × 12 1/4 in. (41.3 × 31.1 cm) Im... 4.0 12.0 NaN
4 5 by 5in 5.0 5.0 NaN
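Rows 3 and 4 differ from the expected values in the question because the pattern captures the first number pair it finds and does no unit conversion. The following is a minimal sketch, using only the standard `re` module, of what gets captured for row 3. The string is copied as displayed (including the truncation), and the names `PATTERN`, `row3`, `first` and `all_pairs` are only illustrative.

```python
import re

# Same pattern as in the answer, split over two lines for readability.
PATTERN = re.compile(
    r"(?P<height>[\d.]+)\s*(?:[x×]|by)\s*(?P<width>[\d.]+)\s*"
    r"(?:[x×]\s*(?P<depth>[\d.]+))?"
)

# Row 3 exactly as shown in the table (truncated in the source).
row3 = "Sheet: 16 1/4 × 12 1/4 in. (41.3 × 31.1 cm) Im..."

# search() returns the leftmost match: the "4" of "1/4" followed by "× 12".
first = PATTERN.search(row3)
print(first.groupdict())

# finditer() lists every candidate pair, including the cm pair further on.
all_pairs = [(m["height"], m["width"]) for m in PATTERN.finditer(row3)]
print(all_pairs)
```

The first match is the fraction fragment, which is why the output shows 4.0 and 12.0 for that row; the cm pair only appears as a later match.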
NB. Add \s*cm\b to the pattern to ensure that only cm values are matched.
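The note above can be sketched as follows. This is a minimal example under two assumptions: that the `\s*cm\b` anchor is appended at the very end of the pattern (the note does not say where it goes), and that the five sample strings are rebuilt by hand instead of read from the CSV, with row 3 kept truncated as displayed.

```python
import pandas as pd

# Sample strings copied from the question (row 3 is truncated in the source).
df = pd.DataFrame({"rawDim": [
    "19×52cm",
    "50 x 66,4 cm",
    "168.9 x 274.3 x 3.8 cm (66 1/2 x 108 x 1 1/2 in.)",
    "Sheet: 16 1/4 × 12 1/4 in. (41.3 × 31.1 cm) Im...",
    "5 by 5in",
]})

# Assumption: the cm anchor goes at the very end of the original pattern.
pattern = (
    r"(?P<height>[\d.]+)\s*(?:[x×]|by)\s*(?P<width>[\d.]+)\s*"
    r"(?:[x×]\s*(?P<depth>[\d.]+))?\s*cm\b"
)

dims = (
    df["rawDim"]
    .str.replace(r"(\d+),", r"\1.", regex=True)  # decimal comma -> dot
    .str.extract(pattern)                        # one column per named group
    .astype(float)
)
result = df.join(dims)
print(result)
```

Under that assumption, row 4 (inches only) comes out as NaN in every column, and row 3 picks up the cm pair 41.3 and 31.1 instead of the fraction fragments.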