
Regex to extract data from a string in pandas column

Help me write a regex that processes the strings in rawDim to extract the height, width and depth (as float64 values).

Bonus: Is there a single regex for all 5 examples?

import pandas as pd
dim_df = pd.read_csv("dim_df_correct.csv")
dim_df

                                              rawDim  height  width  depth
0                                            19×52cm    19.0   52.0    NaN
1                                       50 x 66,4 cm    50.0   66.4    NaN
2  168.9 x 274.3 x 3.8 cm (66 1/2 x 108 x 1 1/2 in.)   168.9  274.3    3.8
3  Sheet: 16 1/4 × 12 1/4 in. (41.3 × 31.1 cm) Im...    35.6   25.1    NaN
4                                           5 by 5in    12.7   12.7    NaN


import re
import pandas as pd

You can use '(?P<height>[\d.]+)\s*(?:[x×]|by)\s*(?P<width>[\d.]+)\s*(?:[x×]\s*(?P<depth>[\d.]+))?':

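df[['rawDim']].join(
    # turn decimal commas into decimal points first, e.g. "66,4" -> "66.4"
    df['rawDim'].str.replace(r'(\d+),', r'\1.', regex=True)
    # the named groups become the height / width / depth columns;
    # depth is optional, so it is NaN when only two numbers are found
    .str.extract(r'(?P<height>[\d.]+)\s*(?:[x×]|by)\s*(?P<width>[\d.]+)\s*(?:[x×]\s*(?P<depth>[\d.]+))?')
    # extract returns strings, so convert them to float64
    .astype(float)
)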

Output:

                                              rawDim  height  width  depth
0                                            19×52cm    19.0   52.0    NaN
1                                       50 x 66,4 cm    50.0   66.4    NaN
2  168.9 x 274.3 x 3.8 cm (66 1/2 x 108 x 1 1/2 in.)   168.9  274.3    3.8
3  Sheet: 16 1/4 × 12 1/4 in. (41.3 × 31.1 cm) Im...     4.0   12.0    NaN
4                                           5 by 5in     5.0    5.0    NaN
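The pattern returns the first place in each string where two numbers are separated by x, × or by, which is why row 3 gives 4.0 and 12.0 (taken from "1/4 × 12") and row 4 keeps its value in inches. A minimal sketch with the standard re module shows this per string; the helper name extract_dims and the samples list are made up for this illustration, and the row 3 string is used in its truncated form exactly as displayed above:

```python
import re

# Same pattern as in the answer, compiled once for reuse.
PATTERN = re.compile(
    r'(?P<height>[\d.]+)\s*(?:[x×]|by)\s*(?P<width>[\d.]+)\s*'
    r'(?:[x×]\s*(?P<depth>[\d.]+))?'
)


def extract_dims(raw):
    """Hypothetical helper: return (height, width, depth); None marks a missing value."""
    # Same decimal-comma normalisation as the pandas version: "66,4" -> "66.4"
    normalised = re.sub(r'(\d+),', r'\1.', raw)
    match = PATTERN.search(normalised)  # search() stops at the first match
    if match is None:
        return (None, None, None)
    groups = match.group('height', 'width', 'depth')
    return tuple(float(g) if g is not None else None for g in groups)


samples = [
    "19×52cm",
    "50 x 66,4 cm",
    "168.9 x 274.3 x 3.8 cm (66 1/2 x 108 x 1 1/2 in.)",
    "Sheet: 16 1/4 × 12 1/4 in. (41.3 × 31.1 cm) Im...",  # truncated as shown
    "5 by 5in",
]

for raw in samples:
    print(raw, '->', extract_dims(raw))
```

Note that [\d.]+ also accepts strings such as "." or "1.2.3", which float() rejects, so messier data may need a stricter number pattern such as \d+(?:\.\d+)?.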

NB. Add \s*cm\b to the pattern to ensure that only cm values are matched.
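One way to read that note is to append \s*cm\b to the end of the pattern, so that a match must be followed by the unit. This is a sketch under that assumption; the sample frame is rebuilt inline from the five strings shown in the question (row 3 in its truncated form), and the names CM_PATTERN, cm_dims and result are made up for the example:

```python
import pandas as pd

# Sample data copied from the question; row 3 is truncated as displayed there.
df = pd.DataFrame({"rawDim": [
    "19×52cm",
    "50 x 66,4 cm",
    "168.9 x 274.3 x 3.8 cm (66 1/2 x 108 x 1 1/2 in.)",
    "Sheet: 16 1/4 × 12 1/4 in. (41.3 × 31.1 cm) Im...",
    "5 by 5in",
]})

# Assumption: the cm check goes at the very end, after the optional depth group.
CM_PATTERN = (
    r'(?P<height>[\d.]+)\s*(?:[x×]|by)\s*(?P<width>[\d.]+)\s*'
    r'(?:[x×]\s*(?P<depth>[\d.]+))?\s*cm\b'
)

cm_dims = (
    df["rawDim"]
    .str.replace(r'(\d+),', r'\1.', regex=True)  # decimal comma -> decimal point
    .str.extract(CM_PATTERN)                     # one column per named group
    .astype(float)                               # strings -> float64, no match -> NaN
)

result = df.join(cm_dims)
print(result)
```

With this variant row 3 picks up the first cm pair in the visible text (41.3 and 31.1) instead of the inch fractions, and row 4 yields NaN in every column because it contains no cm value. Reaching the 12.7 expected in the question would still need a separate inch-to-cm conversion.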

