How to extract numbers from a DataFrame column in Python?

Question

Recently I was working on a data cleaning assignment using the age_of_marriage dataset. When I started cleaning the data, I found a "height" column of object type. Its values are in feet-and-inches format.

I want to extract the feet and inches from the data and convert them to centimetres with a formula. I have the conversion formula ready, but I am not able to extract the values. I also want to convert them to an int datatype before applying the formula. I am stuck at this step.

 2   height  2449 non-null  object

I am trying to extract them using string manipulation, but I have not been able to. Can anybody help?

height
5'3"
5'4"

I have attached a GitHub link to access the dataset.

import numpy as np
import pandas as pd
from collections import Counter

agemrg = pd.read_csv('age_of_marriage_data.csv')

for height in range(len(height_list)):
    BrideGroomHeight = height_list[height].rstrip(height_list[height][-1])
    foot_int = int(BrideGroomHeight[0])
    inch_int = int(BrideGroomHeight[2:4])
    print(foot_int)
    print(inch_int)
    
    if height in ['nan']:
        continue

output - 
5
4
5
7
5
7
5
0
5
5
5
5
5
2
5
5
5
5
5
1
5
3
5
9
5
10
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_12772/2525694838.py in <module>
      1 for height in range(len(height_list)):
----> 2     BrideGroomHeight = height_list[height].rstrip(height_list[height][-1])
      3     foot_int = int(BrideGroomHeight[0])
      4     inch_int = int(BrideGroomHeight[2:4])
      5     print(foot_int)

AttributeError: 'float' object has no attribute 'rstrip'

There are some NaN values in the column, and because of them I am not able to perform this operation.

Answer 1

You can use .split() to get the feet and inches portions. If you are certain that a few NaN rows are the only irregular values you have to deal with, a simple version could be:

df['height_feet'] = df['height'].dropna().apply(lambda x: str(x).split("'")[0])
df['height_inches'] = df['height'].dropna().apply(lambda x: str(x).split("'")[-1][0:-1])
df[['height', 'height_feet', 'height_inches']]

Basically, the feet portion is the first piece of the split, and the inches portion is the last piece of the split without its final character (the trailing double quote).

Output:

>>> print(df[['height', 'height_feet', 'height_inches']])
     height height_feet height_inches
0      5'4"           5             4
1      5'7"           5             7
2      5'7"           5             7
3      5'0"           5             0
4      5'5"           5             5
...     ...         ...           ...
2562   5'3"           5             3
2563  5'11"           5            11
2564   5'3"           5             3
2565  4'11"           4            11
2566   5'2"           5             2

[2567 rows x 3 columns]
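The extracted columns above still hold strings. Here is a minimal sketch of the remaining steps the question asks for (integer conversion, then centimetres), assuming every non-missing value follows the `5'4"` pattern. The sample frame and the `height_cm` column name are made up for illustration and are not part of the dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical sample standing in for the real dataset, with one missing value
df = pd.DataFrame({"height": ["5'4\"", "5'11\"", np.nan, "4'11\""]})

# Drop the trailing double quote, then split on the apostrophe.
# The .str methods propagate NaN, so the missing row stays missing.
parts = df["height"].str.rstrip('"').str.split("'", expand=True)

# The nullable Int64 dtype stores integers and still allows missing values
df["height_feet"] = pd.to_numeric(parts[0]).astype("Int64")
df["height_inches"] = pd.to_numeric(parts[1]).astype("Int64")

# 1 ft = 30.48 cm, 1 in = 2.54 cm
df["height_cm"] = df["height_feet"] * 30.48 + df["height_inches"] * 2.54
print(df)
```

A plain `int` column cannot hold NaN, which is why this sketch uses `Int64`. If the NaN rows are dropped first, `astype(int)` would work as well.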

Answer 2

You can use str.extract:

df['height2'] = df['height'].str.extract(r'''(?P<ft>\d*)'(?P<in>\d+)"''') \
                            .astype(float).mul([30.48, 2.54]).sum(axis=1)

Or str.rstrip and str.split:

df['height3'] = df['height'].str.rstrip('"').str.split("'", expand=True) \
                            .astype(float).mul([30.48, 2.54]).sum(axis=1)

Output:

>>> df.filter(like='height')
     height  height2  height3
0      5'4"   162.56   162.56
1      5'7"   170.18   170.18
2      5'7"   170.18   170.18
3      5'0"   152.40   152.40
4      5'5"   165.10   165.10
...     ...      ...      ...
2562   5'3"   160.02   160.02
2563  5'11"   180.34   180.34
2564   5'3"   160.02   160.02
2565  4'11"   149.86   149.86
2566   5'2"   157.48   157.48

[2567 rows x 3 columns]
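One caveat for the NaN rows mentioned in the question: `sum(axis=1)` skips NaN by default, so a row with a missing height comes out as 0.0 rather than NaN. Here is a small sketch of one way to keep such rows missing, using the `min_count` argument of `DataFrame.sum`. The sample data, the `inch` group name and the `height_cm` column name are illustrative assumptions:

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one missing height
df = pd.DataFrame({"height": ["5'4\"", np.nan, "5'11\""]})

# Same idea as above: capture feet and inches, then convert to float
parts = df["height"].str.extract(r'''(?P<ft>\d+)'(?P<inch>\d+)"''').astype(float)
scaled = parts.mul([30.48, 2.54])  # feet -> cm, inches -> cm

# Default behaviour: the all-NaN row sums to 0.0
naive = scaled.sum(axis=1)

# min_count=2 requires both values to be present, otherwise the result is NaN
df["height_cm"] = scaled.sum(axis=1, min_count=2)
print(df)
```

Whether a missing height should become 0 or stay NaN depends on the later analysis. Keeping NaN avoids treating an unknown height as a real measurement.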
