How to extract numbers from a DataFrame column in Python?

Question

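Recently I have been working on a data-cleaning task using the age_of_marriage dataset. While cleaning the data, I found a "height" column of object dtype. Its values are in feet-and-inches format.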

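I want to extract the feet and the inches from each value and convert them to centimetres with a formula. I already have the conversion formula, but I cannot extract the values. I also want to convert them to an int dtype before applying the formula. This is where I am stuck.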

------ 2 height 2449 non-null object ------

I tried to extract them with string operations but could not get it to work. Can anyone help?

height
5'3"
5'4"

I have attached a GitHub link to access the dataset.

import numpy as np
import pandas as pd
from collections import Counter

agemrg = pd.read_csv('age_of_marriage_data.csv')

for height in range(len(height_list)):
    BrideGroomHeight = height_list[height].rstrip(height_list[height][-1])
    foot_int = int(BrideGroomHeight[0])
    inch_int = int(BrideGroomHeight[2:4])
    print(foot_int)
    print(inch_int)
    
    if height in ['nan']:
        continue

output - 
5
4
5
7
5
7
5
0
5
5
5
5
5
2
5
5
5
5
5
1
5
3
5
9
5
10
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_12772/2525694838.py in <module>
      1 for height in range(len(height_list)):
----> 2     BrideGroomHeight = height_list[height].rstrip(height_list[height][-1])
      3     foot_int = int(BrideGroomHeight[0])
      4     inch_int = int(BrideGroomHeight[2:4])
      5     print(foot_int)

AttributeError: 'float' object has no attribute 'rstrip'

There are some NaN values in the column, which is why this operation fails.

Answer 1

You can use .split() to get the feet and inches parts. If you are sure you only have a few NaN rows to deal with, a simple version could be:

df['height_feet'] = df['height'].dropna().apply(lambda x: str(x).split("'")[0])
df['height_inches'] = df['height'].dropna().apply(lambda x: str(x).split("'")[-1][0:-1])
df[['height', 'height_feet', 'height_inches']]

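Basically, the feet part is the first piece of the split, and the inches part is the last piece of the split without its final character (the closing double quote). Both new columns hold strings, and rows where height is NaN are left as NaN.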

Output:

>>> print(df[['height', 'height_feet', 'height_inches']])
     height height_feet height_inches
0      5'4"           5             4
1      5'7"           5             7
2      5'7"           5             7
3      5'0"           5             0
4      5'5"           5             5
...     ...         ...           ...
2562   5'3"           5             3
2563  5'11"           5            11
2564   5'3"           5             3
2565  4'11"           4            11
2566   5'2"           5             2

[2567 rows x 3 columns]
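
The question also asks for integer values and a result in centimetres, while the columns above are still strings. A minimal sketch of that last step follows, assuming a small made-up sample in place of the real dataset and using pandas' nullable Int64 dtype so that NaN rows survive the conversion. The height_cm column name is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical sample standing in for the real 'height' column
df = pd.DataFrame({"height": ["5'4\"", "5'11\"", np.nan, "4'11\""]})

# Drop the trailing double quote, then split on the apostrophe.
# Rows where height is NaN stay NaN in both resulting columns.
parts = df["height"].str.rstrip('"').str.split("'", expand=True)

# Nullable Int64 keeps missing values as <NA> instead of raising
df["height_feet"] = pd.to_numeric(parts[0]).astype("Int64")
df["height_inches"] = pd.to_numeric(parts[1]).astype("Int64")

# 1 foot = 30.48 cm, 1 inch = 2.54 cm
df["height_cm"] = df["height_feet"] * 30.48 + df["height_inches"] * 2.54

print(df)
```

A plain int dtype cannot hold missing values, so astype(int) would fail on the NaN rows. Int64 (capital I) is one way around that; dropping or filling the NaN rows first is another.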

Answer 2

You can use str.extract:

df['height2'] = df['height'].str.extract(r'''(?P<ft>\d*)'(?P<in>\d+)"''') \
                            .astype(float).mul([30.48, 2.54]).sum(axis=1)

Or str.split and str.rstrip:

df['height3'] = df['height'].str.rstrip('"').str.split("'", expand=True) \
                            .astype(float).mul([30.48, 2.54]).sum(axis=1)

Output:

>>> df.filter(like='height')
     height  height2  height3
0      5'4"   162.56   162.56
1      5'7"   170.18   170.18
2      5'7"   170.18   170.18
3      5'0"   152.40   152.40
4      5'5"   165.10   165.10
...     ...      ...      ...
2562   5'3"   160.02   160.02
2563  5'11"   180.34   180.34
2564   5'3"   160.02   160.02
2565  4'11"   149.86   149.86
2566   5'2"   157.48   157.48

[2567 rows x 3 columns]
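
One thing to watch with the NaN rows mentioned in the question: sum(axis=1) skips missing values by default, so a row whose height is NaN comes out as 0.0 rather than NaN. A small sketch of the difference, assuming a made-up three-row sample and using the min_count parameter of DataFrame.sum:

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one missing height
df = pd.DataFrame({"height": ["5'4\"", np.nan, "5'11\""]})

# Same pattern as above: feet before the apostrophe, inches before the quote
parts = df["height"].str.extract(r'''(?P<ft>\d*)'(?P<in>\d+)"''').astype(float)
cm = parts.mul([30.48, 2.54])

default_sum = cm.sum(axis=1)               # missing row becomes 0.0
guarded_sum = cm.sum(axis=1, min_count=2)  # missing row stays NaN

print(default_sum.tolist())
print(guarded_sum.tolist())
```

With min_count=2 a row is summed only when both the feet and the inches value are present, which keeps missing heights from turning into a height of zero.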
