
How to extract numbers from a DataFrame column in Python?

Recently I was working on a data-cleaning assignment using the age_of_marriage dataset. While cleaning the data, I found a "height" column of object type. Its values are in feet-and-inches format.

I want to extract the feet and inches from the data and convert them to centimetres using a formula. I have the formula ready for the conversion, but I am not able to extract the values. I also want to convert them to an int datatype before applying the formula. I am stuck at this step.

 2   height   2449 non-null   object

I am trying to extract them using string manipulation, but have not been able to. Can anybody help?

height
5'3"
5'4"

I have attached a GitHub link to access the dataset.

import numpy as np
import pandas as pd
from collections import Counter

agemrg = pd.read_csv('age_of_marriage_data.csv')

for height in range(len(height_list)):
    BrideGroomHeight = height_list[height].rstrip(height_list[height][-1])
    foot_int = int(BrideGroomHeight[0])
    inch_int = int(BrideGroomHeight[2:4])
    print(foot_int)
    print(inch_int)
    
    if height in ['nan']:
        continue

output - 
5
4
5
7
5
7
5
0
5
5
5
5
5
2
5
5
5
5
5
1
5
3
5
9
5
10
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_12772/2525694838.py in <module>
      1 for height in range(len(height_list)):
----> 2     BrideGroomHeight = height_list[height].rstrip(height_list[height][-1])
      3     foot_int = int(BrideGroomHeight[0])
      4     inch_int = int(BrideGroomHeight[2:4])
      5     print(foot_int)

AttributeError: 'float' object has no attribute 'rstrip'

There are some NaN values, because of which I am not able to perform this operation.
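For context, the AttributeError arises because pandas stores missing values as float NaN, and a float has no string methods such as rstrip. A minimal sketch of the same loop with the NaN check moved before any string operation; the three sample values are made up and stand in for the real height_list:

```python
import numpy as np
import pandas as pd

# Hypothetical sample standing in for the real 'height' values.
height_list = ["5'4\"", np.nan, "5'11\""]

parsed = []
for value in height_list:
    # Missing entries are float NaN, so test for them before
    # calling any string method.
    if pd.isna(value):
        continue
    # Drop the trailing inch mark, then split on the foot mark.
    feet_text, inch_text = value.rstrip('"').split("'")
    parsed.append((int(feet_text), int(inch_text)))

print(parsed)
```

Iterating over the values directly, rather than over range(len(...)), also avoids comparing an integer index against the string 'nan'.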

You can use .split() to get the feet and inches portions. If you are certain you only have to deal with a few NaN rows, then a simple version could be:

df['height_feet'] = df['height'].dropna().apply(lambda x: str(x).split("'")[0])
df['height_inches'] = df['height'].dropna().apply(lambda x: str(x).split("'")[-1][0:-1])
df[['height', 'height_feet', 'height_inches']]

Basically, the feet portion is the first piece of the split, and the inches portion is the last piece of the split without its final character (the trailing ").

Output:

>>> print(df[['height', 'height_feet', 'height_inches']])
     height height_feet height_inches
0      5'4"           5             4
1      5'7"           5             7
2      5'7"           5             7
3      5'0"           5             0
4      5'5"           5             5
...     ...         ...           ...
2562   5'3"           5             3
2563  5'11"           5            11
2564   5'3"           5             3
2565  4'11"           4            11
2566   5'2"           5             2

[2567 rows x 3 columns]
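The extracted columns above still hold strings, while the question asks for integers before applying the formula. A minimal sketch, assuming a small made-up sample in place of the real dataset: pd.to_numeric followed by the nullable Int64 dtype keeps missing rows as <NA> instead of raising. The height_cm column name is a hypothetical choice, and the factors 30.48 and 2.54 are centimetres per foot and per inch.

```python
import numpy as np
import pandas as pd

# Hypothetical sample standing in for the real dataset.
df = pd.DataFrame({"height": ["5'4\"", np.nan, "4'11\""]})

# Same extraction as in the snippet above; NaN rows are skipped and
# come back as NaN when the result is aligned to the original index.
df["height_feet"] = df["height"].dropna().apply(lambda x: str(x).split("'")[0])
df["height_inches"] = df["height"].dropna().apply(lambda x: str(x).split("'")[-1][0:-1])

# Convert the strings to nullable integers (missing rows become <NA>).
feet = pd.to_numeric(df["height_feet"]).astype("Int64")
inches = pd.to_numeric(df["height_inches"]).astype("Int64")

# 1 foot = 30.48 cm, 1 inch = 2.54 cm.
df["height_cm"] = feet * 30.48 + inches * 2.54

print(df[["height", "height_cm"]])
```

A plain int dtype cannot represent missing values, which is why the sketch uses Int64 (capital I) rather than int.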

You can use str.extract:

df['height2'] = df['height'].str.extract(r'''(?P<ft>\d*)'(?P<in>\d+)"''') \
                            .astype(float).mul([30.48, 2.54]).sum(axis=1)

Or str.split and str.rstrip:

df['height3'] = df['height'].str.rstrip('"').str.split("'", expand=True) \
                            .astype(float).mul([30.48, 2.54]).sum(axis=1)

Output:

>>> df.filter(like='height')
     height  height2  height3
0      5'4"   162.56   162.56
1      5'7"   170.18   170.18
2      5'7"   170.18   170.18
3      5'0"   152.40   152.40
4      5'5"   165.10   165.10
...     ...      ...      ...
2562   5'3"   160.02   160.02
2563  5'11"   180.34   180.34
2564   5'3"   160.02   160.02
2565  4'11"   149.86   149.86
2566   5'2"   157.48   157.48

[2567 rows x 3 columns]
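One caveat with both one-liners: DataFrame.sum skips NaN by default, so a row whose height is missing sums to 0.0 rather than NaN. A minimal sketch, using made-up sample data, of passing min_count=2 so that a result is produced only when both the feet and inches values are present; the height_cm column name is a hypothetical choice:

```python
import numpy as np
import pandas as pd

# Hypothetical sample standing in for the real dataset.
df = pd.DataFrame({"height": ["5'4\"", np.nan, "5'11\""]})

# Same extraction as above: one column for feet, one for inches.
parts = df["height"].str.extract(r'''(?P<ft>\d*)'(?P<in>\d+)"''').astype(float)

# Default behaviour: the all-NaN row sums to 0.0.
default_sum = parts.mul([30.48, 2.54]).sum(axis=1)

# With min_count=2 the row stays NaN unless both parts are valid.
df["height_cm"] = parts.mul([30.48, 2.54]).sum(axis=1, min_count=2)

print(default_sum)
print(df)
```

Keeping the missing heights as NaN means later statistics such as the mean are not pulled down by spurious zeros.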
