How to extract numbers from a DataFrame column in Python?
Recently I was working on a data cleaning assignment using the age_of_marriage dataset. When I started cleaning the data, I found a "height" column of object type, stored in feet-and-inches format.
I want to extract the feet and inches from the data and convert them to centimetres with a formula. I have the formula ready for the conversion, but I am not able to extract the values. I also want to convert them to an int datatype before applying the formula. I am stuck at this step.
 2   height   2449 non-null   object
I am trying to extract it using string manipulation, but I have not been able to. Can anybody help?
height |
---|
5'3" |
5'4" |
I have attached a GitHub link to access the dataset.
import numpy as np
import pandas as pd
from collections import Counter

agemrg = pd.read_csv('age_of_marriage_data.csv')

for height in range(len(height_list)):
    BrideGroomHeight = height_list[height].rstrip(height_list[height][-1])
    foot_int = int(BrideGroomHeight[0])
    inch_int = int(BrideGroomHeight[2:4])
    print(foot_int)
    print(inch_int)
    if height in ['nan']:
        continue
output -
5
4
5
7
5
7
5
0
5
5
5
5
5
2
5
5
5
5
5
1
5
3
5
9
5
10
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_12772/2525694838.py in <module>
1 for height in range(len(height_list)):
----> 2 BrideGroomHeight = height_list[height].rstrip(height_list[height][-1])
3 foot_int = int(BrideGroomHeight[0])
4 inch_int = int(BrideGroomHeight[2:4])
5 print(foot_int)
AttributeError: 'float' object has no attribute 'rstrip'
There are some NaN values, because of which I am not able to perform this operation.
You can use .split() to get the feet and inches portions. If you are certain you only have to deal with a few NaN rows, then a simple version could be:
df['height_feet'] = df['height'].dropna().apply(lambda x: str(x).split("'")[0])
df['height_inches'] = df['height'].dropna().apply(lambda x: str(x).split("'")[-1][0:-1])
df[['height', 'height_feet', 'height_inches']]
Basically, the feet portion is the first piece of the split, and the inches portion is the last piece of the split without its final character (the closing double quote).
Output:
>>> print(df[['height', 'height_feet', 'height_inches']])
height height_feet height_inches
0 5'4" 5 4
1 5'7" 5 7
2 5'7" 5 7
3 5'0" 5 0
4 5'5" 5 5
... ... ... ...
2562 5'3" 5 3
2563 5'11" 5 11
2564 5'3" 5 3
2565 4'11" 4 11
2566 5'2" 5 2
[2567 rows x 3 columns]
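The split above leaves both new columns as strings, while the question also asks for integers and a conversion to centimetres. A minimal sketch of those remaining steps, assuming every non-missing value looks like 5'4"; the small sample frame and the column name height_cm are made up for illustration and are not part of the original dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical sample standing in for the real dataset.
df = pd.DataFrame({"height": ["5'4\"", "5'11\"", np.nan, "4'11\""]})

# Drop the trailing double quote, then split on the apostrophe.
parts = df["height"].str.rstrip('"').str.split("'", expand=True)

# The nullable Int64 dtype keeps missing rows as <NA> instead of raising.
df["height_feet"] = pd.to_numeric(parts[0]).astype("Int64")
df["height_inches"] = pd.to_numeric(parts[1]).astype("Int64")

# 1 ft = 30.48 cm, 1 in = 2.54 cm.
df["height_cm"] = df["height_feet"] * 30.48 + df["height_inches"] * 2.54
```

Using Int64 (capital I) rather than plain int is a design choice here: plain int cannot hold missing values, so the NaN rows would otherwise have to be dropped or filled first.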
You can use str.extract:
df['height2'] = df['height'].str.extract(r'''(?P<ft>\d*)'(?P<in>\d+)"''') \
.astype(float).mul([30.48, 2.54]).sum(axis=1)
Or str.split and str.rstrip:
df['height3'] = df['height'].str.rstrip('"').str.split("'", expand=True) \
.astype(float).mul([30.48, 2.54]).sum(axis=1)
Output:
>>> df.filter(like='height')
height height2 height3
0 5'4" 162.56 162.56
1 5'7" 170.18 170.18
2 5'7" 170.18 170.18
3 5'0" 152.40 152.40
4 5'5" 165.10 165.10
... ... ... ...
2562 5'3" 160.02 160.02
2563 5'11" 180.34 180.34
2564 5'3" 160.02 160.02
2565 4'11" 149.86 149.86
2566 5'2" 157.48 157.48
[2567 rows x 3 columns]
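One caveat with both one-liners: sum(axis=1) skips NaN by default, so a row whose height is missing comes out as 0.0 rather than NaN. A small sketch, on a made-up sample frame, of passing min_count=1 so that missing heights stay missing; the column name height_cm and the group name inch are illustrative choices, not taken from the answer:

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one missing height.
df = pd.DataFrame({"height": ["5'4\"", np.nan, "4'11\""]})

# Extract feet and inches, then scale each column to centimetres.
parts = (
    df["height"]
    .str.extract(r'''(?P<ft>\d+)'(?P<inch>\d+)"''')
    .astype(float)
    .mul([30.48, 2.54])
)

# Default behaviour: an all-NaN row sums to 0.0.
default_total = parts.sum(axis=1)

# With min_count=1, a row needs at least one valid value, else the result is NaN.
df["height_cm"] = parts.sum(axis=1, min_count=1)
```

With this change the missing rows remain NaN in height_cm, which avoids mistaking them for a genuine height of 0 cm in later statistics.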