How to apply a function that splits multiple numbers to the fields of a column in a dataframe in Python?
I need to apply a function that splits multiple numbers from the fields of a dataframe.
This dataframe holds all the kids' measurements that a school needs: name, height, weight, and unique code, along with each kid's dream career.
They can appear in any order, but they always follow the rules from above. However, for some kids some measurements may be missing (e.g. height, unique code, both, or all of them).
So the dataframe is this:
import pandas as pd

data = {
    'Dream Career': ['Scientist', 'Astronaut', 'Software Engineer', 'Doctor',
                     'Fashion Designer', 'Teacher', 'Architect'],
    'Measurements': ['Rachel 24.3 100.25 100 AX', '100.5 Tuuli 30.1',
                     'Michael 28.0 7771AX 113.75', 'Vivien Ester 40AX 115.20',
                     'Oliver 40.5', 'Julien 35.1 678 AX 111.1',
                     'Bee 20.0 100.80 88AX'],
}
df = pd.DataFrame(data, columns=['Dream Career', 'Measurements'])
And it looks like this:
        Dream Career                Measurements
0          Scientist   Rachel 24.3 100.25 100 AX
1          Astronaut            100.5 Tuuli 30.1
2  Software Engineer  Michael 28.0 7771AX 113.75
3             Doctor    Vivien Ester 40AX 115.20
4   Fashion Designer                 Oliver 40.5
5            Teacher    Julien 35.1 678 AX 111.1
6          Architect        Bee 20.0 100.80 88AX
I am trying to split all of these measurements into different columns, based on the specified rules.
So the final dataframe should look like this:
        Dream Career         Names  Weight  Height Unique Code
0          Scientist        Rachel    24.3  100.25       100AX
1          Astronaut         Tuuli    30.1  100.50         NaN
2  Software Engineer       Michael    28.0  113.75      7771AX
3             Doctor  Vivien Ester     NaN  115.20        40AX
4   Fashion Designer        Oliver    40.5     NaN         NaN
5            Teacher        Julien    35.1  111.10       678AX
6          Architect           Bee    20.0  100.80        88AX
I tried this code and it works very well, but only on single strings. I need to do this inside the dataframe and still keep each kid's associated dream career (so the order is not lost).
import re
import numpy as np

num_rx = r'[-+]?\.?\d+(?:,\d{3})*\.?\d*(?:[eE][-+]?\d+)?'

def get_weight_height(s):
    # Values >= 100 are treated as height, anything smaller as weight.
    nums = re.findall(num_rx, s)
    height = np.nan
    weight = np.nan
    if len(nums) == 1:
        if float(nums[0]) >= 100:
            height = nums[0]
        else:
            weight = nums[0]
    elif len(nums) == 2:
        if float(nums[0]) >= 100:
            height = nums[0]
            weight = nums[1]
        else:
            height = nums[1]
            weight = nums[0]
    return height, weight

class_code = {'Small': 'AX', 'Mid': 'BX', 'High': 'CX'}

def hasNumbers(inputString):
    return any(char.isdigit() for char in inputString)

def extract_measurements(string, substring_name):
    height = np.nan
    weight = np.nan
    unique_code = np.nan
    name = ''
    if hasNumbers(string):
        if substring_name in string:
            # The number directly in front of the class code (e.g. '100 AX', '88AX').
            special_match = re.search(rf'{num_rx}(?=\s*{substring_name}\b)', string)
            if special_match:
                unique_code = special_match.group() + substring_name
                # Cut out only the matched span; str.replace would also strip the
                # same digits from other numbers (e.g. '100' from '100.25').
                string = string[:special_match.start()] + string[special_match.end():]
        # The original condition `len(nums) >= 2 & len(nums) <= 3` mixed up `&`
        # and `and`, and both branches made the same call, so it is called once here.
        height, weight = get_weight_height(string)
    name = " ".join(re.findall("[a-zA-Z]+", string))
    name = name.replace(substring_name, '').strip()
    return format(float(height), '.2f'), float(weight), unique_code, name
And I apply it like this:
string = 'Anya 101.30 23 4546AX'
height, weight, unique_code, name = extract_measurements(string, class_code['Small'])
print( 'name is: ', name, '\nh is: ', height, '\nw is: ', weight, '\nunique code is: ', unique_code)
The results are very good.
I tried to apply the function to the dataframe, but I don't know how. I tried the following, inspired by several other answers, but they all deal with a different problem from mine:
df['height'], df['weight'], df['unique_code'], df['name'] = extract_measurements(df['Measurements'], class_code['Small'])
I cannot figure out how to apply it to my dataframe. Please help me.
I am just starting out, and I would highly appreciate any help!
Use apply for rows (axis=1) and choose the 'expand' option. Then rename the columns and concat to the original df:
pd.concat([df,(df.apply(lambda row : extract_measurements(row['Measurements'], class_code['Small']), axis = 1, result_type='expand')
.rename(columns = {0:'height', 1:'weight', 2:'unique_code', 3:'name'})
)], axis = 1)
Output:
Dream Career Measurements height weight unique_code name
-- ----------------- -------------------------- -------- -------- ------------- ------------
0 Scientist Rachel 24.3 100.25 100 AX 100 100 100AX Rachel
1 Astronaut 100.5 Tuuli 30.1 100 100 nan Tuuli
2 Software Engineer Michael 28.0 7771AX 113.75 100 100 7771AX Michael
3 Doctor Vivien Ester 40AX 115.20 100 100 40AX Vivien Ester
4 Fashion Designer Oliver 40.5 100 100 nan Oliver
5 Teacher Julien 35.1 678 AX 111.1 100 100 678AX Julien
6 Architect Bee 20.0 100.80 88AX 100 100 88AX Bee
(Note: I stubbed the get_weight_height(string) function to always return 100, 100, because your code did not include it.)
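As a minimal sketch of the same pattern on a toy function: when the function applied with axis=1 returns a tuple, result_type='expand' turns each tuple element into its own column, which can then be renamed and concatenated back. The helper split_name_number, the frame df_demo and the column names are made up for illustration and are not part of the question's code.

```python
import pandas as pd

def split_name_number(text):
    # Hypothetical helper: returns (first token, second token as float).
    name, number = text.split()
    return name, float(number)

df_demo = pd.DataFrame({'raw': ['Ann 1.5', 'Bob 2.0']})

# result_type='expand' spreads each returned tuple over columns 0, 1, ...
expanded = df_demo.apply(
    lambda row: split_name_number(row['raw']),
    axis=1,
    result_type='expand',
).rename(columns={0: 'name', 1: 'number'})

# Row order is preserved, so the new columns line up with the original rows.
result = pd.concat([df_demo, expanded], axis=1)
print(result)
```

Because apply works row by row, the function only ever sees one string at a time, which is why the single-string function from the question can be reused unchanged.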
@piterbarg's answer seems efficient given the original functions, but the functions seem verbose to me. There is probably a simpler solution than what I'm doing, but what I have below replaces the functions in the OP with, I think, the same results.
First, change the column names to snake case for ease:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'dream_career': ['Scientist', 'Astronaut', 'Software Engineer', 'Doctor',
                     'Fashion Designer', 'Teacher', 'Architect'],
    'measurements': ['Rachel 24.3 100.25 100 AX', '100.5 Tuuli 30.1',
                     'Michael 28.0 7771AX 113.75', 'Vivien Ester 40AX 115.20',
                     'Oliver 40.5', 'Julien 35.1 678 AX 111.1',
                     'Bee 20.0 100.80 88AX']
})
First, the strings in .measurements are turned into lists. From here on, list comprehensions are applied to each list to filter values.
df.measurements = df.measurements.str.split()
0 [Rachel, 24.3, 100.25, 100, AX]
1 [100.5, Tuuli, 30.1]
2 [Michael, 28.0, 7771AX, 113.75]
3 [Vivien, Ester, 40AX, 115.20]
4 [Oliver, 40.5]
5 [Julien, 35.1, 678, AX, 111.1]
6 [Bee, 20.0, 100.80, 88AX]
Name: measurements, dtype: object
The second step is filtering out the standalone 'AX' from .measurements and appending 'AX' to all integers. This assumes the example is fully representative and all the height/weight measurements are floats; a different criterion could be used if this isn't the case.
df.measurements = df.measurements.apply(
    lambda val_list: [val for val in val_list if val != 'AX']
).apply(
    lambda val_list: [str(val) + 'AX' if val.isnumeric() else val
                      for val in val_list]
)
0 [Rachel, 24.3, 100.25, 100AX]
1 [100.5, Tuuli, 30.1]
2 [Michael, 28.0, 7771AX, 113.75]
3 [Vivien, Ester, 40AX, 115.20]
4 [Oliver, 40.5]
5 [Julien, 35.1, 678AX, 111.1]
6 [Bee, 20.0, 100.80, 88AX]
Name: measurements, dtype: object
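The two lambdas above can also be folded into one named function, which is easier to try out on a single row. This is a rough sketch under the same assumption the answer makes (a bare integer token is always the numeric part of a code); normalize_tokens is a name introduced here for illustration.

```python
def normalize_tokens(tokens, suffix='AX'):
    # Drop standalone suffix tokens, then re-attach the suffix to bare integers.
    # Assumes heights and weights are always written with a decimal point.
    kept = [tok for tok in tokens if tok != suffix]
    return [tok + suffix if tok.isnumeric() else tok for tok in kept]

print(normalize_tokens('Julien 35.1 678 AX 111.1'.split()))
# ['Julien', '35.1', '678AX', '111.1']
```

On the dataframe it would be used as df.measurements.apply(normalize_tokens). Note that a weight written as a bare integer (for example 30 instead of 30.0) would be mistaken for a code, which is the same limitation the lambdas have.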
.name and .unique_code are pretty easy to grab. With .unique_code I had to apply a second lambda function to insert NaNs. If there are missing values for .name in the original df, the same thing will need to be done there. When there are multiple names, they are joined together, separated by a space.
df['name'] = df.measurements.apply(
    lambda val_list: ' '.join([val for val in val_list if val.isalpha()])
)
df['unique_code'] = df.measurements.apply(
    lambda val_list: [val for val in val_list if 'AX' in val]
).apply(
    lambda x: np.nan if len(x) == 0 else x[0]
)
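One edge case worth keeping in mind: the test 'AX' in val would also pick up a name that happens to contain those letters. The sketch below is a slightly stricter variant, not the answer's own code; pick_name_and_code is a hypothetical helper that only accepts tokens ending in the suffix and containing at least one non-letter character.

```python
import math

def pick_name_and_code(tokens, suffix='AX'):
    # Name: all purely alphabetic tokens, joined by spaces.
    name = ' '.join(tok for tok in tokens if tok.isalpha())
    # Code: first token ending in the suffix that is not purely alphabetic,
    # so a name such as 'MAX' is not mistaken for a code.
    codes = [tok for tok in tokens if tok.endswith(suffix) and not tok.isalpha()]
    code = codes[0] if codes else math.nan
    return name, code

print(pick_name_and_code(['Vivien', 'Ester', '40AX', '115.20']))
# ('Vivien Ester', '40AX')
```

It expects the already normalized token lists (standalone 'AX' removed and re-attached to its number).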
For height and weight I needed to create a column of numerics first and work off that. Where values are missing, I have to come back around to deal with those.
import re

df['numerics'] = df.measurements.apply(
    lambda val_list: [float(val) for val in val_list
                      if not re.search('[a-zA-Z]', val)]
)
# As in the question's get_weight_height, values of 100 or more are heights
# and smaller values are weights.
df['height'] = df.numerics.apply(
    lambda val_list: [val for val in val_list if val >= 100]
).apply(
    lambda x: np.nan if len(x) == 0 else x[0]
)
df['weight'] = df.numerics.apply(
    lambda val_list: [val for val in val_list if val < 100]
).apply(
    lambda x: np.nan if len(x) == 0 else x[0]
)
Finally, .measurements and .numerics are dropped, and the df should be ready to go.
df = df.drop(columns=['measurements', 'numerics'])
        dream_career          name unique_code  height  weight
0          Scientist        Rachel       100AX  100.25    24.3
1          Astronaut         Tuuli         NaN  100.50    30.1
2  Software Engineer       Michael      7771AX  113.75    28.0
3             Doctor  Vivien Ester        40AX  115.20     NaN
4   Fashion Designer        Oliver         NaN     NaN    40.5
5            Teacher        Julien       678AX  111.10    35.1
6          Architect           Bee        88AX  100.80    20.0
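For comparison, the same steps can be collected into a single function that parses one string and returns a Series, so that one apply call produces all the new columns. This is only a sketch under stated assumptions: parse_measurements is a name introduced here, the suffix is fixed to 'AX', measurements are always written with a decimal point, and, following the question's get_weight_height, values of 100 or more are treated as heights and smaller ones as weights.

```python
import numpy as np
import pandas as pd

def parse_measurements(text, suffix='AX'):
    # Tokenize, drop standalone suffixes, re-attach the suffix to bare integers.
    tokens = [tok for tok in text.split() if tok != suffix]
    tokens = [tok + suffix if tok.isnumeric() else tok for tok in tokens]

    name = ' '.join(tok for tok in tokens if tok.isalpha())
    codes = [tok for tok in tokens if tok.endswith(suffix) and not tok.isalpha()]
    # Tokens without any letters are the plain measurements.
    numbers = [float(tok) for tok in tokens if not any(ch.isalpha() for ch in tok)]
    weights = [n for n in numbers if n < 100]    # assumed threshold
    heights = [n for n in numbers if n >= 100]

    return pd.Series({
        'name': name,
        'unique_code': codes[0] if codes else np.nan,
        'weight': weights[0] if weights else np.nan,
        'height': heights[0] if heights else np.nan,
    })

df = pd.DataFrame({
    'dream_career': ['Scientist', 'Astronaut', 'Software Engineer', 'Doctor',
                     'Fashion Designer', 'Teacher', 'Architect'],
    'measurements': ['Rachel 24.3 100.25 100 AX', '100.5 Tuuli 30.1',
                     'Michael 28.0 7771AX 113.75', 'Vivien Ester 40AX 115.20',
                     'Oliver 40.5', 'Julien 35.1 678 AX 111.1',
                     'Bee 20.0 100.80 88AX'],
})

# Series.apply with a function returning a Series yields a DataFrame
# with one column per key, aligned on the original index.
parsed = df['measurements'].apply(parse_measurements)
result = df[['dream_career']].join(parsed)
print(result)
```

Joining on the index keeps each kid's dream career next to the parsed values. Returning a Series per row is convenient but not the fastest option for large frames.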