How do I apply a function that splits multiple numbers from a dataframe field into columns in Python?

Question

I need to apply a function that splits multiple numbers out of the fields of a dataframe.

This dataframe holds all the measurements a school needs for each kid: name, height, weight, and unique code, plus their dream career.

The name consists only of alphabetic characters, but some kids might have both a first name and a middle name (e.g. Vivien Ester).
The height is known to be >= 100 cm for every child.
The weight is known to be < 70 kg for every child.
The unique code can be any number, but it is always associated with the letters 'AX'. The AX may not always be attached to the number (e.g. 7771AX); there may be a space between them (e.g. 100 AX).
Every kid has a dream career.

These values can appear in any order, but they always follow the rules above. However, some measurements may be missing for some kids (e.g. the height, the unique code, both, or all of them).

So the dataframe is this:
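The rules above can be sketched as a small per-token classifier. This is only an illustration: `classify_token` is a hypothetical helper that is not part of the original code, and it assumes the spaced form `100 AX` has already been joined into `100AX` (a bare `AX` token would otherwise be labelled as a name).

```python
import re

def classify_token(token):
    """Label one whitespace-separated token according to the stated rules."""
    if re.fullmatch(r"\d+AX", token):
        return "unique_code"
    if token.isalpha():
        return "name"
    try:
        value = float(token)
    except ValueError:
        return "unknown"
    if value >= 100:   # heights are >= 100 cm
        return "height"
    if value < 70:     # weights are < 70 kg
        return "weight"
    return "unknown"   # 70 <= value < 100 is not covered by the rules

tokens = "Michael 28.0 7771AX 113.75".split()
labels = [classify_token(t) for t in tokens]
print(labels)
```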

import pandas as pd

data = { 'Dream Career': ['Scientist', 'Astronaut', 'Software Engineer', 'Doctor', 'Fashion Designer', 'Teacher', 'Architect'],
    'Measurements': ['Rachel 24.3 100.25 100 AX', '100.5 Tuuli 30.1', 'Michael 28.0 7771AX 113.75', 'Vivien Ester 40AX 115.20', 'Oliver 40.5', 'Julien 35.1 678 AX 111.1', 'Bee 20.0 100.80 88AX']
       }

df = pd.DataFrame (data, columns = ['Dream Career','Measurements'])

And it looks like this:

        Dream Career                Measurements
0          Scientist   Rachel 24.3 100.25 100 AX
1          Astronaut            100.5 Tuuli 30.1
2  Software Engineer  Michael 28.0 7771AX 113.75
3             Doctor    Vivien Ester 40AX 115.20
4   Fashion Designer                 Oliver 40.5
5            Teacher    Julien 35.1 678 AX 111.1
6          Architect        Bee 20.0 100.80 88AX

I am trying to split all of these measurements into different columns, based on the rules above.

So the final dataframe should look like this:

        Dream Career         Names  Weight  Height Unique Code
0          Scientist        Rachel    24.3  100.25       100AX
1          Astronaut         Tuuli    30.1  100.50         NaN
2  Software Engineer       Michael    28.0  113.75      7771AX
3             Doctor  Vivien Ester     NaN  115.20        40AX
4   Fashion Designer        Oliver    40.5     NaN         NaN
5            Teacher        Julien    35.1  111.10       678AX
6          Architect           Bee    20.0  100.80        88AX

I tried the code below and it works very well, but only on single strings. I need to do this on the dataframe while keeping every kid's associated dream career (so the order is not lost).

import re

import numpy as np

num_rx = r'[-+]?\.?\d+(?:,\d{3})*\.?\d*(?:[eE][-+]?\d+)?'
def get_weight_height(s):
    nums = re.findall(num_rx, s)
    height = np.nan
    weight = np.nan
    if (len(nums) == 0):
        height = np.nan
        weight = np.nan
    elif (len(nums) == 1):
        if float(nums[0]) >= 100:
            height = nums[0]
            weight = np.nan
        else:
            weight = nums[0]
            height = np.nan
    elif (len(nums) == 2):
        if float(nums[0]) >= 100:
            height = nums[0]
            weight = nums[1]
        else:
            height = nums[1]
            weight = nums[0]
    return height, weight

class_code = {'Small': 'AX', 'Mid': 'BX', 'High': 'CX'}

def hasNumbers(inputString):
    return any(char.isdigit() for char in inputString)

def extract_measurements(string, substring_name):
    height = np.nan
    weight = np.nan
    unique_code = np.nan
    name = ''
    if hasNumbers(string):
        num_rx = r'[-+]?\.?\d+(?:,\d{3})*\.?\d*(?:[eE][-+]?\d+)?'
        nums = re.findall(num_rx, string)
        if (substring_name in string):
            special_match = re.search(rf'{num_rx}(?=\s*{substring_name}\b)', string)
            if special_match:
                unique_code = special_match.group()
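                # Remove only the matched span; str.replace would also strip the
                # same digits from other numbers (e.g. '100' inside '100.25').
                string = string[:special_match.start()] + string[special_match.end():]
                unique_code = unique_code + substring_name
            # Chained comparison; the original `>= 2 & ... <= 3` applied a bitwise & first.
            if 2 <= len(nums) <= 3: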
                height, weight = get_weight_height(string)
        else:
            height, weight = get_weight_height(string)
    name = " ".join(re.findall("[a-zA-Z]+", string))
    name = name.replace(substring_name,'')
    return format(float(height), '.2f'), float(weight), unique_code, name

And I apply it like this:

string = 'Anya 101.30 23 4546AX'
height, weight, unique_code, name = extract_measurements(string, class_code['Small'])        
print( 'name is: ', name, '\nh is: ', height, '\nw is: ', weight, '\nunique code is: ', unique_code)

The results are very good.

I tried to apply the function to the dataframe, but I don't know how. I tried the following, inspired by several other answers, though they all address problems different from mine:

df['height'], df['weight'], df['unique_code'], df['name'] = extract_measurements(df['Measurements'], class_code['Small'])

I cannot figure out how to apply it to my dataframe. Please help me.

I am just starting out, and I would highly appreciate any help!

Answer 1

Use apply on rows (axis=1) and choose the 'expand' option for result_type. Then rename the columns and concat the result to the original df:

pd.concat([df,(df.apply(lambda row : extract_measurements(row['Measurements'], class_code['Small']), axis = 1, result_type='expand')
   .rename(columns = {0:'height', 1:'weight', 2:'unique_code', 3:'name'})
)], axis = 1)

Output:

    Dream Career       Measurements                  height    weight  unique_code    name
--  -----------------  --------------------------  --------  --------  -------------  ------------
 0  Scientist          Rachel 24.3 100.25 100 AX        100       100  100AX          Rachel
 1  Astronaut          100.5 Tuuli 30.1                 100       100  nan            Tuuli
 2  Software Engineer  Michael 28.0 7771AX 113.75       100       100  7771AX         Michael
 3  Doctor             Vivien Ester 40AX 115.20         100       100  40AX           Vivien Ester
 4  Fashion Designer   Oliver 40.5                      100       100  nan            Oliver
 5  Teacher            Julien 35.1 678 AX 111.1         100       100  678AX          Julien
 6  Architect          Bee 20.0 100.80 88AX             100       100  88AX           Bee

(Note: I stubbed the get_weight_height(string) function to always return 100, 100, because your code did not include it.)
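The attempt in the question most likely fails because it hands the whole Series to a function written for one string. Besides `apply(..., axis=1, result_type='expand')`, another way to spread a tuple-returning function over a column is to map it and build a frame on the same index. A minimal sketch, where `split_pair` and `demo` are hypothetical stand-ins for `extract_measurements` and the real dataframe:

```python
import pandas as pd

def split_pair(text):
    # Hypothetical stand-in for extract_measurements: one tuple per string.
    left, right = text.split("-")
    return left, int(right)

demo = pd.DataFrame({"label": ["x", "y"], "raw": ["a-1", "b-2"]})

# Map the function over each string, then build a frame on the same index
# so every parsed row stays attached to its original row.
parsed = pd.DataFrame(
    demo["raw"].map(split_pair).tolist(),
    index=demo.index,
    columns=["letters", "number"],
)
result = demo.join(parsed)
print(result)
```

Because `parsed` reuses `demo.index`, the join keeps each row's other columns (here `label`, in the question `Dream Career`) aligned with its parsed values.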

Answer 2

@piterbarg's answer seems efficient given the original functions, but those functions seem verbose to me. I'm sure there's a simpler solution than what I'm doing, but what I have below replaces the functions in the OP with, I think, the same results.

First, changing the column names to snake case for ease:

import numpy as np
import pandas as pd

df = pd.DataFrame({
     'dream_career': ['Scientist', 'Astronaut', 'Software Engineer', 'Doctor',
                      'Fashion Designer', 'Teacher', 'Architect'],
     'measurements': ['Rachel 24.3 100.25 100 AX', '100.5 Tuuli 30.1',
                      'Michael 28.0 7771AX 113.75', 'Vivien Ester 40AX 115.20',
                      'Oliver 40.5', 'Julien 35.1 678 AX 111.1',
                      'Bee 20.0 100.80 88AX']
})

First, the strings in .measurements are split into lists. From here on, list comprehensions are applied to each list to filter values.

df.measurements = df.measurements.str.split()

0    [Rachel, 24.3, 100.25, 100, AX]
1               [100.5, Tuuli, 30.1]
2    [Michael, 28.0, 7771AX, 113.75]
3      [Vivien, Ester, 40AX, 115.20]
4                     [Oliver, 40.5]
5     [Julien, 35.1, 678, AX, 111.1]
6          [Bee, 20.0, 100.80, 88AX]
Name: measurements, dtype: object

The second step filters the standalone 'AX' tokens out of .measurements and appends 'AX' to all integers. This assumes the example is fully representative and that all height/weight measurements are floats; a different way of telling them apart could be used if that isn't the case.

df.measurements = df.measurements.apply(
     lambda val_list: [val for val in val_list if val!='AX']
).apply(
      lambda val_list: [str(val)+'AX' if val.isnumeric() else val
                        for val in val_list]
)

0      [Rachel, 24.3, 100.25, 100AX]
1               [100.5, Tuuli, 30.1]
2    [Michael, 28.0, 7771AX, 113.75]
3      [Vivien, Ester, 40AX, 115.20]
4                     [Oliver, 40.5]
5       [Julien, 35.1, 678AX, 111.1]
6          [Bee, 20.0, 100.80, 88AX]
Name: measurements, dtype: object
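If the measurements cannot be assumed to be floats, one possible alternative is to join the spaced form (`100 AX`) with a regex before splitting, so that no integer has to be guessed as a code afterwards. A sketch under that assumption; the sample Series and the pattern below are illustrative, not part of the original answer:

```python
import pandas as pd

measurements = pd.Series([
    "Rachel 24.3 100.25 100 AX",
    "Michael 28.0 7771AX 113.75",
    "Julien 35.1 678 AX 111.1",
])

# Remove the whitespace between an integer and a standalone "AX".
# The lookbehind keeps the fractional digits of a float from matching.
joined = measurements.str.replace(r"(?<![\d.])(\d+)\s+AX\b", r"\1AX", regex=True)
tokens = joined.str.split()
print(tokens.tolist())
```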

.name and .unique_code are pretty easy to grab. With .unique_code I had to apply a second lambda function to insert NaNs. If any values for .name are missing in the original df, the same will need to be done there. Multiple names are joined together, separated by a space.

df['name'] = df.measurements.apply(
    lambda val_list: ' '.join([val for val in val_list if val.isalpha()])
)

df['unique_code'] = df.measurements.apply(
    lambda val_list: [val for val in val_list if 'AX' in val]
).apply(
    lambda x: np.nan if len(x)==0 else x[0]
)
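The "take the first match or insert NaN" step repeats for several columns, so it could be folded into one small helper. A sketch, assuming a hypothetical `first_or_nan` function and a two-row sample of the token lists:

```python
import numpy as np
import pandas as pd

def first_or_nan(values):
    """Return the first element of an iterable, or NaN if it is empty."""
    return next(iter(values), np.nan)

tokens = pd.Series([
    ["Rachel", "24.3", "100.25", "100AX"],
    ["Oliver", "40.5"],
])

# One apply per column instead of a filter followed by a second lambda.
unique_code = tokens.apply(
    lambda vals: first_or_nan(v for v in vals if v.endswith("AX"))
)
print(unique_code)
```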

For height and weight I needed to create a column of numerics first and work from that. Where values are missing, I have to come back around and deal with them the same way.

import re

df['numerics'] = df.measurements.apply(
    lambda val_list: [float(val) for val in val_list
                      if not re.search('[a-zA-Z]', val)]
)

# Heights are the values >= 100 (cm).
df['height'] = df.numerics.apply(
    lambda val_list: [val for val in val_list if val >= 100]
).apply(
    lambda x: np.nan if len(x)==0 else x[0]
)

# Weights are the values < 70 (kg).
df['weight'] = df.numerics.apply(
    lambda val_list: [val for val in val_list if val < 70]
).apply(
    lambda x: np.nan if len(x)==0 else x[0]
)

Finally, .measurements and .numerics are dropped, and the df should be ready to go.

df = df.drop(columns=['measurements', 'numerics'])

        dream_career          name unique_code  height  weight
0          Scientist        Rachel       100AX  100.25    24.3
1          Astronaut         Tuuli         NaN  100.50    30.1
2  Software Engineer       Michael      7771AX  113.75    28.0
3             Doctor  Vivien Ester        40AX  115.20     NaN
4   Fashion Designer        Oliver         NaN     NaN    40.5
5            Teacher        Julien       678AX  111.10    35.1
6          Architect           Bee        88AX  100.80    20.0
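As a rough consolidation of the steps above, the same columns might also be pulled out with pandas string methods applied to the raw strings. This is a sketch under the rules stated in the question, not the answer's own method; `first_match`, `code_rx`, and the three-row sample are assumptions made for illustration.

```python
import pandas as pd

def first_match(values, predicate):
    """Hypothetical helper: first value satisfying predicate, else NaN."""
    return next((v for v in values if predicate(v)), float("nan"))

df = pd.DataFrame({
    "dream_career": ["Scientist", "Doctor", "Fashion Designer"],
    "measurements": ["Rachel 24.3 100.25 100 AX",
                     "Vivien Ester 40AX 115.20",
                     "Oliver 40.5"],
})

code_rx = r"(?<![\d.])\d+\s*AX\b"
raw = df["measurements"]

# Unique code: an integer, optional whitespace, then "AX"; spaces are dropped.
df["unique_code"] = (
    raw.str.extract(f"({code_rx})", expand=False)
       .str.replace(r"\s+", "", regex=True)
)

# Strip the code so its digits are not mistaken for a measurement.
rest = raw.str.replace(code_rx, "", regex=True)

df["name"] = rest.str.findall(r"[A-Za-z]+").str.join(" ")

nums = rest.str.findall(r"\d+(?:\.\d+)?").apply(
    lambda found: [float(x) for x in found]
)
df["height"] = nums.apply(lambda xs: first_match(xs, lambda v: v >= 100))
df["weight"] = nums.apply(lambda xs: first_match(xs, lambda v: v < 70))

df = df.drop(columns="measurements")
print(df)
```

Removing the code before collecting the numbers avoids relying on the measurements being written as floats.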


Answer 1
1, accepted, 2021-04-01 22:09:02

Answer 2
0, 2021-04-01 23:13:07
