
Create new column based on values in LIST in other column

This question seems so simple, yet I'm having a lot of trouble with it and haven't seen it asked anywhere. I have a column that contains a different list in each row, and all I want to do is create a new column based on whether a specific value is in that list. Data looks like this:

Col1
[5,6,23,7,20,21]    
[0,7,20,21]
[3,4,5,23,7,20,21]
[2,3,23,7,20,21]
[3,4,5,23,7,20,21]

Each number corresponds to a specific value, so 0 = 'apple', 2 = 'grape', etc.

While there are multiple values in each list, I'm really only looking for certain values, specifically 0, 2, 4, 6, 16, 17.

So what I want to do is add a new column with the value that corresponds to the number found within Col1.

This is what the solution should be:

Col1               Col2
[5,6,23,7,20,21]   Pear
[0,7,20,21]        Apple
[3,4,5,23,7,20,21] Watermelon
[2,3,23,7,20,21]   Grape
[16,20,21]         Pineapple

I have tried:

df['Col2'] = np.where(0 in df['Col1'], 'Apple',
                np.where(2 in df['Col1'], 'Grape', 
                   np.where(4 in df['Col1'], 'Watermelon', )

And so on... But this defaults all values to Apple:

Col1               Col2
[5,6,23,7,20,21]   Apple
[0,7,20,21]        Apple
[3,4,5,23,7,20,21] Apple
[2,3,23,7,20,21]   Apple
[16,20,21]         Apple
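
A likely reason every row comes out as Apple: on a pandas Series, the `in` operator tests the index labels, not the values, so `0 in df['Col1']` is True whenever a row is labelled 0. A small sketch on a hypothetical two-row frame (not the question's data):

```python
import numpy as np
import pandas as pd

# Hypothetical two-row frame, for illustration only.
df = pd.DataFrame({'Col1': [[5, 6, 23], [2, 3, 23]]})

# `in` on a Series checks the index labels (0 and 1 here), not the lists.
label_check = 0 in df['Col1']

# Checking the lists themselves: no list here contains 0.
value_check = any(0 in lst for lst in df['Col1'])

# np.where receives one scalar condition, so it yields one value for all rows.
result = np.where(label_check, 'Apple', 'Other')
print(label_check, value_check, result)
```

Because the condition is a single boolean rather than one boolean per row, the same fruit is broadcast to the whole column.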

I was able to do it successfully by putting the above in a for loop, but I am running into issues. Code:

df['Col2'] = ''
for i in range(0,df.shape[0]):
   df['Col2'][i] = np.where(0 in df['Col1'][i], 'Apple',
                   np.where(2 in df['Col1'][i], 'Grape', 
                      np.where(4 in df['Col1'][i], 'Watermelon', )

I get the result I am looking for, but I am met with a warning:

<ipython-input-638-5dfd74b69688>:4: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

I assume the warning is because I have already created the blank column, but the only reason I did this is that I would get an error if I didn't create it. Furthermore, when I attempt to perform a simple df['Col2'].value_counts(), I get an error: TypeError: unhashable type: 'numpy.ndarray'. The result from value_counts() still shows up even though I get this error, which is odd.
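
The two symptoms probably have separate causes. `df['Col2'][i] = ...` is chained indexing, which is what usually triggers SettingWithCopyWarning, and `np.where` with a scalar condition returns a zero-dimensional numpy.ndarray, which is unhashable. A minimal sketch that sidesteps both, assuming only one fruit per row is wanted; the helper name `first_match` is made up for illustration, and the mapping holds only the three pairs shown in the attempt:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Col1': [[5, 6, 23, 7, 20, 21],
                            [0, 7, 20, 21],
                            [3, 4, 5, 23, 7, 20, 21]]})

# np.where with a scalar condition yields a 0-d array, not a plain str.
cell = np.where(0 in df['Col1'][1], 'Apple', 'Other')
is_zero_dim_array = isinstance(cell, np.ndarray) and cell.ndim == 0

# Only the pairs shown in the attempt; extend as needed.
mapping = {0: 'Apple', 2: 'Grape', 4: 'Watermelon'}

def first_match(lst):
    """Return the fruit for the first mapped number in lst, or None."""
    return next((mapping[n] for n in lst if n in mapping), None)

# Assigning the whole column in one step avoids chained indexing.
df['Col2'] = df['Col1'].apply(first_match)
counts = df['Col2'].value_counts()
print(df)
print(counts)
```

Note that `first_match` picks by position in the list, whereas the nested `np.where` calls pick by priority (0 before 2 before 4); the two only differ when a list holds more than one mapped number.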

I am not entirely sure how else to proceed. I've tried a bunch of other things to create this column, but none of them worked. Any advice appreciated!

Use explode:

d = {0: 'Apple', 2: 'Grape', 4: 'Watermelon', 6: 'Banana', 16: 'Pear', 17: 'Orange'}
df['Col2'] = df['Col1'].explode().map(d).dropna().groupby(level=0).apply(', '.join)
print(df)

# Output:
                       Col1        Col2
0     [5, 6, 23, 7, 20, 21]      Banana
1            [0, 7, 20, 21]       Apple
2  [3, 4, 5, 23, 7, 20, 21]  Watermelon
3     [2, 3, 23, 7, 20, 21]       Grape
4  [3, 4, 5, 23, 7, 20, 21]  Watermelon
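
The chain above can be read one stage at a time. The sketch below splits it into intermediate variables (the variable names are mine, chosen for illustration) on the same five rows:

```python
import pandas as pd

df = pd.DataFrame({'Col1': [[5, 6, 23, 7, 20, 21],
                            [0, 7, 20, 21],
                            [3, 4, 5, 23, 7, 20, 21],
                            [2, 3, 23, 7, 20, 21],
                            [3, 4, 5, 23, 7, 20, 21]]})
d = {0: 'Apple', 2: 'Grape', 4: 'Watermelon', 6: 'Banana', 16: 'Pear', 17: 'Orange'}

exploded = df['Col1'].explode()   # one row per list element, original index repeated
mapped = exploded.map(d)          # numbers that are not keys of d become NaN
matched = mapped.dropna()         # keep only the numbers of interest
joined = matched.groupby(level=0).apply(', '.join)  # regroup by original row label

# Assignment aligns on the index, so rows without any match would get NaN.
df['Col2'] = joined
print(df)
```

The regrouping step relies on each original row having a unique index label, since `groupby(level=0)` collects everything that shares a label.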

Loop through the values in each list, map them to the correct fruit, and ignore the unwanted ones. Set the result to NaN if there is no match. Use str.join to allow for multiple matches.

To apply this logic row-wise, use Series.apply:

import numpy as np

mapping = {0: 'Apple', 2: 'Grape', 4: 'Watermelon'}

df['Col2'] = df['Col1'].apply(lambda lst: ', '.join(mapping[n] for n in lst if n in mapping) or np.nan)

Output:

>>> df

                       Col1        Col2
0     [5, 6, 23, 7, 20, 21]         NaN
1            [0, 7, 20, 21]       Apple
2  [3, 4, 5, 23, 7, 20, 21]  Watermelon
3     [2, 3, 23, 7, 20, 21]       Grape
4  [3, 4, 5, 23, 7, 20, 21]  Watermelon
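
To see how the `or np.nan` part behaves: joining zero matches gives an empty string, which is falsy, so the expression falls through to NaN. A quick check on made-up lists (not from the question's data), with a helper named `to_fruit` purely for illustration:

```python
import numpy as np
import pandas as pd

mapping = {0: 'Apple', 2: 'Grape', 4: 'Watermelon'}

def to_fruit(lst):
    # '' (no matches) is falsy, so `or` returns np.nan instead.
    return ', '.join(mapping[n] for n in lst if n in mapping) or np.nan

# Made-up lists covering two matches, one match, and no match.
sample = pd.Series([[0, 2, 7], [4, 20], [5, 7]])
result = sample.apply(to_fruit)
print(result)
```

Rows with several mapped numbers end up with a comma-separated string, in the order the numbers appear in the list.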

Performance

Note that this should be faster than Corralien's solution.

Setup:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Col1': [[5, 6, 23, 7, 20, 21],
             [0, 7, 20, 21],
             [3, 4, 5, 23, 7, 20, 21],
             [2, 3, 23, 7, 20, 21],
             [3, 4, 5, 23, 7, 20, 21]]
})

mapping = {0: 'Apple', 2: 'Grape', 4: 'Watermelon'}

def number_to_fruit(lst):
    return ', '.join(mapping[n] for n in lst if n in mapping) or np.nan

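# Simulate a large DataFrame (5 rows repeated 20,000 times).
# ignore_index=False keeps the repeated index labels 0-4; pass
# ignore_index=True if every row needs its own label for groupby(level=0).
n = 20000
df = pd.concat([df]*n, ignore_index=False)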

>>> df.shape

(100000, 1)

Timings:

# Using apply. (I've added dropna for a more fair comparison)
>>> %timeit -n 10 df['Col1'].apply(number_to_fruit).dropna()

116 ms ± 7.74 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# Corralien's solution 
>>> %timeit -n 10 df['Col1'].explode().map(mapping).dropna().groupby(level=0).apply(', '.join)

710 ms ± 71 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
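
Outside IPython, a comparison along these lines could be run with the standard `timeit` module. This is only a sketch: it uses a smaller frame and unique index labels (`ignore_index=True`) so that both approaches return one result per row, and the wrapper names `with_apply` and `with_explode` are made up for illustration. Absolute numbers will vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

mapping = {0: 'Apple', 2: 'Grape', 4: 'Watermelon'}
base = pd.DataFrame({'Col1': [[5, 6, 23, 7, 20, 21],
                              [0, 7, 20, 21],
                              [3, 4, 5, 23, 7, 20, 21],
                              [2, 3, 23, 7, 20, 21],
                              [3, 4, 5, 23, 7, 20, 21]]})

# Smaller than the frame above; unique labels so groupby(level=0) works per row.
df = pd.concat([base] * 200, ignore_index=True)

def number_to_fruit(lst):
    return ', '.join(mapping[n] for n in lst if n in mapping) or np.nan

def with_apply(frame):
    return frame['Col1'].apply(number_to_fruit).dropna()

def with_explode(frame):
    return (frame['Col1'].explode().map(mapping).dropna()
            .groupby(level=0).apply(', '.join))

# Check that both approaches agree before comparing their timings.
same = with_apply(df).tolist() == with_explode(df).tolist()

t_apply = timeit.timeit(lambda: with_apply(df), number=5)
t_explode = timeit.timeit(lambda: with_explode(df), number=5)
print(f'same={same}  apply: {t_apply:.4f}s  explode: {t_explode:.4f}s')
```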


 