Creating a new column based on values in a list in another column

Question

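This seems like a simple question, but I have had a lot of trouble with it and have not seen it asked anywhere. I have a column that contains a different list in each row, and all I want to do is create a new column based on whether a specific value is present in that list. The data looks like this: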

Col1
[5,6,23,7,20,21]    
[0,7,20,21]
[3,4,5,23,7,20,21]
[2,3,23,7,20,21]
[3,4,5,23,7,20,21]

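Each number corresponds to a specific value, so 0 = 'apple', 2 = 'grape', and so on.

Although each list contains multiple values, I am only looking for certain ones, specifically 0, 2, 4, 6, 16, 17.

So what I want to do is add a new column whose value corresponds to the number found in Col1.

This is what the result should look like: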

Col1               Col2
[5,6,23,7,20,21]   Pear
[0,7,20,21]        Apple
[3,4,5,23,7,20,21] Watermelon
[2,3,23,7,20,21]   Grape
[16,20,21]         Pineapple

I tried:

df['Col2'] = np.where(0 in df['Col1'], 'Apple',
                np.where(2 in df['Col1'], 'Grape', 
                   np.where(4 in df['Col1'], 'Watermelon', )

and so on... but this sets every value to Apple:

Col1               Col2
[5,6,23,7,20,21]   Apple
[0,7,20,21]        Apple
[3,4,5,23,7,20,21] Apple
[2,3,23,7,20,21]   Apple
[16,20,21]         Apple

I was able to get this working by putting the above in a for loop, but I ran into problems. Code:

df['Col2'] = ''
for i in range(0,df.shape[0]):
   df['Col2'][i] = np.where(0 in df['Col1'][i], 'Apple',
                   np.where(2 in df['Col1'][i], 'Grape', 
                      np.where(4 in df['Col1'][i], 'Watermelon', )

I get the result I am looking for, but I also get a warning:

<ipython-input-638-5dfd74b69688>:4: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

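I think the warning appears because I created the blank column first, but the only reason I did that is that I get an error if I don't create it. Also, when I try a simple df['Col2'].value_counts(), I get an error: TypeError: unhashable type: 'numpy.ndarray'. Even with this error, the result of value_counts() is still displayed, which is strange.

I'm not entirely sure how to proceed. I have tried many other ways to create this column, but none of them worked. Any suggestions are appreciated!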

Answer 1

Use explode:

d = {0: 'Apple', 2: 'Grape', 4: 'Watermelon', 6: 'Banana', 16: 'Pear', 17: 'Orange'}
df['Col2'] = df['Col1'].explode().map(d).dropna().groupby(level=0).apply(', '.join)
print(df)

# Output:
                       Col1        Col2
0     [5, 6, 23, 7, 20, 21]      Banana
1            [0, 7, 20, 21]       Apple
2  [3, 4, 5, 23, 7, 20, 21]  Watermelon
3     [2, 3, 23, 7, 20, 21]       Grape
4  [3, 4, 5, 23, 7, 20, 21]  Watermelon
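
The one-line chain above can be unpacked into its individual steps. The following is a minimal sketch, assuming the five-row sample frame from the question and the same dictionary d as above; the intermediate names (exploded, mapped, kept, joined) are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Col1': [[5, 6, 23, 7, 20, 21],
             [0, 7, 20, 21],
             [3, 4, 5, 23, 7, 20, 21],
             [2, 3, 23, 7, 20, 21],
             [3, 4, 5, 23, 7, 20, 21]]
})
d = {0: 'Apple', 2: 'Grape', 4: 'Watermelon', 6: 'Banana', 16: 'Pear', 17: 'Orange'}

# 1. One row per list element; the original row label is repeated for each element.
exploded = df['Col1'].explode()

# 2. Look every number up in the dictionary; numbers without an entry become NaN.
mapped = exploded.map(d)

# 3. Keep only the numbers that matched a fruit.
kept = mapped.dropna()

# 4. Collapse back to one value per original row label, joining multiple matches.
joined = kept.groupby(level=0).apply(', '.join)

# 5. Assignment aligns on the index; a row with no match at all would get NaN.
df['Col2'] = joined
print(df)
```

Because step 4 groups on the index labels, this approach relies on each row of the original frame having a unique index label.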

Answer 2

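Iterate over the values in each list, map them to the correct fruit, and ignore the ones you don't need. If there is no match, the result is set to NaN. Using str.join allows for the possibility of multiple matches.

To apply this logic row by row, use Series.apply: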
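
The per-list logic can be tried on plain Python lists before pandas is involved. This is a small sketch; the helper name fruits_in and the sample lists are hypothetical and chosen only to show the three cases (one match, several matches, no match).

```python
import math

mapping = {0: 'Apple', 2: 'Grape', 4: 'Watermelon'}

def fruits_in(lst):
    # Join the fruit names of every number that has an entry in mapping.
    # An empty string is falsy, so "or" falls back to NaN when nothing matched.
    return ', '.join(mapping[n] for n in lst if n in mapping) or float('nan')

single = fruits_in([0, 7, 20, 21])     # exactly one number is in mapping
several = fruits_in([0, 2, 9])         # two numbers are in mapping
nothing = fruits_in([5, 6, 23])        # no number is in mapping

print(single)
print(several)
print(nothing)
```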

import numpy as np

mapping = {0: 'Apple', 2: 'Grape', 4: 'Watermelon'}

df['Col2'] = df['Col1'].apply(lambda lst: ', '.join(mapping[n] for n in lst if n in mapping) or np.nan)

Output:

>>> df

                       Col1        Col2
0     [5, 6, 23, 7, 20, 21]         NaN
1            [0, 7, 20, 21]       Apple
2  [3, 4, 5, 23, 7, 20, 21]  Watermelon
3     [2, 3, 23, 7, 20, 21]       Grape
4  [3, 4, 5, 23, 7, 20, 21]  Watermelon
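
As for the symptoms described in the question, a likely explanation (a sketch, not taken from the original posts) is the following. The expression 0 in df['Col1'] asks whether 0 is an index label of the Series, not whether any list contains 0, so it is True for the whole column. np.where called with a scalar condition returns a zero-dimensional NumPy array rather than a string, which would explain the unhashable type error from value_counts(). The SettingWithCopyWarning is typically raised by chained assignment such as df['Col2'][i] = ...; writing through df.loc avoids it. The three-row frame below is made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Col1': [[5, 6, 23], [0, 7], [2, 3]]})

# "in" on a Series tests the index labels (0, 1, 2 here), not the stored lists.
index_hit = 0 in df['Col1']
# "in" on a single list tests its elements; the first list contains no 0.
value_hit = 0 in df['Col1'].iloc[0]

# With a scalar condition, np.where returns a 0-d array, not a plain str.
cell = np.where(0 in df['Col1'].iloc[1], 'Apple', 'Other')
cell_is_array = isinstance(cell, np.ndarray)
try:
    hash(cell)
    cell_is_hashable = True
except TypeError:
    cell_is_hashable = False

# Row-by-row assignment through .loc, storing plain Python strings.
df['Col2'] = ''
for i in range(len(df)):
    df.loc[i, 'Col2'] = 'Apple' if 0 in df.loc[i, 'Col1'] else 'Other'

print(index_hit, value_hit, cell_is_array, cell_is_hashable)
print(df)
```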

Performance

Note that this should be faster than Corralien's solution (the explode approach in Answer 1).

Setup:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Col1': [[5, 6, 23, 7, 20, 21],
             [0, 7, 20, 21],
             [3, 4, 5, 23, 7, 20, 21],
             [2, 3, 23, 7, 20, 21],
             [3, 4, 5, 23, 7, 20, 21]]
})

mapping = {0: 'Apple', 2: 'Grape', 4: 'Watermelon'}

def number_to_fruit(lst):
    return ', '.join(mapping[n] for n in lst if n in mapping) or np.nan

# Simulate a large DataFrame
n = 20000
df = pd.concat([df]*n, ignore_index=False)

>>> df.shape

(100000, 1)

Timings:

# Using apply. (I've added dropna for a more fair comparison)
>>> %timeit -n 10 df['Col1'].apply(number_to_fruit).dropna()

116 ms ± 7.74 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# Corralien's solution 
>>> %timeit -n 10 df['Col1'].explode().map(mapping).dropna().groupby(level=0).apply(', '.join)

710 ms ± 71 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
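
%timeit is an IPython magic, so the comparison above cannot be pasted into a plain script. A rough equivalent using the standard timeit module might look like the sketch below. The wrapper names with_apply and with_explode are hypothetical, the frame is kept smaller so the script finishes quickly, and, unlike the setup above, ignore_index=True is used here so that every row has a unique index label, which groupby(level=0) relies on. Absolute numbers will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

mapping = {0: 'Apple', 2: 'Grape', 4: 'Watermelon'}

base = pd.DataFrame({
    'Col1': [[5, 6, 23, 7, 20, 21],
             [0, 7, 20, 21],
             [3, 4, 5, 23, 7, 20, 21],
             [2, 3, 23, 7, 20, 21],
             [3, 4, 5, 23, 7, 20, 21]]
})
# Assumption: a unique index per row (differs from the setup in the answer).
big = pd.concat([base] * 2000, ignore_index=True)

def number_to_fruit(lst):
    return ', '.join(mapping[n] for n in lst if n in mapping) or np.nan

def with_apply(frame):
    return frame['Col1'].apply(number_to_fruit).dropna()

def with_explode(frame):
    return (frame['Col1'].explode().map(mapping).dropna()
            .groupby(level=0).apply(', '.join))

# Check that both approaches agree before timing them.
apply_result = with_apply(big)
explode_result = with_explode(big)
same_values = apply_result.tolist() == explode_result.tolist()
same_index = apply_result.index.tolist() == explode_result.index.tolist()
print('results agree:', same_values and same_index)

# Total seconds for 3 runs of each approach.
apply_seconds = timeit.timeit(lambda: with_apply(big), number=3)
explode_seconds = timeit.timeit(lambda: with_explode(big), number=3)
print(f'apply:   {apply_seconds:.3f} s')
print(f'explode: {explode_seconds:.3f} s')
```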
