Can I fill NaN values in one column with specific list elements from another column?

Question

For example, I have the following dataframe (called items):

| index | itemID | maintopic | subtopics          |
|:----- |:------:|:---------:| ------------------:|
| 1     | 235    | FBR       | [FZ, 1RH, FL]      |
| 2     | 1787   | NaN       | [1RH, YRS, FZ, FL] |
| 3     | 2454   | NaN       | [FZX, 1RH, FZL]    |
| 4     | 3165   | NaN       | [YHS]              |

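I want to fill the NaN values in the maintopic column with the first element of the subtopics list that starts with a letter. Does anyone have an idea? (Question 1)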

I tried this, but it did not work:

import pandas as pd
import string
alphabet = list(string.ascii_lowercase)
    
items['maintopic'] = items['maintopic'].apply(lambda x : items['maintopic'].fillna(items['subtopics'][x][0]) if items['subtopics'][x][0].lower().startswith(tuple(alphabet)) else x)

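Advanced (Question 2): Even better would be to look at all elements of the subtopics list. If several elements share the same first letter, or even the same first and second letters, I would like to use that shared prefix. For example, row 2 has FZ and FL, so I want to fill maintopic with F for that row. Row 3 has FZX and FZL, so I want to fill maintopic with FZ. If this is too complicated, I would also be happy with an answer to Question 1.

I appreciate any help!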

Answer 1

Try:

from itertools import chain, combinations

import numpy as np  # needed for np.nan in choose()


def commonprefix(m):
    "Given a list of pathnames, returns the longest common leading component"
    if not m:
        return ""
    s1 = min(m)
    s2 = max(m)
    for i, c in enumerate(s1):
        if c != s2[i]:
            return s1[:i]
    return s1


def powerset(iterable, n=0):
    "powerset([1,2,3]) --> () (1,) (2,) (3,) (1,2) (1,3) (2,3) (1,2,3)"
    s = list(iterable)
    return chain.from_iterable(combinations(s, r) for r in range(n, len(s) + 1))


def choose(x):
    if not isinstance(x, list):
        return x

    if len(x) == 1:
        return x[0]

    filtered = [v for v in x if not v[0].isdigit()]
    if not filtered:
        return np.nan

    longest = ""
    for s in powerset(filtered, 2):
        pref = commonprefix(s)
        if len(pref) > len(longest):
            longest = pref

    return filtered[0] if longest == "" else longest


m = df["maintopic"].isna()
df.loc[m, "maintopic"] = df.loc[m, "subtopics"].apply(choose)
print(df)

Prints:

   index  itemID maintopic           subtopics
0      1     235       FBR       [FZ, 1RH, FL]
1      2    1787         F  [1RH, YRS, FZ, FL]
2      3    2454        FZ     [FZX, 1RH, FZL]
3      4    3165       YHS               [YHS]

Edit: added a check for list/float, so that cells that are not lists (such as NaN floats) are returned unchanged.
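
As a side note, the longest prefix shared by any subset of two or more strings is always shared by some pair within that subset, so the powerset loop can be narrowed to pairs. The following is a minimal sketch under that observation, using os.path.commonprefix from the standard library. The name longest_shared_prefix is hypothetical and not part of the answer above.

```python
from itertools import combinations
from os.path import commonprefix


def longest_shared_prefix(values):
    """Return the longest prefix shared by at least two strings
    that do not start with a digit, or "" if there is none."""
    # Hypothetical helper: skip empty strings and digit-initial entries.
    filtered = [v for v in values if v and not v[0].isdigit()]
    longest = ""
    for pair in combinations(filtered, 2):
        pref = commonprefix(list(pair))
        if len(pref) > len(longest):
            longest = pref
    return longest


print(longest_shared_prefix(["1RH", "YRS", "FZ", "FL"]))  # F
print(longest_shared_prefix(["FZX", "1RH", "FZL"]))       # FZ
```

This only changes the number of candidate groups examined; the fallback to the first filtered element when no prefix is shared would still need to be handled as in choose.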

Answer 2

For the first question, try this:

import pandas as pd
import numpy as np


def fill_value(sub):
    for i in sub:
        if i[0].isalpha():
            return i
    return sub[0]


data = {
    'maintopic': ['FBR', np.nan, np.nan, np.nan],
    'subtopic': [['FZ', '1RH', 'FL'] , ['1RH', 'YRS', 'FZ', 'FL'], ['FZX', '1RH', 'FZL'], ['YHS']]
}

df = pd.DataFrame(data)
print('Before\n', df)
df['maintopic'] = df.apply(
    lambda row: fill_value(row['subtopic']) if pd.isnull(row['maintopic']) else row['maintopic'],
    axis=1
)
print('\nAfter\n', df)

Output:

Before
   maintopic            subtopic
0       FBR       [FZ, 1RH, FL]
1       NaN  [1RH, YRS, FZ, FL]
2       NaN     [FZX, 1RH, FZL]
3       NaN               [YHS]

After
   maintopic            subtopic
0       FBR       [FZ, 1RH, FL]
1       YRS  [1RH, YRS, FZ, FL]
2       FZX     [FZX, 1RH, FZL]
3       YHS               [YHS]

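You can change the fill_value function to return whatever value you want to use for filling the NaN values. As written, it returns the first subtopic that starts with a letter, and falls back to the first element of the list if none does.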
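
As one illustration of such a change, the sketch below returns NaN instead of sub[0] when no element starts with a letter, and leaves non-list cells missing. The name fill_value_strict is hypothetical; this is an assumption about what a stricter variant might look like, not part of the original answer.

```python
import math


def fill_value_strict(sub):
    """Return the first element starting with a letter, or NaN if none exists."""
    # Assumption: cells that are not lists (e.g. NaN) should stay missing.
    # math.nan is an ordinary float NaN, which pandas treats as missing.
    if not isinstance(sub, list):
        return math.nan
    for item in sub:
        # item[:1] avoids an IndexError on empty strings.
        if item[:1].isalpha():
            return item
    return math.nan


print(fill_value_strict(['1RH', 'YRS', 'FZ', 'FL']))  # YRS
print(fill_value_strict(['1RH', '2AB']))              # nan
```

It can be dropped into the same df.apply call in place of fill_value.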

Answer 3

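You can do it like this: for each value in the subtopics list, collect all of its prefixes (substrings that start at the first character), build a counter from them, and then sort the counter's items by frequency. If two items have the same frequency, prefer the longer string.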

from collections import Counter
from functools import cmp_to_key

import pandas as pd


def get_main_topic_modified(m, l):
    # Keep an existing main topic; only missing values are filled.
    if not pd.isna(m):
        return m
    if len(l) == 1:
        return l[0]
    # Collect every proper prefix of every subtopic.
    res = []
    for s in l:
        il = [s[:i+1] for i in range(len(s)-1)]
        res.append(il)
    res = [item for s in res for item in s]
    c = Counter(res)
    l = list(c.items())

    # Most frequent prefix first; on equal counts, the longer prefix first.
    l.sort(key=cmp_to_key(lambda x, y: len(y[0])-len(x[0]) if x[1] == y[1] else y[1] - x[1]))

    return l[0][0]


df['maintopic'] = df[['maintopic', 'subtopics']].apply(
    lambda x: get_main_topic_modified(*x), axis=1)

Output:

  index itemID  maintopic            subtopics
0     1    235        FBR        [FZ, 1RH, FL]
1     2   1787          F   [1RH, YRS, FZ, FL]
2     3   2454         FZ      [FZX, 1RH, FZL]
3     4   3165        YHS                [YHS]
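
To see why row 3 resolves to FZ, the prefix counting described in this answer can be traced on a single list. This is a standalone sketch using only the standard library; the helper name proper_prefixes is hypothetical, and a key tuple passed to max stands in for the cmp_to_key comparator.

```python
from collections import Counter


def proper_prefixes(s):
    """All prefixes of s except s itself, mirroring range(len(s) - 1)."""
    return [s[:i + 1] for i in range(len(s) - 1)]


subtopics = ["FZX", "1RH", "FZL"]
counts = Counter(p for s in subtopics for p in proper_prefixes(s))
print(counts)

# Highest count wins; ties are broken by the longer prefix.
best = max(counts.items(), key=lambda kv: (kv[1], len(kv[0])))
print(best[0])  # FZ
```

Here F and FZ both occur twice, so the tie is resolved in favor of the longer FZ. Note that prefixes of entries starting with a digit, such as 1RH, are counted as well.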

Answer 1
1 Accepted 2021-04-27 19:05:32

Answer 2
1 2021-04-27 19:17:38

Answer 3
1 2021-04-27 20:45:57
