How do I clean a list, and a list of lists of elements, in a pandas dataframe?
Edited:
After writing this:
m = df.explode('ID1').groupby('ID1')['ID2'].agg(list)
I have the following dataframe:
Ref
45263 [['3105-BB', '3106-BB', '3201-BB', '3202-BB'],...
45256 [['3105-BB', '3106-BB', '3201-BB', '3202-BB'],...
48565 [['3159-CC', '3217-CC'], ['3159-CC', '3217-CC']]
49365 [['3159-CC', '3217-CC'], ['3159-CC', '3217-CC']]
47548 [['3107-CC', '3108-CC', '3201-CC', '3202-CC'],...
In the column on the right, how do I remove the list-of-lists brackets and the duplicates for each row? Ideally I'd like just a single list for each row.
E.g. for output:
Ref
45263 ['3105-BB', '3106-BB', '3201-BB', '3202-BB']
45256 ['3105-BB', '3106-BB', '3201-BB', '3202-BB']
48565 ['3159-CC', '3217-CC']
49365 ['3159-CC', '3217-CC']
47548 ['3107-CC', '3108-CC', '3201-CC', '3202-CC']
Afterwards I will use m in the following:
df['ID4'] = df['Ref'].map(m)
This will return the final dataframe I am looking for.
Use a set with the flattened values of the nested lists (note that a set does not keep the original order):
df['ID'] = df['ID'].apply(lambda x: list(set(z for y in x for z in y)))
If order is important, use the dict.fromkeys trick, since dict keys are unique and keep insertion order:
df['ID'] = df['ID'].apply(lambda x: list(dict.fromkeys([z for y in x for z in y]).keys()))
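The two lambdas above can also be written as one named helper, which may be easier to read and test. This is only a sketch: the name flatten_unique is hypothetical and not part of pandas, and it uses itertools.chain.from_iterable in place of the nested comprehension.

```python
from itertools import chain


def flatten_unique(nested):
    """Flatten a list of lists and drop duplicates, keeping first-seen order."""
    # chain.from_iterable yields every inner element in turn;
    # dict.fromkeys keeps only the first occurrence of each element,
    # and dicts preserve insertion order in Python 3.7+.
    return list(dict.fromkeys(chain.from_iterable(nested)))


row = [['3159-CC', '3217-CC'], ['3159-CC', '3217-CC']]
print(flatten_unique(row))
```

With the dataframe from the question, this could be applied as df['ID'].apply(flatten_unique).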
If the next step is map, you need to explode the lists first:
df = df.explode('ID').reset_index(drop=True)
print (df)
Ref ID
0 45263 3105-BB
1 45263 3106-BB
2 45263 3202-BB
3 45263 3201-BB
4 45256 3105-BB
5 45256 3106-BB
6 45256 3202-BB
7 45256 3201-BB
8 48565 3217-CC
9 48565 3159-CC
10 49365 3217-CC
11 49365 3159-CC
12 47548 3202-CC
13 47548 3108-CC
14 47548 3201-CC
15 47548 3107-CC
Sample:
df['ID1'] = df['ID'].apply(lambda x: list(set(z for y in x for z in y)))
df['ID2'] = df['ID'].apply(lambda x: list(dict.fromkeys([z for y in x for z in y]).keys()))
print (df)
Ref ID \
0 45263 [[3105-BB, 3106-BB, 3201-BB, 3202-BB]]
1 45256 [[3105-BB, 3106-BB, 3201-BB, 3202-BB]]
2 48565 [[3159-CC, 3217-CC], [3159-CC, 3217-CC]]
3 49365 [[3159-CC, 3217-CC], [3159-CC, 3217-CC]]
4 47548 [[3107-CC, 3108-CC, 3201-CC, 3202-CC]]
ID1 ID2
0 [3105-BB, 3106-BB, 3202-BB, 3201-BB] [3105-BB, 3106-BB, 3201-BB, 3202-BB]
1 [3105-BB, 3106-BB, 3202-BB, 3201-BB] [3105-BB, 3106-BB, 3201-BB, 3202-BB]
2 [3217-CC, 3159-CC] [3159-CC, 3217-CC]
3 [3217-CC, 3159-CC] [3159-CC, 3217-CC]
4 [3202-CC, 3108-CC, 3201-CC, 3107-CC] [3107-CC, 3108-CC, 3201-CC, 3202-CC]
EDIT:
f = lambda x: list(set(z for y in x for z in y))
df.explode('ID1').groupby('ID1')['ID2'].agg(f)
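Putting the pieces together, the whole flow from the question might look like the sketch below. The input frame is hypothetical, since the question does not show the original columns: it assumes ID1 holds lists of Ref values and ID2 holds lists of codes. It also swaps set for dict.fromkeys so that the order of the codes is kept.

```python
import pandas as pd

# Hypothetical input; the real frame is not shown in the question.
df = pd.DataFrame({
    "Ref": [45263, 48565, 49365],
    "ID1": [[45263, 45256], [48565, 49365], [48565, 49365]],
    "ID2": [["3105-BB", "3106-BB"],
            ["3159-CC", "3217-CC"],
            ["3159-CC", "3217-CC"]],
})

# Flatten the grouped lists and drop duplicates, keeping first-seen order.
f = lambda x: list(dict.fromkeys(z for y in x for z in y))

# One row per ID1 value, then collect the cleaned ID2 codes for each ID1.
m = df.explode("ID1").groupby("ID1")["ID2"].agg(f)

# Look up each Ref in m to attach its single, de-duplicated list.
df["ID4"] = df["Ref"].map(m)
print(df[["Ref", "ID4"]])
```

Because the cleaning happens inside agg, m already holds one flat list per key, so no second pass over the column is needed before map.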