How to create a list of unique IDs from a pandas column where lists of IDs are stored as strings
I have a pandas DataFrame df:
import pandas as pd
lst = [23682, 21963, 9711, 21175, 13022,1662,7399, 13679, 17654,4567,23608,2828, 1234]
lst_match = ['[21963]','[21175]', '[1662 7399 13679 ]','[17654 23608]','[2828]','0','0','0','0','0','0', '0','0' ]
df = pd.DataFrame(list(zip(lst, lst_match)),columns=['ID','ID_match'])
df
       ID           ID_match
0   23682            [21963]
1   21963            [21175]
2    9711  [1662 7399 13679]
3   21175      [17654 23608]
4   13022             [2828]
5    1662                  0
6    7399                  0
7   13679                  0
8   17654                  0
9    4567                  0
10  23608                  0
11   2828                  0
12   1234                  0
The values in the ID_match column are also IDs, but each cell holds them as a list written out as a string.
I want to build a DataFrame of unique IDs that contains every ID whose ID_match value is something other than 0, plus every ID mentioned in the ID_match column.
So my output DataFrame of unique IDs should look like this:
ID
0 23682
1 21963
2 9711
3 21175
4 13022
5 1662
6 7399
7 13679
8 17654
9 23608
10 2828
How can I do this with pandas?
Use:
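s = (df[df['ID_match'] != '0']
       .set_index('ID')['ID_match']
       .str.strip('[ ]')
       .str.split(r'\s+', expand=True)  # raw string avoids an invalid-escape warning
       .stack()
       .dropna())  # no-op when stack() has already dropped the empty cells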
print (s)
23682  0    21963
21963  0    21175
9711   0     1662
       1     7399
       2    13679
21175  0    17654
       1    23608
13022  0     2828
dtype: object
# Series.append was removed in pandas 2.0; pd.concat does the same job
vals = pd.concat([s.index.get_level_values(0).to_series(), s.astype(int)],
                 ignore_index=True).unique()
df = pd.DataFrame({'ID':vals})
print (df)
ID
0 23682
1 21963
2 9711
3 21175
4 13022
5 1662
6 7399
7 13679
8 17654
9 23608
10 2828
Explanation:
1. Keep only the rows whose ID_match value is not 0, using boolean indexing.
2. Turn the ID column into the index with set_index.
3. Remove the leading and trailing [ ] characters with strip.
4. Split the remaining values on whitespace and reshape them into one long Series with stack.
5. Get the first level of the MultiIndex with get_level_values and convert it with to_series.
6. Combine it with the Series s converted to integer.
7. Take the unique values and finally call the DataFrame constructor.
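The steps above can also be sketched without the MultiIndex. This is only a variant of the same idea, not the answer's own code: it swaps strip/split/stack for str.findall and Series.explode, and assumes pandas 0.25 or later (for explode). The names matched, found and result are illustrative.

```python
import pandas as pd

lst = [23682, 21963, 9711, 21175, 13022, 1662, 7399, 13679, 17654, 4567, 23608, 2828, 1234]
lst_match = ['[21963]', '[21175]', '[1662 7399 13679 ]', '[17654 23608]', '[2828]',
             '0', '0', '0', '0', '0', '0', '0', '0']
df = pd.DataFrame({'ID': lst, 'ID_match': lst_match})

# Keep only the rows whose ID_match is not the '0' placeholder
matched = df[df['ID_match'] != '0']

# Pull every run of digits out of each string, then give each ID its own row
found = matched['ID_match'].str.findall(r'\d+').explode().astype(int)

# IDs that have matches come first, then the matched IDs; duplicates are dropped
unique_ids = pd.concat([matched['ID'], found], ignore_index=True).drop_duplicates()
result = pd.DataFrame({'ID': unique_ids.reset_index(drop=True)})
print(result)
```

Because findall only looks for digits, this version does not depend on how the brackets and spaces are laid out inside each string.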
These look like string representations of lists, so you can use ast.literal_eval and itertools.chain:
from ast import literal_eval
from itertools import chain
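s = df['ID_match'].astype(str).str.replace(' ', ',').apply(literal_eval)
# ID_match holds strings here, so compare with '0'; comparing with the integer 0 would keep every row
mask = df['ID_match'].astype(str) != '0'
L = list(chain.from_iterable(s[mask]))
res = pd.DataFrame({'ID': df.loc[mask, 'ID'].tolist() + L})\
        .drop_duplicates().reset_index(drop=True)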
print(res)
ID
0 23682
1 21963
2 9711
3 21175
4 13022
5 1662
6 7399
7 13679
8 17654
9 23608
10 2828
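Replacing each space with a comma works for the sample data, but it would likely produce an unparsable string if a cell contained two spaces in a row or a space right after the opening bracket. A minimal sketch of a more whitespace-tolerant parser follows; parse_ids is a hypothetical helper, not part of the answer above, and it assumes every non-list cell (such as '0') should yield an empty list.

```python
from ast import literal_eval


def parse_ids(text):
    """Hypothetical helper: '[1662 7399 13679 ]' -> [1662, 7399, 13679]."""
    text = str(text).strip()
    # Anything that is not wrapped in brackets (e.g. the '0' placeholder) has no IDs
    if not (text.startswith('[') and text.endswith(']')):
        return []
    # split() with no argument collapses any run of whitespace
    tokens = text[1:-1].split()
    # Rebuild a valid Python list literal and let literal_eval parse it safely
    return literal_eval('[' + ','.join(tokens) + ']')


print(parse_ids('[1662 7399 13679 ]'))
print(parse_ids('[ 17654   23608]'))
print(parse_ids('0'))
```

With a helper like this, df['ID_match'].map(parse_ids) could stand in for the replace-and-literal_eval step before chaining the lists together.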