I have the following dataframe (read from a tab-separated file with two string columns):
id1 id2
g1 ID:05434
g1 ID:05434
g1 NaN
g1 ID:05434|ID:38720|ID:33345
After doing
df1 = df[df['id2'].notnull()]
df2 = df1.drop_duplicates(['id1','id2'])
I got df2,
id1 id2
g1 ID:05434
g1 ID:05434|ID:38720|ID:33345
I want to expand this so that each ID gets its own row, still with only 2 columns, e.g.:
id1 id2
g1 ID:05434
g1 ID:05434
g1 ID:38720
g1 ID:33345
Is there an expand function for this?
Thanks in advance.
Use str.split with stack; to remove the NaNs, use DataFrame.dropna.
EDIT: Following the OP's comment, duplicates are now removed at the end, after sorting the values:
df2 = (df.dropna(subset=['id2'])            # drop rows where id2 is NaN
         .set_index('id1')['id2']
         .str.split('|', expand=True)       # one column per ID
         .stack()                           # back to one ID per row
         .reset_index(level=1, drop=True)
         .reset_index(name='id2')
         .sort_values(by=['id1', 'id2'])
         .drop_duplicates(['id1', 'id2']))
print (df2)
id1 id2
0 g1 ID:05434
2 g1 ID:38720
3 g1 ID:33345
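As a possible alternative (not part of the original answer), the same reshaping can be sketched with DataFrame.explode, assuming pandas 0.25 or newer is available. The sample frame below is rebuilt by hand from the data in the question, and the final reset_index is only there to tidy the index:

```python
import numpy as np
import pandas as pd

# Sample data copied from the question (assumption: both columns are strings).
df = pd.DataFrame({
    "id1": ["g1", "g1", "g1", "g1"],
    "id2": ["ID:05434", "ID:05434", np.nan, "ID:05434|ID:38720|ID:33345"],
})

df2 = (
    df.dropna(subset=["id2"])                          # drop rows where id2 is NaN
      .assign(id2=lambda d: d["id2"].str.split("|"))   # turn each string into a list of IDs
      .explode("id2")                                  # one row per list element
      .drop_duplicates(["id1", "id2"])                 # keep the first occurrence of each pair
      .reset_index(drop=True)                          # renumber rows 0..n-1
)
print(df2)
```

Without the sort step, the rows stay in order of first appearance (ID:05434, ID:38720, ID:33345); add .sort_values(by=['id1', 'id2']) before drop_duplicates if sorted output is wanted.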