How to merge consecutive rows whose columns are NaN
I have this kind of data and it's driving me crazy. The source is a PDF file that I read with tabula to extract the tables. The problem is that some rows of the table are multiline in the document, and this is how my output looks.
> sub_df.iloc[85:95]
1 Acronym Meaning
86 ABC Aaaaa Bbbbb Ccccc
87 CDE Ccccc Ddddd Eeeee
88 NaN Fffff Ggggg
89 FGH NaN
90 NaN Hhhhh
91 IJK Iiiii Jjjjj Kkkkk
92 LMN Lllll Mmmmm Nnnnn
93 OPQ Ooooo Ppppp Qqqqq
94 RST Rrrrr Sssss Ttttt
95 UVZ Uuuuu Vvvvv Zzzzz
What I would like to get is something like this.
> sub_df.iloc[85:95]
1 Acronym Meaning
86 ABC Aaaaa Bbbbb Ccccc
87 CDE Ccccc Ddddd Eeeee
88 FGH Fffff Ggggg Hhhhh
91 IJK Iiiii Jjjjj Kkkkk
92 LMN Lllll Mmmmm Nnnnn
93 OPQ Ooooo Ppppp Qqqqq
94 RST Rrrrr Sssss Ttttt
95 UVZ Uuuuu Vvvvv Zzzzz
I am struggling with combine_first like this:
sub_df.iloc[[88]].combine_first(sub_df.iloc[[87]])
but the result is not what I am expecting.
Also, a solution with groupby would be appreciated.
Note: the index is not important and can be reset. I just want to join some consecutive rows whose columns are NaN and then dump the result to CSV, so I don't need it.
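For reference, the sample frame from the printout above can be reconstructed like this (the literal column name '1' and the exact NaN layout are read off the output shown; this reconstruction is my assumption):

```python
import numpy as np
import pandas as pd

# Rebuild the sample frame from the question; column '1' holds the
# original row numbers from the PDF extraction, and the FGH entry is
# split across three physical rows.
df = pd.DataFrame({
    '1': range(86, 96),
    'Acronym': ['ABC', 'CDE', np.nan, 'FGH', np.nan,
                'IJK', 'LMN', 'OPQ', 'RST', 'UVZ'],
    'Meaning': ['Aaaaa Bbbbb Ccccc', 'Ccccc Ddddd Eeeee',
                'Fffff Ggggg', np.nan, 'Hhhhh',
                'Iiiii Jjjjj Kkkkk', 'Lllll Mmmmm Nnnnn',
                'Ooooo Ppppp Qqqqq', 'Rrrrr Sssss Ttttt',
                'Uuuuu Vvvvv Zzzzz'],
})
print(df)
```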
Let's try this:
# Fill the missing Meaning from the row above, then drop the row whose
# Meaning is now a duplicate (the half-line that carried no acronym).
df = df.assign(Meaning = df['Meaning'].ffill())
mask = ~((df.Meaning.duplicated(keep='last')) & df.Acronym.isnull())
df = df[mask]
# Fill the missing Acronym from the row above, then re-join the fragments.
df = df.assign(Acronym = df['Acronym'].ffill())
df_out = df.groupby('Acronym').apply(lambda x: ' '.join(x['Meaning'].str.split(r'\s').sum())).reset_index()
Output: 输出:
Acronym 0
0 ABC Aaaaa Bbbbb Ccccc
1 CDE Ccccc Ddddd Eeeee
2 FGH Fffff Ggggg Hhhhh
3 IJK Iiiii Jjjjj Kkkkk
4 LMN Lllll Mmmmm Nnnnn
5 OPQ Ooooo Ppppp Qqqqq
6 RST Rrrrr Sssss Ttttt
7 UVZ Uuuuu Vvvvv Zzzzz
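Put together as a self-contained, slightly simplified sketch (sample frame reconstructed from the question; since the fragments are already space-separated strings, the group's meanings can be joined directly instead of splitting and re-summing):

```python
import numpy as np
import pandas as pd

# Sample frame reconstructed from the question.
df = pd.DataFrame({
    '1': range(86, 96),
    'Acronym': ['ABC', 'CDE', np.nan, 'FGH', np.nan,
                'IJK', 'LMN', 'OPQ', 'RST', 'UVZ'],
    'Meaning': ['Aaaaa Bbbbb Ccccc', 'Ccccc Ddddd Eeeee',
                'Fffff Ggggg', np.nan, 'Hhhhh',
                'Iiiii Jjjjj Kkkkk', 'Lllll Mmmmm Nnnnn',
                'Ooooo Ppppp Qqqqq', 'Rrrrr Sssss Ttttt',
                'Uuuuu Vvvvv Zzzzz'],
})

# Forward-fill Meaning, drop the now-duplicated half-line, then
# forward-fill Acronym so the leftover fragment joins its group.
df = df.assign(Meaning=df['Meaning'].ffill())
mask = ~(df.Meaning.duplicated(keep='last') & df.Acronym.isnull())
df = df[mask]
df = df.assign(Acronym=df['Acronym'].ffill())

# Join all Meaning fragments belonging to the same acronym.
df_out = (df.groupby('Acronym')['Meaning']
            .apply(lambda m: ' '.join(m))
            .reset_index())
print(df_out)
```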
This is a pretty tricky question: neither ffill nor bfill alone will work here.
s1 = ~(df.Acronym.isnull() | df.Meaning.isnull())  # True for intact rows
s = s1.astype(int).diff().ne(0).cumsum()  # run id: each block of bad lines gets a single id
bad = df[~s1]   # we only change the bad rows
good = df[s1]   # keep the good rows unchanged
bad = bad.groupby(s.loc[bad.index]).agg(
    {'1': 'first', 'Acronym': 'first',
     'Meaning': lambda x: ' '.join(x[x.notnull()])})
pd.concat([good, bad]).sort_index()
Out[107]:
1 Acronym Meaning
0 86 ABC Aaaaa Bbbbb Ccccc
1 87 CDE Ccccc Ddddd Eeeee
2 88 FGH Fffff Ggggg Hhhhh
5 91 IJK Iiiii Jjjjj Kkkkk
6 92 LMN Lllll Mmmmm Nnnnn
7 93 OPQ Ooooo Ppppp Qqqqq
8 94 RST Rrrrr Sssss Ttttt
9 95 UVZ Uuuuu Vvvvv Zzzzz
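The key trick above is turning each run of consecutive bad rows into its own group id with the diff/cumsum idiom; a minimal illustration of just that step:

```python
import pandas as pd

# True = intact row, False = part of a wrapped table line (rows 2-4).
s1 = pd.Series([True, True, False, False, False, True, True])

# A new id starts wherever the value changes, so every consecutive
# run of equal values shares one id.
group_id = s1.astype(int).diff().ne(0).cumsum()
print(group_id.tolist())   # [1, 1, 2, 2, 2, 3, 3]
```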
Here is an approach using numpy.where to do a conditional fill:
import numpy as np

# A missing Acronym belongs with the previous row when the previous
# Meaning is also missing; otherwise it belongs with the next row.
df['Acronym'] = np.where(df[['Acronym']].assign(Meaning=df.Meaning.shift()).isna().all(1),
                         df.Acronym.ffill(),
                         df.Acronym.bfill())
clean_meaning = df.dropna().groupby('Acronym')['Meaning'].apply(lambda x: ' '.join(x)).to_frame()
df_new = (df[['1', 'Acronym']]
          .drop_duplicates(subset=['Acronym'])
          .merge(clean_meaning,
                 left_on='Acronym',
                 right_index=True))
[out]
1 Acronym Meaning
0 86 ABC Aaaaa Bbbbb Ccccc
1 87 CDE Ccccc Ddddd Eeeee
2 88 FGH Fffff Ggggg Hhhhh
5 91 IJK Iiiii Jjjjj Kkkkk
6 92 LMN Lllll Mmmmm Nnnnn
7 93 OPQ Ooooo Ppppp Qqqqq
8 94 RST Rrrrr Sssss Ttttt
9 95 UVZ Uuuuu Vvvvv Zzzzz
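A runnable sketch of this answer, with the sample frame rebuilt from the question:

```python
import numpy as np
import pandas as pd

# Sample frame reconstructed from the question.
df = pd.DataFrame({
    '1': range(86, 96),
    'Acronym': ['ABC', 'CDE', np.nan, 'FGH', np.nan,
                'IJK', 'LMN', 'OPQ', 'RST', 'UVZ'],
    'Meaning': ['Aaaaa Bbbbb Ccccc', 'Ccccc Ddddd Eeeee',
                'Fffff Ggggg', np.nan, 'Hhhhh',
                'Iiiii Jjjjj Kkkkk', 'Lllll Mmmmm Nnnnn',
                'Ooooo Ppppp Qqqqq', 'Rrrrr Sssss Ttttt',
                'Uuuuu Vvvvv Zzzzz'],
})

# Conditional fill: if Acronym is missing AND the previous Meaning is
# also missing, the acronym arrived a line late -> take it from above
# (ffill); otherwise take it from below (bfill).
df['Acronym'] = np.where(
    df[['Acronym']].assign(Meaning=df.Meaning.shift()).isna().all(1),
    df.Acronym.ffill(),
    df.Acronym.bfill())

# Collect the Meaning fragments per acronym (dropna removes the row
# whose Meaning was empty), then attach the first row number back.
clean_meaning = (df.dropna()
                   .groupby('Acronym')['Meaning']
                   .apply(' '.join)
                   .to_frame())
df_new = (df[['1', 'Acronym']]
          .drop_duplicates(subset=['Acronym'])
          .merge(clean_meaning, left_on='Acronym', right_index=True))
print(df_new)
```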