Combine Consecutive Rows for given index values in Pandas DataFrame
I am extracting tables from a PDF with tabula-py. When a table row spans more than one line in the PDF, tabula-py converts that single row into multiple rows in the DataFrame. Here is a sample:
    Serial No.               Name        Type  Total
 0           1             Easter    Multiple     19
 1           2            Costeri  Roundabout     16
 2           3              Zhiop         Tee     16
 3           4              Nesss       Cross     10
 4           5       Uoar Lhahara         Tee     10
 5           6  Trino Nishra (KX)         Tee      9
 6           7         Old-FX Box       Cross      8
 7           8          Gardeners  Roundabout      8
 8           9         Max Detter  Roundabout      7
 9         NaN    Others (Asynco,         NaN    NaN
10          10              D+ E,       Cross      7
11         NaN               etc)         NaN    NaN
If you look at the sample, you will see that the rows at indices 9, 10, and 11 are actually a single row; it spanned multiple lines in the PDF table. The table has more than 100 rows, and this issue occurs in at least 12 places. In some places the row is split across 2 consecutive rows, and in others across 3. How can we merge those rows using their index values?
You can try:
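import numpy as np
# Give each fragment row a neighbouring serial number (next row first, then previous)
df['Serial No.'] = df['Serial No.'].bfill().ffill()
# Make Total a string column so it can be joined like the text columns
df['Total'] = df['Total'].astype(str).replace('nan', np.nan)
df_out = df.groupby('Serial No.', as_index=False).agg(lambda x: ''.join(x.dropna()))
# Convert Total back to a number
df_out['Total'] = df_out['Total'].replace('', np.nan).astype(float)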
Result:
print(df_out)
   Serial No.                      Name        Type  Total
0         1.0                    Easter    Multiple   19.0
1         2.0                   Costeri  Roundabout   16.0
2         3.0                     Zhiop         Tee   16.0
3         4.0                     Nesss       Cross   10.0
4         5.0              Uoar Lhahara         Tee   10.0
5         6.0          Trino Nishra(KX)         Tee    9.0
6         7.0                Old-FX Box       Cross    8.0
7         8.0                 Gardeners  Roundabout    8.0
8         9.0                Max Detter  Roundabout    7.0
9        10.0  Others (Asynco,D+ E,etc)       Cross    7.0
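For anyone who wants to try this without the PDF, here is a minimal, self-contained sketch of a variant of the approach above. It rebuilds only the last few rows of the sample by hand. The helper name `merge_fragments` is hypothetical. Unlike the snippet above, it joins the name fragments with a space and takes the first non-missing `Type` and `Total` per group, so `Total` never has to be converted to a string. These are illustrative choices, not the only way to do it.

```python
import numpy as np
import pandas as pd

# Hand-built copy of the last rows of the sample table (indices 7 to 11).
df = pd.DataFrame({
    "Serial No.": [8, 9, np.nan, 10, np.nan],
    "Name": ["Gardeners", "Max Detter", "Others (Asynco,", "D+ E,", "etc)"],
    "Type": ["Roundabout", "Roundabout", np.nan, "Cross", np.nan],
    "Total": [8, 7, np.nan, 7, np.nan],
})


def merge_fragments(frame, key="Serial No."):
    """Merge rows that share a serial number after filling the gaps.

    Hypothetical helper: fragment rows (missing `key`) take the next
    available serial number, or the previous one if none follows.
    """
    out = frame.copy()
    out[key] = out[key].bfill().ffill()
    return out.groupby(key, as_index=False).agg({
        # Join the non-missing text fragments in their original order.
        "Name": lambda s: " ".join(s.dropna()),
        # GroupBy "first" returns the first non-missing value in each group.
        "Type": "first",
        "Total": "first",
    })


merged = merge_fragments(df)
print(merged)
```

One thing to check on the full table: because `bfill()` runs first, a fragment row is attached to the next numbered row whenever one exists. The trailing `etc)` fragment is grouped correctly here only because it is the last row, so `ffill()` picks it up. If a three-line row sits in the middle of the table, its last fragment would be merged into the following serial number, so it is worth comparing the merged rows against the PDF.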