Combine consecutive rows for given index values in a Pandas DataFrame

Question

I am extracting tables from a PDF with tabula-py. When a table row spans more than one line in the PDF, tabula-py converts that single row into multiple rows in the DataFrame. Here is a sample:

    Serial No.  Name               Type        Total
0   1           Easter             Multiple    19
1   2           Costeri            Roundabout  16
2   3           Zhiop              Tee         16
3   4           Nesss              Cross       10
4   5           Uoar Lhahara       Tee         10
5   6           Trino Nishra (KX)  Tee         9
6   7           Old-FX Box         Cross       8
7   8           Gardeners          Roundabout  8
8   9           Max Detter         Roundabout  7
9   NaN         Others (Asynco,    NaN         NaN
10  10          D+ E,              Cross       7
11  NaN         etc)               NaN         NaN

In the sample, the rows at indices 9, 10, and 11 are actually a single row that was wrapped over several lines in the PDF. The table has more than 100 rows, and this problem occurs in at least 12 places: sometimes across 2 consecutive rows, sometimes across 3. How can I merge those rows using the index values?

Answer 1

You can try:

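import numpy as np

# Give every fragment row the serial number of the logical row it belongs to.
df['Serial No.'] = df['Serial No.'].bfill().ffill()
# Cast Total to str so it can be joined like the text columns, then restore real NaN.
df['Total'] = df['Total'].astype(str).replace('nan', np.nan)

# Concatenate the non-missing fragments of each column within a serial number.
df_out = df.groupby('Serial No.', as_index=False).agg(lambda x: ''.join(x.dropna()))
df_out['Total'] = df_out['Total'].replace('', np.nan, regex=True).astype(float)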
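
As an aside on the first step, here is a minimal sketch of how `bfill().ffill()` turns the `Serial No.` column into a grouping key. It uses only the serial numbers of rows 8 to 11 from the question's sample; the variable names `serial`, `after_bfill` and `group_key` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Serial numbers of rows 8 to 11 from the sample in the question.
serial = pd.Series([9, np.nan, 10, np.nan], index=[8, 9, 10, 11], name="Serial No.")

# bfill() copies the next valid value upward, so row 9 takes 10 from row 10.
after_bfill = serial.bfill()

# ffill() then fills what is still missing (the trailing row 11) from the row above.
group_key = after_bfill.ffill()

print(group_key.tolist())  # [9.0, 10.0, 10.0, 10.0]
```

Because `bfill()` runs first, a NaN row is attached to the numbered row below it whenever one exists. This fits the sample, where the fragment at index 9 precedes its numbered row. If a fragment in the middle of the table instead belongs to the numbered row above it, this key would assign it to the wrong row, so it is worth spot-checking a few of the merged rows.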

Result:

print(df_out)

   Serial No.                      Name        Type  Total
0         1.0                    Easter    Multiple   19.0
1         2.0                   Costeri  Roundabout   16.0
2         3.0                     Zhiop         Tee   16.0
3         4.0                     Nesss       Cross   10.0
4         5.0              Uoar Lhahara         Tee   10.0
5         6.0          Trino Nishra(KX)         Tee    9.0
6         7.0                Old-FX Box       Cross    8.0
7         8.0                 Gardeners  Roundabout    8.0
8         9.0                Max Detter  Roundabout    7.0
9        10.0  Others (Asynco,D+ E,etc)       Cross    7.0
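
Note that joining with `''` glues the fragments together without spaces ("Others (Asynco,D+ E,etc)"). The following is a possible variant, not part of the original answer, that joins the text fragments with a single space and takes the first non-missing `Total` per group. It is a sketch under two assumptions: tabula-py returned NaN for the empty cells, and each logical row has exactly one non-missing `Total`, as in the sample. The helper `join_fragments` and the name `merged` are hypothetical.

```python
import numpy as np
import pandas as pd

# Last four rows of the sample, rebuilt by hand (assumed dtypes: floats with
# NaN for the numeric columns, strings with NaN for the text columns).
df = pd.DataFrame(
    {
        "Serial No.": [9, np.nan, 10, np.nan],
        "Name": ["Max Detter", "Others (Asynco,", "D+ E,", "etc)"],
        "Type": ["Roundabout", np.nan, "Cross", np.nan],
        "Total": [7, np.nan, 7, np.nan],
    }
)


def join_fragments(col: pd.Series) -> str:
    """Join the non-missing pieces of one column with a single space."""
    return " ".join(col.dropna().astype(str))


# Same grouping key as in the answer: fill from below first, then from above.
df["Serial No."] = df["Serial No."].bfill().ffill()

merged = df.groupby("Serial No.", as_index=False).agg(
    {"Name": join_fragments, "Type": join_fragments, "Total": "first"}
)
# The key is whole-numbered after filling, so it can be cast back to int.
merged["Serial No."] = merged["Serial No."].astype(int)

print(merged)
```

With this variant the merged name reads "Others (Asynco, D+ E, etc)", and `Total` stays numeric throughout, so the round trip through strings is not needed.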

Answer 1: score 2, accepted, 2021-06-30 19:27:08
