
Combine Consecutive Rows for given index values in Pandas DataFrame

I am extracting tables from a PDF with tabula-py. Some table rows in the PDF span more than one line, and tabula-py converts each such row into multiple rows in the DataFrame. Here is a sample:

    Serial No.  Name               Type        Total
0   1           Easter             Multiple    19
1   2           Costeri            Roundabout  16
2   3           Zhiop              Tee         16
3   4           Nesss              Cross       10
4   5           Uoar Lhahara       Tee         10
5   6           Trino Nishra (KX)  Tee         9
6   7           Old-FX Box         Cross       8
7   8           Gardeners          Roundabout  8
8   9           Max Detter         Roundabout  7
9   NaN         Others (Asynco,    NaN         NaN
10  10          D+ E,              Cross       7
11  NaN         etc)               NaN         NaN

If you look at the sample, you will see that the rows at indices 9, 10, and 11 are actually a single row that was spread over multiple lines in the PDF table. The table has more than 100 rows, and this issue occurs in at least 12 places. In some places it affects 2 consecutive rows, and in others 3 consecutive rows. How can we merge those rows using the index values?

You can try:

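import numpy as np

# Give each fragment row the serial number of the row it belongs to
df['Serial No.'] = df['Serial No.'].bfill().ffill()
# Cast Total to str so it can be joined; turn the 'nan' strings back into real NaN
df['Total'] = df['Total'].astype(str).replace('nan', np.nan)

# Concatenate the non-null pieces of every column within each serial number
df_out = df.groupby('Serial No.', as_index=False).agg(lambda x: ''.join(x.dropna()))
df_out['Total'] = df_out['Total'].replace('', np.nan, regex=True).astype(float)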

Result:

print(df_out)

   Serial No.                      Name        Type  Total
0         1.0                    Easter    Multiple   19.0
1         2.0                   Costeri  Roundabout   16.0
2         3.0                     Zhiop         Tee   16.0
3         4.0                     Nesss       Cross   10.0
4         5.0              Uoar Lhahara         Tee   10.0
5         6.0          Trino Nishra(KX)         Tee    9.0
6         7.0                Old-FX Box       Cross    8.0
7         8.0                 Gardeners  Roundabout    8.0
8         9.0                Max Detter  Roundabout    7.0
9        10.0  Others (Asynco,D+ E,etc)       Cross    7.0
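
For anyone who wants to try this without the PDF, here is a minimal, self-contained sketch of the same idea. It rebuilds only the last four rows of the sample by hand. The helper name `merge_split_rows` and its parameters (`key`, `text_cols`, `sep`) are made up for illustration and are not part of pandas or tabula-py. As a variation on the snippet above, it joins text fragments with a space and keeps the first non-null value for the other columns, so `Total` does not need to be cast to string and back.

```python
import numpy as np
import pandas as pd

# Hand-built copy of rows 8-11 of the sample, as tabula-py returned them.
df = pd.DataFrame({
    "Serial No.": [9, np.nan, 10, np.nan],
    "Name": ["Max Detter", "Others (Asynco,", "D+ E,", "etc)"],
    "Type": ["Roundabout", np.nan, "Cross", np.nan],
    "Total": [7, np.nan, 7, np.nan],
})


def merge_split_rows(frame, key="Serial No.", text_cols=("Name", "Type"), sep=" "):
    """Hypothetical helper: merge rows that were split across several lines."""
    out = frame.copy()
    # A fragment above the numbered line takes the next number (bfill);
    # a trailing fragment with nothing after it takes the previous one (ffill).
    out[key] = out[key].bfill().ffill()
    # Join text fragments with `sep`; for every other column keep the
    # first non-null value found in the group.
    agg = {
        col: (lambda s: sep.join(s.dropna().astype(str))) if col in text_cols else "first"
        for col in out.columns
        if col != key
    }
    return out.groupby(key, as_index=False).agg(agg)


merged = merge_split_rows(df)
print(merged)
```

With this input, the three fragments of serial number 10 collapse into one row whose Name is "Others (Asynco, D+ E, etc)". One caveat that applies to the `bfill().ffill()` step in general: a NaN fragment that sits between two numbered rows is always attached to the following row, so this only works when the number lands on the last or middle line of a split row, as it does in the sample. If your PDF puts the number on the first line instead, `ffill()` alone is probably the safer choice.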
