Merge str values in a DataFrame based on the index of a repeating str value?

Question

I researched my problem and I can't seem to find a solution. I am trying to transfer a large PDF document to an Excel table. When I extract the data to a table, it reads as follows (extracted table):

+---------------+-------+----------+
|    details    | text  |  volume  |
+---------------+-------+----------+
| 2018-001 - 01 | text1 | Vol. 1   |
| Public        | text1 | pp. 1-13 |
| PDF No.1      | text1 |          |
|               | text1 |          |
| 2018-001 - 02 | text2 | Vol. 1   |
| Public        | text2 | pp. 1-46 |
| PDF No.2      | text2 |          |
| 2018-001 - 03 | text3 | Vol. 1.1 |
| Public        | text3 | pp. 1-47 |
| PDF No.3      | text3 |          |
+---------------+-------+----------+

If a value in column 1 starts with "2018-001", I want to merge that row and all the rows that follow it into one row, until I reach the next "2018-001", as in the desired table in my example. I greatly appreciate any help. I am new to pandas and I'm trying to find a solution. Thank you. I will post my code as I go if I make some progress.

Desired table:

+-------------------------------+----------------+-------------------+
|            details            |      text      |      volume       |
+-------------------------------+----------------+-------------------+
| 2018-001 - 01 Public PDF No.1 | text1 (joined) | Vol. 1 pp. 1-13   |
| 2018-001 - 02 Public PDF No.2 | text2 (joined) | Vol. 1 pp. 1-46   |
| 2018-001 - 03 Public PDF No.3 | text3 (joined) | Vol. 1.1 pp. 1-47 |
+-------------------------------+----------------+-------------------+

Answer 1

When people ask for text, it's so that they can work with your data. They want something like data = pd.DataFrame(...), not ASCII art (although the ASCII table does help show what you'd like to accomplish, so it's not useless).
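For illustration, the extracted table from the question could be posted in a copy-pasteable form along these lines. This is only a sketch: representing the blank cells as None is an assumption about how the PDF extraction reports empty cells, not something the question states.

```python
import pandas as pd

# The extracted table from the question, one list per column.
# Blank cells are written as None (an assumption; the extraction
# step might produce empty strings instead).
data = pd.DataFrame({
    "details": ["2018-001 - 01", "Public", "PDF No.1", None,
                "2018-001 - 02", "Public", "PDF No.2",
                "2018-001 - 03", "Public", "PDF No.3"],
    "text": ["text1"] * 4 + ["text2"] * 3 + ["text3"] * 3,
    "volume": ["Vol. 1", "pp. 1-13", None, None,
               "Vol. 1", "pp. 1-46", None,
               "Vol. 1.1", "pp. 1-47", None],
})

print(data)
```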

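import pandas as pd
import numpy as np

data = pd.DataFrame(...)  # put the extracted table here

# Positions of the rows that open a new record, plus the frame length as the
# final boundary. na=False keeps blank (NaN) cells from counting as matches.
is_start = data['details'].str.startswith('2018-001', na=False)
slice_idxes = np.where(is_start)[0].tolist() + [data.shape[0]]

def idx_gen(idx_list):
    # Yield consecutive (start, stop) boundary pairs.
    for i in range(len(idx_list) - 1):
        yield idx_list[i], idx_list[i+1]

rows = []
for start, stop in idx_gen(slice_idxes):
    new_row = data.iloc[start:stop, :]
    # Join each column's strings with a space (str.cat skips missing values),
    # then turn the resulting Series into a one-row DataFrame.
    new_row = new_row.apply(lambda x: x.str.cat(sep=" ")).to_frame().transpose()
    rows.append(new_row)

# DataFrame.append was removed in pandas 2.0, so concatenate once at the end.
new_data = pd.concat(rows, ignore_index=True)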

This isn't very fast or efficient, but it should do the job.
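If speed becomes a concern, one possible alternative (a sketch, not the approach taken in this answer) is to label each row with a record number via a cumulative sum over the "starts with 2018-001" mask and let groupby do the joining. The names is_start, group_id, join_non_missing and the "record" label are illustrative, and the sample data again assumes blank cells are None.

```python
import pandas as pd

# Sample data from the question; blank cells are assumed to be None.
data = pd.DataFrame({
    "details": ["2018-001 - 01", "Public", "PDF No.1", None,
                "2018-001 - 02", "Public", "PDF No.2",
                "2018-001 - 03", "Public", "PDF No.3"],
    "text": ["text1"] * 4 + ["text2"] * 3 + ["text3"] * 3,
    "volume": ["Vol. 1", "pp. 1-13", None, None,
               "Vol. 1", "pp. 1-46", None,
               "Vol. 1.1", "pp. 1-47", None],
})

# True on every row that opens a new record.
is_start = data["details"].str.startswith("2018-001", na=False)

# The cumulative sum increases by one at each start row, so all rows of
# the same record share one id. Renamed to avoid clashing with "details".
group_id = is_start.cumsum().rename("record")

def join_non_missing(col):
    # Join the non-missing strings of one column with single spaces.
    return " ".join(col.dropna())

merged = data.groupby(group_id).agg(join_non_missing).reset_index(drop=True)
print(merged)
```

Rows that appear before the first "2018-001" row would all land in group 0; the sample has none, so that case is not handled here.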

Answer 1
0 accepted 2018-12-05 16:19:57
