
Split a Pandas DataFrame into multiple smaller DataFrames based on empty rows

I have a CSV file with a format like this:

Header 1, Header 2, Header 3
''          ''        ''
value 1,  value2,   value 3
value 1,  value2,   value 3
value 1,  value2,   value 3
''          ''        ''
value 1,  value 2,   value 3
value 1,  value 2,   value 3
value 1,  value 2,   value 3
 ''          ''        ''

I can read it into a pandas DataFrame, but the segments between empty rows (denoted by '') each need to be processed individually. What would be the simplest way to split them into smaller DataFrames based on the empty rows that separate them? I have quite a few of these segments to go through.

Would it be easier to split them into smaller DataFrames, or would it be even easier to remove each segment from the original DataFrame after processing it?

EDIT:

IanS's answer was correct, but in my case some of my files simply had no quotes in the empty rows, so the values were not strings. I modified his answer a little, and this worked for those files:

df['counter'] = df['Header 1'].isnull().cumsum()  # increments at every empty row
df = df[df['Header 1'].notnull()]  # remove empty rows
df.groupby('counter').apply(lambda g: g.iloc[0])  # placeholder processing: first row of each segment
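
As a minimal sketch of the NaN-based variant above, the following collects each segment into a list instead of applying a dummy function. The sample data and the names `is_empty`, `block_id`, and `segments` are assumptions made for illustration; it also assumes the empty rows were parsed as NaN rows rather than skipped.

```python
import numpy as np
import pandas as pd

# Hypothetical data: empty rows were parsed as NaN rather than as the string "''".
df = pd.DataFrame({
    "Header 1": [np.nan, "a1", "a2", np.nan, "b1", "b2", "b3", np.nan],
    "Header 2": [np.nan, 1, 2, np.nan, 3, 4, 5, np.nan],
})

is_empty = df["Header 1"].isnull()   # True on separator rows
block_id = is_empty.cumsum()         # increments at each separator row

# Drop the separator rows, then collect one DataFrame per block.
data = df[~is_empty]
segments = [group for _, group in data.groupby(block_id[~is_empty])]

for seg in segments:
    print(seg, end="\n\n")
```

Each element of `segments` keeps its original row index, so it can be traced back to the source file if needed.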

I think you can find the empty rows with str.contains, build a counter Series with cumsum, group by it, and then get the small DataFrames in a loop:

print(df['Header 1'].str.contains("''").cumsum())
0    1
1    1
2    1
3    1
4    2
5    2
6    2
7    2
8    3
Name: Header 1, dtype: int32

for idx, group in df.groupby(df['Header 1'].str.contains("''").cumsum()):
    print(idx)
    print(group[1:])  # skip the empty row that starts each group
1
  Header 1  Header 2    Header 3
1  value 1    value2     value 3
2  value 1    value2     value 3
3  value 1    value2     value 3
2
  Header 1   Header 2    Header 3
5  value 1    value 2     value 3
6  value 1    value 2     value 3
7  value 1    value 2     value 3
3
Empty DataFrame
Columns: [Header 1,  Header 2,  Header 3]
Index: []

If you want, you can create a dictionary of DataFrames:

dfs = {}
for idx, group in df.groupby(df['Header 1'].str.contains("''").cumsum()):
    dfs[idx] = group[1:]  # drop the empty row that starts each group
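
For a self-contained version of the same idea, here is a rough sketch that builds the dictionary in one pass and skips groups containing only a separator row (such as the trailing empty group in the output above). The sample data and the names `marker` and `block_id` are hypothetical; `regex=False` is used so the quotes are matched literally.

```python
import pandas as pd

# Hypothetical data mirroring the question: separator rows hold the literal string "''".
df = pd.DataFrame({
    "Header 1": ["''", "value 1", "value 1", "''", "value 1", "''"],
    "Header 2": ["''", "value 2", "value 2", "''", "value 2", "''"],
})

marker = df["Header 1"].str.contains("''", regex=False)  # True on separator rows
block_id = marker.cumsum()                               # counter that labels each block

# group.iloc[1:] drops the separator row that opens each block;
# blocks with nothing else in them (e.g. a trailing separator) are skipped.
dfs = {
    idx: group.iloc[1:]
    for idx, group in df.groupby(block_id)
    if len(group) > 1
}

for idx, part in dfs.items():
    print(idx)
    print(part)
```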

The simplest approach would be to add a counter that increments each time it encounters an empty row. You can then get your individual DataFrames via groupby.

df['counter'] = (df['Header 1'] == "''").cumsum()  # increments at every empty row
df = df[df['Header 1'] != "''"]  # remove empty rows
df.groupby('counter').apply(lambda g: g.iloc[0])  # dummy processing function

The last line applies your processing function to each DataFrame separately (I just used a dummy example).

Note that the exact condition for detecting empty rows (here df['Header 1'] == "''") should be adapted to your situation.
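
One way to keep the empty-row condition adaptable is to pass it in as a boolean mask. The sketch below is only an illustration: `split_on_empty_rows` is a hypothetical helper, not a pandas function, and the combined NaN-or-quotes condition is just one possible choice.

```python
import numpy as np
import pandas as pd


def split_on_empty_rows(df, is_empty):
    """Hypothetical helper: split df into a list of DataFrames.

    is_empty is a boolean Series that is True on the separator rows.
    """
    block_id = is_empty.cumsum()   # increments at each separator row
    data = df[~is_empty]           # keep only the real data rows
    return [group for _, group in data.groupby(block_id[~is_empty])]


# Hypothetical data mixing both kinds of empty row.
df = pd.DataFrame({"Header 1": ["''", "x", "y", np.nan, "z", "''"]})

# One possible condition: a row is empty if it is NaN or the literal "''".
is_empty = df["Header 1"].isnull() | (df["Header 1"] == "''")

parts = split_on_empty_rows(df, is_empty)
for part in parts:
    print(part, end="\n\n")
```

Swapping in a different `is_empty` mask is then the only change needed for files that mark empty rows differently.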

