How to split a pandas dataframe of different column sizes into separate dataframes?
I have a large pandas dataframe that has a different number of columns in different parts of the dataframe. Here is an example: [image: current dataframe example]
I would like to split the dataframe into multiple dataframes, based on the number of columns it has.
Example output: [image: output]
Thanks.
If you have a dataframe of, say, 10 columns and you want to put the records with 3 NaN values in a different result dataframe than those with 1 NaN, you can do this as follows:
# count the NaNs in each row
num_counts = df.isna().sum(axis='columns')
# group by this count and store each group's
# dataframe in a dictionary keyed by the count
results = dict()
for key, sub_df in df.groupby(num_counts):
    results[key] = sub_df
After executing this code, results contains subsets of df, where every row in a subset has the same number of NaNs (and therefore the same number of non-NaN values).
If you want to write your results to an Excel file, you just need to execute the following code:
with pd.ExcelWriter('sorted_output.xlsx') as writer:
    for key, sub_df in results.items():
        # if you want to avoid the detour of using dictionaries
        # just replace the previous line by
        # for key, sub_df in df.groupby(num_counts):
        sub_df.to_excel(
            writer,
            sheet_name=f'missing {key}',
            na_rep='',
            inf_rep='inf',
            float_format=None,
            index=True,
            index_label=None,  # None: fall back to the index names
            header=True)
Example:
import pandas as pd
# create an example dataframe
df = pd.DataFrame(dict(a=[1, 2, 3, 4, 5, 6], b=list('abbcac')))
# assigning to a new column label creates it; rows not listed stay NaN
df.loc[[2, 4, 5], 'c'] = list('xyz')
df.loc[[2, 3, 4], 'd'] = list('vxw')
df.loc[[1, 2], 'e'] = list('qw')
It looks like this:
   a  b    c    d    e
0  1  a  NaN  NaN  NaN
1  2  b  NaN  NaN    q
2  3  b    x    v    w
3  4  c  NaN    x  NaN
4  5  a    y    w  NaN
5  6  c    z  NaN  NaN
If you execute the code above on this dataframe, you get a dictionary with the following content:
0:
   a  b  c  d  e
2  3  b  x  v  w
1:
   a  b  c  d    e
4  5  a  y  w  NaN
2:
   a  b    c    d    e
1  2  b  NaN  NaN    q
3  4  c  NaN    x  NaN
5  6  c    z  NaN  NaN
3:
   a  b    c    d    e
0  1  a  NaN  NaN  NaN
The keys of the dictionary are the number of NaNs in a row, and the values are the dataframes that contain only the rows with that number of NaNs.
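The grouped dataframes above still carry every original column, including the ones that are empty for that group. If the goal is dataframes that really have different column counts, the following is a minimal sketch of one way to get there. It assumes the data is "ragged" in the sense that shorter rows are padded with NaN on the right, so rows with the same NaN count share the same populated columns. The frame `ragged`, its column names, and the dictionary `by_width` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical ragged data: shorter rows are padded with NaN on the right.
ragged = pd.DataFrame(
    [
        [1, 2, np.nan, np.nan],
        [3, 4, 5, np.nan],
        [6, 7, np.nan, np.nan],
        [8, 9, 10, 11],
    ],
    columns=["c0", "c1", "c2", "c3"],
)

# Number of populated (non-NaN) cells per row, used as the grouping key.
widths = ragged.notna().sum(axis="columns")

# One dataframe per width; columns that are entirely NaN
# within a group are dropped from that group's dataframe.
by_width = {
    int(width): part.dropna(axis="columns", how="all")
    for width, part in ragged.groupby(widths)
}

for width, part in by_width.items():
    print(f"{width} populated columns:")
    print(part)
```

If the NaNs are scattered rather than trailing, rows with the same count can populate different columns, and `dropna(how="all")` will then keep any column that has at least one value in the group.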
If I understand you correctly, you want to split one existing dataframe with n columns into ceil(n/5) dataframes, each with 5 columns, except the last one, which gets the remaining columns (n mod 5 of them when n is not a multiple of 5).
If that's the case, this will do the trick:
import pandas as pd
import math
max_cols = 5
dt = {"a": [1, 2, 3], "b": [6, 5, 3], "c": [8, 4, 2], "d": [8, 4, 0], "e": [1, 9, 5], "f": [9, 7, 9]}
df = pd.DataFrame(data=dt)
# take consecutive blocks of max_cols columns; the last block may be shorter
dfs = [df[df.columns[max_cols*i:max_cols*i+max_cols]] for i in range(math.ceil(len(df.columns)/max_cols))]
for el in dfs:
    print(el)
Output:
   a  b  c  d  e
0  1  6  8  8  1
1  2  5  4  4  9
2  3  3  2  0  5
   f
0  9
1  7
2  9
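The same column chunking can also be written with positional slicing, which avoids the explicit ceil computation. This is only a sketch of an alternative, not part of the answer above; the helper name `split_columns` is invented here, and the data is the same six-column example.

```python
import pandas as pd


def split_columns(frame, max_cols):
    """Yield consecutive column blocks of at most max_cols columns each."""
    # Step through the column positions in strides of max_cols;
    # the final slice is simply shorter if the columns do not divide evenly.
    for start in range(0, frame.shape[1], max_cols):
        yield frame.iloc[:, start:start + max_cols]


df = pd.DataFrame({
    "a": [1, 2, 3], "b": [6, 5, 3], "c": [8, 4, 2],
    "d": [8, 4, 0], "e": [1, 9, 5], "f": [9, 7, 9],
})

parts = list(split_columns(df, 5))
for part in parts:
    print(part)
```

Because the slices are positional, this version does not depend on the column labels being unique.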