How to concatenate / join these three dataframes
I have three dataframes: df_Male, df_Female, and df_Transgender.
Sample dataframes:
df_Male
continent avg_count_country avg_age
Asia 55 5
Africa 65 10
Europe 75 8
df_Female
continent avg_count_country avg_age
Asia 50 7
Africa 60 12
Europe 70 0
df_Transgender
continent avg_count_country avg_age
Asia 30 6
Africa 40 11
America 80 10
Currently I concatenate them like this:
frames = [df_Male, df_Female, df_Transgender]
df = pd.concat(frames, keys=['Male', 'Female', 'Transgender'])
As you can see, America is present only in df_Transgender; likewise, Europe is present in df_Male and df_Female but not in df_Transgender.
I need to concatenate them so that the result looks like the table below, but not manually, as there can be a huge number of rows:
               continent  avg_count_country  avg_age
Male        0       Asia                 55        5
            1     Africa                 65       10
            2     Europe                 75        8
            3    America                  0        0
Female      0       Asia                 50        7
            1     Africa                 60       12
            2     Europe                 70        0
            3    America                  0        0
Transgender 0       Asia                 30        6
            1     Africa                 40       11
            2    America                 80       10
            3     Europe                  0        0
So for any continent value missing from a given dataframe, avg_count_country and avg_age should be 0.
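To make the question reproducible, the sample frames can be built as in the sketch below. The values are copied from the tables above; the `keys`, `all_continents` and `missing` names are only illustrative. It also lists which continents each gender block lacks after the plain concat.

```python
import pandas as pd

# Sample data copied from the tables in the question
df_Male = pd.DataFrame({"continent": ["Asia", "Africa", "Europe"],
                        "avg_count_country": [55, 65, 75],
                        "avg_age": [5, 10, 8]})
df_Female = pd.DataFrame({"continent": ["Asia", "Africa", "Europe"],
                          "avg_count_country": [50, 60, 70],
                          "avg_age": [7, 12, 0]})
df_Transgender = pd.DataFrame({"continent": ["Asia", "Africa", "America"],
                               "avg_count_country": [30, 40, 80],
                               "avg_age": [6, 11, 10]})

keys = ["Male", "Female", "Transgender"]
df = pd.concat([df_Male, df_Female, df_Transgender], keys=keys)

# Continents that appear somewhere in the data but not in a given block
all_continents = set(df["continent"])
missing = {key: sorted(all_continents - set(df.xs(key)["continent"]))
           for key in keys}
print(missing)
```

These missing combinations are exactly the rows that need to be filled with 0.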
You can add a "gender" column before concatenating.
We use categorical data with groupby to calculate the Cartesian product of gender and continent. This should also yield performance benefits.
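df = pd.concat([df_Male.assign(gender='Male'),
                df_Female.assign(gender='Female'),
                df_Transgender.assign(gender='Transgender')])
# Categorical keys let groupby emit every gender/continent combination
for col in ['gender', 'continent']:
    df[col] = df[col].astype('category')
# observed=False keeps unobserved combinations; newer pandas warns that the default is changing
res = df.groupby(['gender', 'continent'], observed=False).first().fillna(0).astype(int)
print(res)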
                       avg_count_country  avg_age
gender      continent
Female      Africa                    60       12
            America                    0        0
            Asia                      50        7
            Europe                    70        0
Male        Africa                    65       10
            America                    0        0
            Asia                      55        5
            Europe                    75        8
Transgender Africa                    40       11
            America                   80       10
            Asia                      30        6
            Europe                     0        0
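Whether the unobserved combinations show up depends on the `observed` argument of `groupby`. The self-contained sketch below rebuilds the sample data from the question and compares both settings. The `full` and `seen` names are illustrative, and `ignore_index=True` is an added assumption to keep the row index unique.

```python
import pandas as pd

# Sample data copied from the question
df_Male = pd.DataFrame({"continent": ["Asia", "Africa", "Europe"],
                        "avg_count_country": [55, 65, 75],
                        "avg_age": [5, 10, 8]})
df_Female = pd.DataFrame({"continent": ["Asia", "Africa", "Europe"],
                          "avg_count_country": [50, 60, 70],
                          "avg_age": [7, 12, 0]})
df_Transgender = pd.DataFrame({"continent": ["Asia", "Africa", "America"],
                               "avg_count_country": [30, 40, 80],
                               "avg_age": [6, 11, 10]})

df = pd.concat([df_Male.assign(gender="Male"),
                df_Female.assign(gender="Female"),
                df_Transgender.assign(gender="Transgender")],
               ignore_index=True)
for col in ["gender", "continent"]:
    df[col] = df[col].astype("category")

# observed=False: one row per gender/continent pair, missing pairs become NaN -> 0
full = (df.groupby(["gender", "continent"], observed=False)
          .first().fillna(0).astype(int))

# observed=True: only the pairs that actually occur in the data
seen = df.groupby(["gender", "continent"], observed=True).first()

print(len(full), len(seen))
```

With the sample data, `full` has one row for each of the 3 x 4 combinations, while `seen` only has the nine original rows.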
You can reindex a bit.
from itertools import product
# Get rid of that number in the index, not sure why you'd need it
df.index = df.index.droplevel(-1)
# Add continents to the index
df = df.set_index('continent', append=True)
# Determine product of indices
ids = list(product(df.index.get_level_values(0).unique(), df.index.get_level_values(1).unique()))
# Reindex and fill missing with 0
df = df.reindex(ids).fillna(0).reset_index(level=-1)
df is now:
continent avg_count_country avg_age
Male Asia 55.0 5.0
Male Africa 65.0 10.0
Male Europe 75.0 8.0
Male America 0.0 0.0
Female Asia 50.0 7.0
Female Africa 60.0 12.0
Female Europe 70.0 0.0
Female America 0.0 0.0
Transgender Asia 30.0 6.0
Transgender Africa 40.0 11.0
Transgender Europe 0.0 0.0
Transgender America 80.0 10.0
If you want that other numeric index, you can use df.groupby(df.index).cumcount() to number the rows in each group.
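The same reindexing idea can also be written with `pd.MultiIndex.from_product`, which builds the product of the two index levels directly. This is a variation on the approach above rather than the answerer's exact code. The level names `gender` and `continent`, the `out` frame and the `n` column are assumptions made for illustration. Passing `fill_value=0` to `reindex` should also keep the integer dtype instead of producing floats.

```python
import pandas as pd

# Sample data copied from the question
df_Male = pd.DataFrame({"continent": ["Asia", "Africa", "Europe"],
                        "avg_count_country": [55, 65, 75],
                        "avg_age": [5, 10, 8]})
df_Female = pd.DataFrame({"continent": ["Asia", "Africa", "Europe"],
                          "avg_count_country": [50, 60, 70],
                          "avg_age": [7, 12, 0]})
df_Transgender = pd.DataFrame({"continent": ["Asia", "Africa", "America"],
                               "avg_count_country": [30, 40, 80],
                               "avg_age": [6, 11, 10]})

df = pd.concat([df_Male, df_Female, df_Transgender],
               keys=["Male", "Female", "Transgender"])

# Drop the 0..2 counter and index by (gender, continent) instead
df = df.droplevel(-1).set_index("continent", append=True)
df.index.names = ["gender", "continent"]

# Every gender paired with every continent, in order of first appearance
full_index = pd.MultiIndex.from_product(
    [df.index.get_level_values(0).unique(),
     df.index.get_level_values(1).unique()],
    names=["gender", "continent"],
)

# Missing pairs are filled with 0 directly
out = df.reindex(full_index, fill_value=0).reset_index(level="continent")

# Optional per-gender counter (hypothetical column name "n")
out["n"] = out.groupby(level=0).cumcount()
print(out)
```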
Making use of DataFrame.pivot, a slight modification to @jpp's answer allows you to avoid having to manually manipulate indices:
df = pd.concat([df_Male.assign(gender='Male'),
                df_Female.assign(gender='Female'),
                df_Transgender.assign(gender='Transgender')])
# Recent pandas versions require index/columns to be passed by keyword
df.pivot(index='gender', columns='continent').fillna(0).stack().astype(int)
                       avg_count_country  avg_age
gender      continent
Female      Africa                    60       12
            America                    0        0
            Asia                      50        7
            Europe                    70        0
Male        Africa                    65       10
            America                    0        0
            Asia                      55        5
            Europe                    75        8
Transgender Africa                    40       11
            America                   80       10
            Asia                      30        6
            Europe                     0        0
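A close cousin of the pivot approach is to set the two keys as the index and round-trip through `unstack` and `stack`. This is offered as a sketch, not as part of the answer above; the `wide` and `res` names are illustrative. Because `unstack` accepts `fill_value=0`, the `fillna(0).astype(int)` step should not be needed here.

```python
import pandas as pd

# Sample data copied from the question
df_Male = pd.DataFrame({"continent": ["Asia", "Africa", "Europe"],
                        "avg_count_country": [55, 65, 75],
                        "avg_age": [5, 10, 8]})
df_Female = pd.DataFrame({"continent": ["Asia", "Africa", "Europe"],
                          "avg_count_country": [50, 60, 70],
                          "avg_age": [7, 12, 0]})
df_Transgender = pd.DataFrame({"continent": ["Asia", "Africa", "America"],
                               "avg_count_country": [30, 40, 80],
                               "avg_age": [6, 11, 10]})

df = pd.concat([df_Male.assign(gender="Male"),
                df_Female.assign(gender="Female"),
                df_Transgender.assign(gender="Transgender")],
               ignore_index=True)

# Wide form: one row per gender, one column per (value, continent) pair;
# combinations absent from the data are filled with 0
wide = df.set_index(["gender", "continent"]).unstack(fill_value=0)

# Back to long form: one row per (gender, continent)
res = wide.stack()
print(res)
```

Depending on the pandas version, `stack` may emit a FutureWarning about its new implementation, and the row or column order may differ slightly from the output shown above.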