How can I add rows to a DataFrame from a list in pandas?
I have yearly information (COUNT) for countries stored in a DataFrame. However, some countries are missing in certain years.
If I have a complete list of countries, what is the best way to add them under the corresponding years and fill the missing COUNT values with 0?
DATE COUNTRY COUNTRY_ID COUNT
0 1980 United States 840 42
42 1980 Czech Republic 203 2
95 1980 Hungary 348 1
96 1980 Great Britain 826 1
97 1980 South Africa 710 1
98 1982 United States 840 42
140 1982 Paraguay 600 2
.
.
One way to do this is to build every (DATE, COUNTRY) combination, reindex the DataFrame on those combinations, and finally fill in the missing values.
import pandas as pd

# Assume that we want all years, not just the ones seen
years = range(df['DATE'].min(), df['DATE'].max()+1)
# get all combinations
idx = pd.MultiIndex.from_product([years, df['COUNTRY'].unique()], names=['DATE', 'COUNTRY'])
# reindex by first putting DATE and COUNTRY into the index
df1 = df.set_index(['DATE', 'COUNTRY']).reindex(idx).reset_index()
# Fill back in missing IDs
country_map = df.set_index('COUNTRY')['COUNTRY_ID'].drop_duplicates()
df1['COUNTRY_ID'] = df1.COUNTRY.map(country_map)
# fill in 0 for COUNT and convert back to int
df1['COUNT'] = df1['COUNT'].fillna(0).astype(int)
DATE COUNTRY COUNTRY_ID COUNT
0 1980 United States 840 42
1 1980 Czech Republic 203 2
2 1980 Hungary 348 1
3 1980 Great Britain 826 1
4 1980 South Africa 710 1
5 1980 Paraguay 600 0
6 1981 United States 840 0
7 1981 Czech Republic 203 0
8 1981 Hungary 348 0
9 1981 Great Britain 826 0
10 1981 South Africa 710 0
11 1981 Paraguay 600 0
12 1982 United States 840 42
13 1982 Czech Republic 203 0
14 1982 Hungary 348 0
15 1982 Great Britain 826 0
16 1982 South Africa 710 0
17 1982 Paraguay 600 2
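The reindex approach above takes its countries from `df['COUNTRY'].unique()`, so a country that never appears in the data is never added. If a separate complete list is available, as the question describes, a variation along the following lines might work. This is only a sketch: the `all_countries` table is hypothetical, "Canada" (124) is added purely for illustration, and only the years already present in the data are used.

```python
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame({
    "DATE": [1980, 1980, 1980, 1980, 1980, 1982, 1982],
    "COUNTRY": ["United States", "Czech Republic", "Hungary",
                "Great Britain", "South Africa", "United States", "Paraguay"],
    "COUNTRY_ID": [840, 203, 348, 826, 710, 840, 600],
    "COUNT": [42, 2, 1, 1, 1, 42, 2],
})

# Hypothetical complete list of countries and IDs (an assumption, not from
# the question); "Canada" never occurs in df.
all_countries = pd.DataFrame({
    "COUNTRY": ["United States", "Czech Republic", "Hungary", "Great Britain",
                "South Africa", "Paraguay", "Canada"],
    "COUNTRY_ID": [840, 203, 348, 826, 710, 600, 124],
})

# Only the years that already occur in the data.
years = df["DATE"].unique()

# Every (year, country) pair taken from the complete list.
idx = pd.MultiIndex.from_product(
    [years, all_countries["COUNTRY"]], names=["DATE", "COUNTRY"]
)

filled = (
    df.set_index(["DATE", "COUNTRY"])["COUNT"]
      .reindex(idx, fill_value=0)                      # missing pairs get COUNT 0
      .reset_index()
      .merge(all_countries, on="COUNTRY", how="left")  # attach IDs from the list
)
filled = filled[["DATE", "COUNTRY", "COUNTRY_ID", "COUNT"]]
```

Passing `fill_value=0` to `reindex` keeps COUNT as an integer column, so no `fillna(0).astype(int)` step is needed here. Taking the IDs from the complete list also covers countries that have no row in `df` at all.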
Also consider a cross join via merge (for those of us with a SQL mindset):
# ASSIGN KEY COLUMN
df['KEY'] = 1
# CREATE DF OF DATES RANGE
dates = pd.DataFrame({'DATE': list(range(df['DATE'].min(), df['DATE'].max() + 1)),
                      'COUNT': 0, 'KEY': 1})
# CROSS JOIN MERGE
mdf = df.merge(dates, on=['KEY'])
# REASSIGN COUNT (ZERO OUT ROWS WHOSE ORIGINAL YEAR DIFFERS FROM THE TARGET YEAR)
mdf.loc[mdf['DATE_x'] != mdf['DATE_y'], 'COUNT_x'] = 0
# CLEAN UP DF (COLS AND ROWS)
# SORT BY COUNT FIRST SO drop_duplicates KEEPS THE REAL COUNT, NOT A ZEROED COPY
mdf = mdf[['DATE_y', 'COUNTRY', 'COUNTRY_ID', 'COUNT_x']]\
    .rename(columns={'DATE_y': 'DATE', 'COUNT_x': 'COUNT'})\
    .sort_values('COUNT', ascending=False)\
    .drop_duplicates(['DATE', 'COUNTRY', 'COUNTRY_ID'])\
    .sort_values('DATE')\
    .reset_index(drop=True)
# RESULT FOR THE SAMPLE ROWS (ROW ORDER WITHIN A YEAR MAY DIFFER)
# DATE COUNTRY COUNTRY_ID COUNT
# 0 1980 United States 840 42
# 1 1980 Paraguay 600 0
# 2 1980 Czech Republic 203 2
# 3 1980 Hungary 348 1
# 4 1980 Great Britain 826 1
# 5 1980 South Africa 710 1
# 6 1981 United States 840 0
# 7 1981 Czech Republic 203 0
# 8 1981 Hungary 348 0
# 9 1981 Paraguay 600 0
# 10 1981 Great Britain 826 0
# 11 1981 South Africa 710 0
# 12 1982 South Africa 710 0
# 13 1982 Hungary 348 0
# 14 1982 Czech Republic 203 0
# 15 1982 United States 840 42
# 16 1982 Great Britain 826 0
# 17 1982 Paraguay 600 2
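On newer pandas versions (1.2 or later), `merge` accepts `how="cross"`, so the helper KEY column is not needed. The sketch below is one possible variant of the cross join idea rather than the answer's own code: it first builds the full year/country grid, then left-joins the observed counts onto it, which also avoids the de-duplication step. The names `years`, `countries`, `grid` and `result` are illustrative.

```python
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame({
    "DATE": [1980, 1980, 1980, 1980, 1980, 1982, 1982],
    "COUNTRY": ["United States", "Czech Republic", "Hungary",
                "Great Britain", "South Africa", "United States", "Paraguay"],
    "COUNTRY_ID": [840, 203, 348, 826, 710, 840, 600],
    "COUNT": [42, 2, 1, 1, 1, 42, 2],
})

# One row per year in the full range, and one row per distinct country.
years = pd.DataFrame({"DATE": range(df["DATE"].min(), df["DATE"].max() + 1)})
countries = df[["COUNTRY", "COUNTRY_ID"]].drop_duplicates()

# Cross join: every year paired with every country (requires pandas >= 1.2).
grid = years.merge(countries, how="cross")

# Left join the observed counts onto the grid; unmatched pairs get NaN.
result = grid.merge(df, on=["DATE", "COUNTRY", "COUNTRY_ID"], how="left")

# Replace the NaN counts with 0 and restore the integer dtype.
result["COUNT"] = result["COUNT"].fillna(0).astype(int)
```

Because each (DATE, COUNTRY) pair occurs exactly once in `grid`, the result has one row per pair as long as `df` itself has no duplicate pairs.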