Adding values for missing data combinations in Pandas

Question

I have a pandas data frame containing something like the following:

person_id   status    year    count
0           'pass'    1980    4
0           'fail'    1982    1
1           'pass'    1981    2

If I know that all possible values for each field are:

all_person_ids = [0, 1, 2]
all_statuses = ['pass', 'fail']
all_years = [1980, 1981, 1982]
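
To make the question reproducible, the sample frame can be built directly. This is only a sketch: it assumes the status values are plain strings without the literal quotes shown in the table, and that the frame is called `df`, the name the answers use.

```python
import pandas as pd

# Sample data from the question (quotes around status dropped: an assumption).
df = pd.DataFrame({
    "person_id": [0, 0, 1],
    "status": ["pass", "fail", "pass"],
    "year": [1980, 1982, 1981],
    "count": [4, 1, 2],
})

all_person_ids = [0, 1, 2]
all_statuses = ["pass", "fail"]
all_years = [1980, 1981, 1982]

# The completed frame should have one row per combination of the three fields.
expected_rows = len(all_person_ids) * len(all_statuses) * len(all_years)
print(expected_rows)  # 18
```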

I'd like to populate the original data frame with count=0 for the missing combinations of person_id, status, and year, i.e. I'd like the new data frame to contain:

person_id   status    year    count
0           'pass'    1980    4
0           'pass'    1981    0
0           'pass'    1982    0
0           'fail'    1980    0
0           'fail'    1981    0
0           'fail'    1982    1
1           'pass'    1980    0
1           'pass'    1981    2
1           'pass'    1982    0
1           'fail'    1980    0
1           'fail'    1981    0
1           'fail'    1982    0
2           'pass'    1980    0
2           'pass'    1981    0
2           'pass'    1982    0
2           'fail'    1980    0
2           'fail'    1981    0
2           'fail'    1982    0

Is there an efficient way to achieve this in pandas?

Answer 1

You can use itertools.product to generate all combinations, construct a df from them, merge it with your original df, and then use fillna to fill the missing count values with 0:

In [77]:
import itertools
import pandas as pd
all_person_ids = [0, 1, 2]
all_statuses = ['pass', 'fail']
all_years = [1980, 1981, 1982]
combined = [all_person_ids, all_statuses, all_years]
df1 = pd.DataFrame(columns = ['person_id', 'status', 'year'], data=list(itertools.product(*combined)))
df1

Out[77]:
    person_id status  year
0           0   pass  1980
1           0   pass  1981
2           0   pass  1982
3           0   fail  1980
4           0   fail  1981
5           0   fail  1982
6           1   pass  1980
7           1   pass  1981
8           1   pass  1982
9           1   fail  1980
10          1   fail  1981
11          1   fail  1982
12          2   pass  1980
13          2   pass  1981
14          2   pass  1982
15          2   fail  1980
16          2   fail  1981
17          2   fail  1982

In [82]:    
df1 = df1.merge(df, how='left').fillna(0)
df1

Out[82]:
    person_id status  year  count
0           0   pass  1980      4
1           0   pass  1981      0
2           0   pass  1982      0
3           0   fail  1980      0
4           0   fail  1981      0
5           0   fail  1982      1
6           1   pass  1980      0
7           1   pass  1981      2
8           1   pass  1982      0
9           1   fail  1980      0
10          1   fail  1981      0
11          1   fail  1982      0
12          2   pass  1980      0
13          2   pass  1981      0
14          2   pass  1982      0
15          2   fail  1980      0
16          2   fail  1981      0
17          2   fail  1982      0
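
One detail worth checking: a left merge leaves NaN in the unmatched rows, which turns an integer count column into floats, so after fillna(0) the values may print as 4.0 rather than 4. The following is a self-contained sketch of the same approach that casts the column back to integers. The names `keys`, `full`, and `result` are made up for illustration, and the sample frame assumes unquoted status strings.

```python
import itertools
import pandas as pd

# Sample data from the question (unquoted status strings are an assumption).
df = pd.DataFrame({
    "person_id": [0, 0, 1],
    "status": ["pass", "fail", "pass"],
    "year": [1980, 1982, 1981],
    "count": [4, 1, 2],
})

keys = ["person_id", "status", "year"]
combos = itertools.product([0, 1, 2], ["pass", "fail"], [1980, 1981, 1982])
full = pd.DataFrame(list(combos), columns=keys)

# Left merge keeps every combination; unmatched rows get NaN in "count".
result = full.merge(df, on=keys, how="left")

# Fill only the count column, then restore the integer dtype.
result["count"] = result["count"].fillna(0).astype(int)
print(result.head(3))
```

Filling only the count column, rather than the whole frame, avoids writing 0 into other columns that might legitimately contain missing values.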

Answer 2

Create a MultiIndex with MultiIndex.from_product(), then chain set_index(), reindex(), and reset_index():

import pandas as pd
import io

all_person_ids = [0, 1, 2]
all_statuses = ['pass', 'fail']
all_years = [1980, 1981, 1982]
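# io.StringIO is needed on Python 3, where the literal below is str, not bytes;
# sep=r"\s+" splits on runs of whitespace.
df = pd.read_csv(io.StringIO("""person_id   status    year    count
0           pass    1980    4
0           fail    1982    1
1           pass    1981    2"""), sep=r"\s+")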
names = ["person_id", "status", "year"]

mind = pd.MultiIndex.from_product(
    [all_person_ids, all_statuses, all_years], names=names)
df.set_index(names).reindex(mind, fill_value=0).reset_index()
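
Note that reindex() requires unique index labels, so this approach fails if the same (person_id, status, year) combination appears more than once in the original frame. One option is to aggregate first. This is a sketch under that assumption: the duplicated row in `df_dup` is invented for illustration, and summing duplicates is a guess at the intended semantics.

```python
import pandas as pd

names = ["person_id", "status", "year"]

# Hypothetical data: the (0, 'pass', 1980) combination appears twice.
df_dup = pd.DataFrame({
    "person_id": [0, 0, 0, 1],
    "status": ["pass", "pass", "fail", "pass"],
    "year": [1980, 1980, 1982, 1981],
    "count": [3, 1, 1, 2],
})

mind = pd.MultiIndex.from_product(
    [[0, 1, 2], ["pass", "fail"], [1980, 1981, 1982]], names=names)

# Collapse duplicates by summing, then reindex onto the full grid of combinations.
completed = (
    df_dup.groupby(names)["count"].sum()
    .reindex(mind, fill_value=0)
    .reset_index()
)
print(completed.head(3))
```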

Answer 3

You can use pyjanitor's complete method.

It accepts column names as well as {name: values} dictionaries giving the exhaustive list of values to complete:

import janitor
df.complete({'person_id': [0,1,2]}, 'status', 'year').fillna(0, downcast='infer')

Output:

    person_id  status  year  count
0           0  'fail'  1980      0
1           0  'fail'  1981      0
2           0  'fail'  1982      1
3           0  'pass'  1980      4
4           0  'pass'  1981      0
5           0  'pass'  1982      0
6           1  'fail'  1980      0
7           1  'fail'  1981      0
8           1  'fail'  1982      0
9           1  'pass'  1980      0
10          1  'pass'  1981      2
11          1  'pass'  1982      0
12          2  'fail'  1980      0
13          2  'fail'  1981      0
14          2  'fail'  1982      0
15          2  'pass'  1980      0
16          2  'pass'  1981      0
17          2  'pass'  1982      0

Answer 4

all_person_ids = [0, 1, 2]
all_statuses = ['pass', 'fail']
all_years = [1980, 1981, 1982]

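# Cross-merge the value lists to build every combination (how='cross' needs pandas >= 1.2),
# then left-merge the original data frame (named df1 in this answer) and fill the gaps with 0.
pd.Series(all_person_ids).to_frame('person_id').merge(pd.Series(all_statuses).to_frame('status'), how='cross')\
    .merge(pd.Series(all_years).to_frame('year'), how='cross')\
    .merge(df1,on=['person_id','status','year'], how='left')\
    .fillna(0)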

    person_id status  year  count
0           0   pass  1980    4.0
1           0   pass  1981    0.0
2           0   pass  1982    0.0
3           0   fail  1980    0.0
4           0   fail  1981    0.0
5           0   fail  1982    1.0
6           1   pass  1980    0.0
7           1   pass  1981    2.0
8           1   pass  1982    0.0
9           1   fail  1980    0.0
10          1   fail  1981    0.0
11          1   fail  1982    0.0
12          2   pass  1980    0.0
13          2   pass  1981    0.0
14          2   pass  1982    0.0
15          2   fail  1980    0.0
16          2   fail  1981    0.0
17          2   fail  1982    0.0
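
The chain of cross merges can be generalized to any number of columns. Below is a minimal sketch, assuming pandas 1.2 or later for how='cross'. The helper name `cross_grid` and the variable names are hypothetical, not part of pandas or of the answer above.

```python
from functools import reduce

import pandas as pd


def cross_grid(values_by_column):
    """Return a frame with one row per combination of the given column values."""
    frames = [pd.DataFrame({name: values})
              for name, values in values_by_column.items()]
    # Fold the single-column frames together with successive cross merges.
    return reduce(lambda left, right: left.merge(right, how="cross"), frames)


grid = cross_grid({
    "person_id": [0, 1, 2],
    "status": ["pass", "fail"],
    "year": [1980, 1981, 1982],
})

# Sample data from the question (unquoted status strings are an assumption).
original = pd.DataFrame({
    "person_id": [0, 0, 1],
    "status": ["pass", "fail", "pass"],
    "year": [1980, 1982, 1981],
    "count": [4, 1, 2],
})

filled = grid.merge(original, on=list(grid.columns), how="left")
filled["count"] = filled["count"].fillna(0)
print(len(filled))  # 18
```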

4 answers

Answer 1: score 9, 2015-08-03 12:16:11
Answer 2: score 8, accepted, 2015-08-03 12:20:37
Answer 3: score 2, 2022-04-05 08:50:06
Answer 4: score 1, 2022-11-25 08:29:48
