How do I fill missing values in an unbalanced dataframe with zeros?

Question

I have a dataframe with two dimensions, A and B; however, this dataframe is unbalanced: some values are missing because the database does not include values for all possible combinations of A and B. I want to make sure the dataframe is balanced, with the missing elements filled with zeros.

I am grabbing the data for the dataframe from an sqlite database that I'm connecting to via SQLAlchemy, using the following code:

import sqlite3
import pandas as pd

connection = sqlite3.connect("my_database.db")
cursor = connection.cursor()
# Count the rows for each (factorA, factorB) combination present in the table
cursor.execute('SELECT factorA, factorB, COUNT(*) AS unique_driver_counts FROM block_optimizer_runs GROUP BY 1, 2')
results = cursor.fetchall()
results_df = pd.DataFrame(results, columns=['factorA', 'factorB', 'count'])

But the database doesn't include data for all possible combinations of factorA, factorB, and factorC. When a combination does not exist, the query returns no row for it; but in the dataframe I need these 'missing' values to be filled with a zero.

For example, what I currently get is:
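The behaviour described above can be reproduced without the real database. The following is a minimal sketch using an in-memory sqlite database; the table name is taken from the question, but the rows and the name `demo_df` are made up for illustration. It shows that a GROUP BY query only returns rows for combinations that actually occur, so an absent pair such as (cat, cabin) simply has no row.

```python
import sqlite3
import pandas as pd

# In-memory database with a few hypothetical rows (not the asker's data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE block_optimizer_runs (factorA TEXT, factorB TEXT)")
conn.executemany(
    "INSERT INTO block_optimizer_runs VALUES (?, ?)",
    [("dog", "house"), ("dog", "cabin"), ("dog", "cabin"), ("cat", "house")],
)

query = (
    "SELECT factorA, factorB, COUNT(*) AS n "
    "FROM block_optimizer_runs GROUP BY factorA, factorB"
)
rows = conn.execute(query).fetchall()
conn.close()

demo_df = pd.DataFrame(rows, columns=["factorA", "factorB", "count"])
# Only three of the four possible (factorA, factorB) pairs appear;
# there is no row at all for ("cat", "cabin").
print(demo_df)
```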

import pandas as pd
data = [['dog', 'house', 1], ['dog', 'apartment', 2], ['dog', 'trailer', 1], ['dog', 'cabin', 0], ['dog', 'shack', 1],
['cat', 'house', 3], ['cat', 'apartment', 1], ['cat', 'trailer', 0], ['cat', 'shack', 3],
['gecko', 'apartment', 3], ['gecko', 'trailer', 2], ['gecko', 'shack', 0] ]
df = pd.DataFrame(data, columns = ['factorA', 'factorB', 'count'])
df

But what I want is:

import pandas as pd
data = [['dog', 'house', 1], ['dog', 'apartment', 2], ['dog', 'trailer', 1], ['dog', 'cabin', 0], ['dog', 'shack', 1],
['cat', 'house', 3], ['cat', 'apartment', 1], ['cat', 'trailer', 0], ['cat', 'cabin', 0], ['cat', 'shack', 3],
['gecko', 'house', 0], ['gecko', 'apartment', 3], ['gecko', 'trailer', 2], ['gecko', 'cabin', 0], ['gecko', 'shack', 0]]
df = pd.DataFrame(data, columns = ['factorA', 'factorB', 'count'])
df

Can anyone help me figure out how to do this for an arbitrary dataset which may include more than two factors?

Answer 1

Use DataFrame.set_index with DataFrame.reindex and MultiIndex.from_product:

df = df.set_index(['factorA','factorB'])
df = df.reindex(pd.MultiIndex.from_product(df.index.levels), fill_value=0).reset_index()
print(df)
   factorA    factorB  count
0      cat  apartment      1
1      cat      cabin      0
2      cat      house      3
3      cat      shack      3
4      cat    trailer      0
5      dog  apartment      2
6      dog      cabin      0
7      dog      house      1
8      dog      shack      1
9      dog    trailer      1
10   gecko  apartment      3
11   gecko      cabin      0
12   gecko      house      0
13   gecko      shack      0
14   gecko    trailer      2
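Since the question asks about more than two factors, the reindex approach can be wrapped in a small helper. This is only a sketch: the function name `balance_with_zeros`, the third column `factorC`, and the sample rows are hypothetical, and it assumes each combination of factors appears at most once (which a GROUP BY query guarantees).

```python
import pandas as pd

def balance_with_zeros(df, factors, value_col="count"):
    """Return df with one row per combination of the observed factor values,
    filling combinations that were absent with 0."""
    # Observed values of each factor, in first-seen order.
    levels = [df[f].unique() for f in factors]
    full_index = pd.MultiIndex.from_product(levels, names=factors)
    return (
        df.set_index(factors)[value_col]
          .reindex(full_index, fill_value=0)
          .reset_index()
    )

# Hypothetical three-factor data: only 3 of the 2 * 2 * 2 = 8 combinations exist.
data = [
    ["dog", "house", "red", 1],
    ["dog", "cabin", "blue", 2],
    ["cat", "house", "blue", 3],
]
df3 = pd.DataFrame(data, columns=["factorA", "factorB", "factorC", "count"])

balanced = balance_with_zeros(df3, ["factorA", "factorB", "factorC"])
print(balanced)
```

Building the levels from each column's unique values, rather than from df.index.levels, is a design choice here: it keeps the first-seen order of the values and does not depend on the index having been set beforehand.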

Alternatively, use Series.unstack with fill_value=0 to add the missing combinations as zeros, then DataFrame.stack to return to long format:

df = (df.set_index(['factorA','factorB'])['count']
         .unstack(fill_value=0)
         .stack()
         .reset_index(name='count'))
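To see why the unstack/stack round trip fills the gaps, it may help to look at the intermediate wide table. The sketch below uses a small made-up subset of the example data; the names `small`, `wide`, and `long_df` are only for illustration.

```python
import pandas as pd

small = pd.DataFrame(
    [["dog", "house", 1], ["dog", "cabin", 0], ["cat", "house", 3], ["gecko", "cabin", 2]],
    columns=["factorA", "factorB", "count"],
)

# Wide form: one row per factorA, one column per factorB.
# Cells with no matching row in `small` are created and filled with 0.
wide = small.set_index(["factorA", "factorB"])["count"].unstack(fill_value=0)
print(wide)

# Stacking the wide table back to long form now yields every combination.
long_df = wide.stack().reset_index(name="count")
print(long_df)
```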

Answer 2

One way could be to use df.pivot:

In [1862]: res = df.pivot(index='factorA', columns='factorB').fillna(0).astype(int).stack().reset_index()

In [1863]: res
Out[1863]: 
   factorA    factorB  count
0      cat  apartment      1
1      cat      cabin      0
2      cat      house      3
3      cat      shack      3
4      cat    trailer      0
5      dog  apartment      2
6      dog      cabin      0
7      dog      house      1
8      dog      shack      1
9      dog    trailer      1
10   gecko  apartment      3
11   gecko      cabin      0
12   gecko      house      0
13   gecko      shack      0
14   gecko    trailer      2
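One caveat with this approach: DataFrame.pivot raises a ValueError when the same (factorA, factorB) pair appears more than once. If that can happen in the input, DataFrame.pivot_table with an aggregation function is one possible alternative. The sketch below is an assumption about how one might adapt the answer, not part of it; the data and the name `dupes` are made up.

```python
import pandas as pd

# Hypothetical data where ("dog", "house") appears twice.
dupes = pd.DataFrame(
    [["dog", "house", 1], ["dog", "house", 2], ["cat", "cabin", 3]],
    columns=["factorA", "factorB", "count"],
)

res = (
    dupes.pivot_table(index="factorA", columns="factorB", values="count",
                      aggfunc="sum", fill_value=0)  # sum repeated pairs, fill gaps with 0
         .stack()
         .reset_index(name="count")
)
print(res)
```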

2 answers

Answer 1
3 2021-04-08 06:20:50

Answer 2
2 accepted 2021-04-08 06:50:53
