How can I replace missing values from an unbalanced data frame with zeros?
I have a dataframe with two dimensions, A and B; however, this data frame is unbalanced in that some of the values are missing, because the database does not include values for all possible combinations of A and B. What I want to do is make sure the dataframe is balanced, with the missing elements filled with zeros.
I am grabbing the data for the dataframe from an sqlite database that I'm connecting to via SQLAlchemy, using the following code:
import sqlite3
import pandas as pd
connection = sqlite3.connect("my_database.db")
cursor = connection.cursor()
# set up a query
cursor.execute('SELECT factorA, factorB, COUNT(*) as unique_driver_counts FROM block_optimizer_runs GROUP BY 1, 2')
results = cursor.fetchall()
results_df = pd.DataFrame(results, columns=['factorA', 'factorB', 'count'])
But the database doesn't include data for all possible combinations of factorA, factorB, and factorC. When a combination does not exist, the query returns no row for it; but in the dataframe I need these 'missing' values to be filled with a zero.
For example:
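To make the gap concrete, here is a minimal sketch of how a GROUP BY query leaves out combinations that never occur. The in-memory database, the table name `runs`, and the alias `n` are assumptions made up for this illustration, not the real schema.

```python
import sqlite3
import pandas as pd

# In-memory database with a hypothetical table, for illustration only.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE runs (factorA TEXT, factorB TEXT)")
connection.executemany(
    "INSERT INTO runs VALUES (?, ?)",
    [("dog", "house"), ("dog", "cabin"), ("cat", "house"), ("cat", "house")],
)

# Count rows per (factorA, factorB) pair, as in the query above.
query = "SELECT factorA, factorB, COUNT(*) AS n FROM runs GROUP BY 1, 2"
results_df = pd.read_sql_query(query, connection)
connection.close()

# Only pairs that occur in the table come back; ('cat', 'cabin') has no row.
print(results_df)
```

The result has a row for each pair that exists in the table and nothing at all for the absent pair, which is why the zeros have to be added on the pandas side.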
import pandas as pd
data = [['dog', 'house', 1], ['dog', 'apartment', 2], ['dog', 'trailer', 1], ['dog', 'cabin', 0], ['dog', 'shack', 1],
['cat', 'house', 3], ['cat', 'apartment', 1], ['cat', 'trailer', 0], ['cat', 'shack', 3],
['gecko', 'apartment', 3], ['gecko', 'trailer', 2], ['gecko', 'shack', 0] ]
df = pd.DataFrame(data, columns = ['factorA', 'factorB', 'count'])
df
but what I want is:
import pandas as pd
data = [['dog', 'house', 1], ['dog', 'apartment', 2], ['dog', 'trailer', 1], ['dog', 'cabin', 0], ['dog', 'shack', 1],
['cat', 'house', 3], ['cat', 'apartment', 1], ['cat', 'trailer', 0], ['cat', 'cabin', 0], ['cat', 'shack', 3],
['gecko', 'house', 0], ['gecko', 'apartment', 3], ['gecko', 'trailer', 2], ['gecko', 'cabin', 0], ['gecko', 'shack', 0]]
df = pd.DataFrame(data, columns = ['factorA', 'factorB', 'count'])
df
Can anyone help me figure out how to do this for an arbitrary dataset which may include more than two factors?
Use DataFrame.set_index with DataFrame.reindex and MultiIndex.from_product:
df = df.set_index(['factorA','factorB'])
df = df.reindex(pd.MultiIndex.from_product(df.index.levels), fill_value=0).reset_index()
print (df)
    factorA    factorB  count
0       cat  apartment      1
1       cat      cabin      0
2       cat      house      3
3       cat      shack      3
4       cat    trailer      0
5       dog  apartment      2
6       dog      cabin      0
7       dog      house      1
8       dog      shack      1
9       dog    trailer      1
10    gecko  apartment      3
11    gecko      cabin      0
12    gecko      house      0
13    gecko      shack      0
14    gecko    trailer      2
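Since the question asks about more than two factors, the same reindex idea might be wrapped in a small helper. This is only a sketch: the function name `balance_with_zeros`, its signature, and the three-factor sample data are hypothetical, and it assumes each combination of factors appears at most once in the input (reindex raises on a duplicated index).

```python
import pandas as pd

def balance_with_zeros(df, factor_cols, value_col="count"):
    """Return one row per combination of the factor columns.

    Combinations absent from df get value_col = 0. The name and
    signature are hypothetical, chosen for this sketch.
    """
    # Observed values of each factor, sorted for a stable row order.
    levels = [sorted(df[col].unique()) for col in factor_cols]
    # Cartesian product of all factor values, keeping the column names.
    full_index = pd.MultiIndex.from_product(levels, names=factor_cols)
    return (
        df.set_index(factor_cols)[value_col]
          .reindex(full_index, fill_value=0)
          .reset_index()
    )

# Hypothetical three-factor data: 3 of the 2 * 2 * 2 = 8 combinations exist.
data = [
    ["dog", "house", "day", 1],
    ["dog", "cabin", "night", 2],
    ["cat", "house", "night", 3],
]
df3 = pd.DataFrame(data, columns=["factorA", "factorB", "factorC", "count"])
balanced = balance_with_zeros(df3, ["factorA", "factorB", "factorC"])
print(balanced)
```

Passing `names=factor_cols` explicitly means the column names survive `reset_index` without relying on pandas inferring them from the index levels.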
Or use Series.unstack with fill_value=0 to add the zeros, followed by DataFrame.stack:
df = (df.set_index(['factorA','factorB'])['count']
.unstack(fill_value=0)
.stack()
.reset_index(name='count'))
One way could be to use df.pivot:
In [1862]: res = df.pivot(index='factorA', columns='factorB').fillna(0).astype(int).stack().reset_index()
In [1863]: res
Out[1863]:
    factorA    factorB  count
0       cat  apartment      1
1       cat      cabin      0
2       cat      house      3
3       cat      shack      3
4       cat    trailer      0
5       dog  apartment      2
6       dog      cabin      0
7       dog      house      1
8       dog      shack      1
9       dog    trailer      1
10    gecko  apartment      3
11    gecko      cabin      0
12    gecko      house      0
13    gecko      shack      0
14    gecko    trailer      2
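One caveat with df.pivot is that it raises an error when an index/column pair occurs more than once. If that can happen, a possible variation is DataFrame.pivot_table, which aggregates duplicates and accepts a fill_value. The sketch below assumes that summing duplicate counts is the desired behavior; the sample data with a repeated (dog, house) row is made up for illustration.

```python
import pandas as pd

# Hypothetical data with a repeated (dog, house) row.
data = [
    ["dog", "house", 1], ["dog", "house", 2], ["dog", "cabin", 1],
    ["cat", "house", 3], ["gecko", "cabin", 2],
]
df_dup = pd.DataFrame(data, columns=["factorA", "factorB", "count"])

res = (
    # Sum duplicate pairs and put 0 where a pair is absent.
    df_dup.pivot_table(index="factorA", columns="factorB", values="count",
                       aggfunc="sum", fill_value=0)
          # Back to long format: one row per (factorA, factorB) pair.
          .stack()
          .reset_index(name="count")
)
print(res)
```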