How can I replace missing values from an unbalanced data frame with zeros?

I have a dataframe with two dimensions, A and B. However, this data frame is unbalanced: some of the values are missing because the database does not include values for all possible combinations of A and B. What I want to do is make sure the dataframe is balanced, with the missing elements filled with zeros.

I am grabbing the data for the dataframe from an sqlite database that I'm connecting to via SQLAlchemy, using the following code:

import sqlite3
import pandas as pd

connection = sqlite3.connect("my_database.db")
cursor = connection.cursor()
# set up a query that counts rows per (factorA, factorB) combination
cursor.execute('SELECT factorA, factorB, COUNT(*) AS unique_driver_counts FROM block_optimizer_runs GROUP BY 1, 2')
results = cursor.fetchall()
results_df = pd.DataFrame(results, columns=['factorA', 'factorB', 'count'])

But the database doesn't include data for all possible combinations of factorA, factorB, and factorC. When a combination does not exist, the query returns no row for it; but in the dataframe I need these 'missing' values to be filled with a zero.

For example, what I get is

import pandas as pd
data = [['dog', 'house', 1], ['dog', 'apartment', 2], ['dog', 'trailer', 1], ['dog', 'cabin', 0], ['dog', 'shack', 1],
['cat', 'house', 3], ['cat', 'apartment', 1], ['cat', 'trailer', 0], ['cat', 'shack', 3],
['gecko', 'apartment', 3], ['gecko', 'trailer', 2], ['gecko', 'shack', 0] ]
df = pd.DataFrame(data, columns = ['factorA', 'factorB', 'count'])
df

but what I want is

import pandas as pd
data = [['dog', 'house', 1], ['dog', 'apartment', 2], ['dog', 'trailer', 1], ['dog', 'cabin', 0], ['dog', 'shack', 1],
['cat', 'house', 3], ['cat', 'apartment', 1], ['cat', 'trailer', 0], ['cat', 'cabin', 0], ['cat', 'shack', 3],
['gecko', 'house', 0], ['gecko', 'apartment', 3], ['gecko', 'trailer', 2], ['gecko', 'cabin', 0], ['gecko', 'shack', 0]]
df = pd.DataFrame(data, columns = ['factorA', 'factorB', 'count'])
df

Can anyone help me figure out how to do this for an arbitrary dataset which may include more than two factors?

Use DataFrame.set_index with DataFrame.reindex and MultiIndex.from_product:

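# Move the factor columns into a MultiIndex
df = df.set_index(['factorA', 'factorB'])
# Build every combination of the observed levels; absent rows get count 0
full_index = pd.MultiIndex.from_product(df.index.levels, names=df.index.names)
df = df.reindex(full_index, fill_value=0).reset_index()
print(df)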
   factorA    factorB  count
0      cat  apartment      1
1      cat      cabin      0
2      cat      house      3
3      cat      shack      3
4      cat    trailer      0
5      dog  apartment      2
6      dog      cabin      0
7      dog      house      1
8      dog      shack      1
9      dog    trailer      1
10   gecko  apartment      3
11   gecko      cabin      0
12   gecko      house      0
13   gecko      shack      0
14   gecko    trailer      2
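
Since the question asks about more than two factors, the same reindex idea can be wrapped in a small helper. This is only a sketch: the function name `balance`, the `factorC` column and the three-row sample data are made up for illustration, and it assumes each combination of factors appears at most once (reindex needs a unique index).

```python
import pandas as pd

def balance(df, factors, value_col="count", fill_value=0):
    """Return df with one row per combination of the observed factor values."""
    # Every combination of the unique values seen in each factor column
    full_index = pd.MultiIndex.from_product(
        [df[f].unique() for f in factors], names=factors
    )
    return (
        df.set_index(factors)[[value_col]]
          .reindex(full_index, fill_value=fill_value)  # absent combinations get fill_value
          .reset_index()
    )

# Hypothetical three-factor data: only 3 of the 2 * 2 * 2 = 8 combinations exist
raw = pd.DataFrame(
    [["dog", "house", "red", 1],
     ["cat", "shack", "red", 3],
     ["cat", "house", "blue", 2]],
    columns=["factorA", "factorB", "factorC", "count"],
)
balanced = balance(raw, ["factorA", "factorB", "factorC"])
print(balanced)
```

Building the product from `df[f].unique()` rather than from `index.levels` keeps the rows in order of first appearance; pass your own lists of values instead if some factor value never occurs in the data at all.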

Or use Series.unstack with fill_value=0 to add the zeros, followed by DataFrame.stack:

df = (df.set_index(['factorA','factorB'])['count']
         .unstack(fill_value=0)
         .stack()
         .reset_index(name='count'))
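
For reference, here is a self-contained run of the unstack/stack variant starting from the unbalanced data in the question (the intermediate name `wide` is just for illustration). Going to a wide table creates a cell for every factorA/factorB pair, and stacking turns those cells back into rows.

```python
import pandas as pd

# Unbalanced data from the question: 12 of the 3 * 5 = 15 combinations are present
data = [['dog', 'house', 1], ['dog', 'apartment', 2], ['dog', 'trailer', 1], ['dog', 'cabin', 0], ['dog', 'shack', 1],
        ['cat', 'house', 3], ['cat', 'apartment', 1], ['cat', 'trailer', 0], ['cat', 'shack', 3],
        ['gecko', 'apartment', 3], ['gecko', 'trailer', 2], ['gecko', 'shack', 0]]
df = pd.DataFrame(data, columns=['factorA', 'factorB', 'count'])

# Wide table: one row per factorA, one column per factorB, missing cells set to 0
wide = df.set_index(['factorA', 'factorB'])['count'].unstack(fill_value=0)

# Back to long form: one row per (factorA, factorB) pair
balanced = wide.stack().reset_index(name='count')
print(balanced)
```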

Another way is to use df.pivot:

In [1862]: res = df.pivot(index='factorA', columns='factorB').fillna(0).astype(int).stack().reset_index()

In [1863]: res
Out[1863]: 
   factorA    factorB  count
0      cat  apartment      1
1      cat      cabin      0
2      cat      house      3
3      cat      shack      3
4      cat    trailer      0
5      dog  apartment      2
6      dog      cabin      0
7      dog      house      1
8      dog      shack      1
9      dog    trailer      1
10   gecko  apartment      3
11   gecko      cabin      0
12   gecko      house      0
13   gecko      shack      0
14   gecko    trailer      2
