
Pandas fill missing values in dataframe from another dataframe

I cannot find the pandas function (which I have seen before) that substitutes the NaNs in a dataframe with values from another dataframe (assuming a common index, which can be specified). Any help?

If you have two DataFrames of the same shape, then:

df[df.isnull()] = d2

will do the trick.

Visual representation

Only locations where df.isnull() evaluates to True (highlighted in green) are eligible for assignment.
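
As a minimal sketch of the boolean-mask assignment described above, the following uses made-up sample data (the values in df and d2 are assumptions for illustration, and both frames share the same index and columns):

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: two all-float frames with identical index and columns.
df = pd.DataFrame({"A": [1.0, np.nan], "B": [np.nan, 4.0]})
d2 = pd.DataFrame({"A": [10.0, 20.0], "B": [30.0, 40.0]})

# Assign through the boolean mask: only the positions where df is NaN
# take the corresponding value from d2; existing values are left alone.
df[df.isnull()] = d2

print(df)
```

The existing values 1.0 and 4.0 survive, while the two NaN cells are taken from d2.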

In practice, the DataFrames aren't always the same size or shape, and transforming methods (especially .shift()) are useful.

Incoming data is invariably dirty, incomplete, or inconsistent. Par for the course. There's a pretty extensive pandas tutorial and an associated cookbook for dealing with these situations.

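As I just learned, there is a DataFrame.combine_first() method that does exactly this, with the additional property that if your updating dataframe d2 is larger than the original df, the extra rows and columns are added as well.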

df = df.combine_first(d2)
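
A small sketch of that behaviour, using invented data in which d2 has one extra row ("z") and one extra column ("B") compared with df:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: d2 covers more rows and columns than df.
df = pd.DataFrame({"A": [1.0, np.nan]}, index=["x", "y"])
d2 = pd.DataFrame(
    {"A": [10.0, 20.0, 30.0], "B": [7.0, 8.0, 9.0]},
    index=["x", "y", "z"],
)

# Values from df win where they are non-NaN; gaps and any
# rows/columns missing from df are taken from d2.
result = df.combine_first(d2)

print(result)
```

Here the NaN at ("y", "A") is filled from d2, and the result also gains row "z" and column "B".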

This should be as simple as

df.fillna(d2)
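
A minimal sketch with invented data, to show two things that are easy to miss: fillna returns a new DataFrame rather than changing df, and the result keeps df's own rows and columns even when d2 is larger:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: d2 has an extra row and an extra column.
df = pd.DataFrame({"A": [1.0, np.nan]}, index=["x", "y"])
d2 = pd.DataFrame(
    {"A": [10.0, 20.0, 30.0], "B": [7.0, 8.0, 9.0]},
    index=["x", "y", "z"],
)

# fillna aligns d2 on df's index and columns and returns a new frame.
filled = df.fillna(d2)

print(filled)  # only rows x, y and column A
print(df)      # unchanged, still contains the NaN
```

To keep the result, assign it back, for example df = df.fillna(d2).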

A dedicated method for this is DataFrame.update:

Quoted from the documentation:

Modify in place using non-NA values from another DataFrame.
Aligns on indices. There is no return value.

Note that this method modifies your data in place, so it overwrites the DataFrame it is called on. By default (overwrite=True) it also replaces existing non-NaN values wherever the other DataFrame has a non-NA value; pass overwrite=False to fill only the NaN positions.

Example:

print(df1)
       A    B     C
aaa  NaN  1.0   NaN
bbb  NaN  NaN  10.0
ccc  3.0  NaN   6.0
ddd  NaN  NaN   NaN
eee  NaN  NaN   NaN

print(df2)
         A    B     C
index                
aaa    1.0  1.0   NaN
bbb    NaN  NaN  10.0
eee    NaN  1.0   NaN

# update df1 NaN where there are values in df2
df1.update(df2)
print(df1)
       A    B     C
aaa  1.0  1.0   NaN
bbb  NaN  NaN  10.0
ccc  3.0  NaN   6.0
ddd  NaN  NaN   NaN
eee  NaN  1.0   NaN

Notice the updated NaN values at the intersections (aaa, A) and (eee, B).
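
DataFrame.update takes an overwrite parameter (True by default). The sketch below uses invented data in which the two frames disagree on an existing value, to show the difference between the default and overwrite=False:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: "aaa" already has a value in base.
base = pd.DataFrame({"A": [1.0, np.nan]}, index=["aaa", "bbb"])
other = pd.DataFrame({"A": [100.0, 200.0]}, index=["aaa", "bbb"])

# Default (overwrite=True): every non-NA value in other replaces base's value.
overwritten = base.copy()
overwritten.update(other)

# overwrite=False: only positions that are NaN in base are filled.
filled_only = base.copy()
filled_only.update(other, overwrite=False)

print(overwritten)
print(filled_only)
```

With the default, the existing 1.0 is replaced by 100.0; with overwrite=False it is kept and only the NaN is filled.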

DataFrame.combine_first() answers this question exactly.

However, sometimes you want to fill/replace/overwrite some of the non-missing (non-NaN) values of DataFrame A with values from DataFrame B. That question brought me to this page, and the solution is DataFrame.mask():

A = B.mask(condition, A)

When condition is true, the values from A are used; otherwise B's values are used.

For example, you could solve the OP's original question with mask so that when an element of A is non-NaN it is used, and otherwise the corresponding element of B is used.

But with DataFrame.mask() you can also replace the values of A that fail to meet arbitrary criteria (less than zero? more than 100?) with values from B. So mask is more flexible, and overkill for this problem, but I thought it was worth mentioning (I needed it to solve my problem).

It's also worth noting that B can be a NumPy array instead of a DataFrame. DataFrame.combine_first() requires that B be a DataFrame, but DataFrame.mask() only requires that B be an NDFrame whose dimensions match A's.
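
A short sketch of both uses of mask, with invented data (the frames A and B and the "non-negative" criterion are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with one NaN and one negative value in A.
A = pd.DataFrame({"x": [1.0, np.nan, -5.0]})
B = pd.DataFrame({"x": [10.0, 20.0, 30.0]})

# The original question: keep A where it is non-NaN, otherwise use B.
filled = B.mask(A.notna(), A)

# An arbitrary criterion: keep A only where it is >= 0, otherwise use B.
# NaN >= 0 evaluates to False, so NaN cells are replaced as well.
non_negative = B.mask(A >= 0, A)

print(filled)
print(non_negative)
```

In the first result only the NaN is replaced; in the second, the negative value is replaced too.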

One important piece of information missing from the other answers is that both combine_first and fillna match on the index, so you have to make the indices match across the DataFrames for these methods to work.

Oftentimes, you need to match on some other column(s) to fill in missing values. In that case, use set_index first to turn the columns to be matched into the index.

df1 = df1.set_index(cols_to_be_matched).fillna(df2.set_index(cols_to_be_matched)).reset_index()

or

df1 = df1.set_index(cols_to_be_matched).combine_first(df2.set_index(cols_to_be_matched)).reset_index()
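
As a self-contained sketch of the set_index approach, here is the fillna variant on invented data. The key column "id" and the columns "price" and "qty" are hypothetical, and the keys are unique in both frames to keep the example simple:

```python
import numpy as np
import pandas as pd

# Hypothetical data: "id" is the key column shared by both frames.
df1 = pd.DataFrame(
    {"id": [1, 2, 3], "price": [np.nan, 5.0, np.nan], "qty": [1.0, np.nan, 3.0]}
)
df2 = pd.DataFrame({"id": [3, 1], "price": [30.0, 10.0]})

cols_to_be_matched = ["id"]

# Move the key into the index on both sides so fillna aligns on it,
# then restore it as an ordinary column.
result = (
    df1.set_index(cols_to_be_matched)
    .fillna(df2.set_index(cols_to_be_matched))
    .reset_index()
)

print(result)
```

The missing prices for ids 1 and 3 are filled from df2 regardless of row order, while the NaN in "qty" stays because df2 has no such column.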

Another option is to use merge:

df1 = (df1.merge(df2, on=cols_to_be_matched, how='left', suffixes=('','\x00'))
       .sort_index(axis=1).bfill(axis=1)[df1.columns])

The idea here is to left-merge and, by sorting the columns (we use '\x00' as the suffix for columns from df2 since it's the character with the lowest Unicode value), make sure the same-named columns end up next to each other. Then use bfill horizontally to update df1 with values from df2.
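
The ordering argument can be checked with plain Python string sorting, which is assumed here to match how sort_index orders string column labels. The column name "C30" and the suffix "_y" are hypothetical, added only to show what a less careful suffix could do:

```python
# Column labels as they would look after the merge, in arbitrary order.
columns = ["C4", "C3\x00", "C1", "C3", "C2"]
ordered = sorted(columns)

# With the '\x00' suffix, the df2 copy sorts directly after its df1 twin,
# even if another column name starts with the same prefix.
with_nul = sorted(["C30", "C3\x00", "C3"])

# With a suffix such as '_y', a column like 'C30' would land in between.
with_y = sorted(["C30", "C3_y", "C3"])

print([repr(c) for c in ordered])
print([repr(c) for c in with_nul])
print([repr(c) for c in with_y])
```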


Example:

Suppose you had df1:

   C1 C2   C3  C4
0   1  a  1.0   0
1   1  b  NaN   1
2   2  b  NaN   2
3   2  b  NaN   3

and df2:

   C1 C2  C3
0   1  b   2
1   2  b   3

and you want to fill in the missing values in df1 with the values in df2 for each C1-C2 pair. Then

cols_to_be_matched = ['C1', 'C2']

and all of the snippets above produce the following output (where the values are indeed filled as required):

   C1 C2   C3  C4
0   1  a  1.0   0
1   1  b  2.0   1
2   2  b  3.0   2
3   2  b  3.0   3
