
Fill missing rows in a python pandas dataframe using similar rows

Suppose I have this kind of DataFrame:

Data:   Lat    Long   Postal Code
    0   41     32     01556
    1   32     31     01023
    2   31     33     01023
    3   NaN    NaN    01023
    4   33     42     01775
    5   40     44     01999

As you can see, rows 1, 2, and 3 have the same postal code. So, to fill the NaNs in row 3, it would be nice to just use the average of the other two rows (1 and 2). How can I generalize this for a large dataset?

  • For each row with NaN data in Lat/Long,
    • find other rows with the same postal code,
    • then compute the mean of their Lat/Long values,
    • and use it to replace the NaNs.

IIUC,

groupby, transform, fillna()

We first select a slice of the dataframe and use fillna so that only the missing values are filled; we don't want to overwrite any of the existing data.

We then use the groupby function to group by postal code, as you requested.

We use the transform method, which returns the group means aligned to the original index and length of your data.

We assign the result back to your columns, which gives the output below.
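To illustrate why transform is used rather than a plain aggregation, here is a minimal sketch. It assumes a dataframe rebuilt from the sample in the question, with postal codes stored as strings; the names `grouped`, `per_group`, and `per_row` are only illustrative.

```python
import numpy as np
import pandas as pd

# Sample data mirroring the question (postal codes kept as strings here, an assumption).
df = pd.DataFrame({
    "Lat": [41, 32, 31, np.nan, 33, 40],
    "Long": [32, 31, 33, np.nan, 42, 44],
    "Postal Code": ["01556", "01023", "01023", "01023", "01775", "01999"],
})

grouped = df.groupby("Postal Code")[["Lat", "Long"]]

# A plain aggregation collapses the data to one row per postal code.
per_group = grouped.mean()

# transform broadcasts each group's mean back onto every original row,
# so the result shares df's index and can be passed straight to fillna.
per_row = grouped.transform("mean")

print(per_group.shape)  # one row per distinct postal code
print(per_row.shape)    # one row per row of df
```

Because `per_row` is aligned with `df`, fillna can match values row by row; the aggregated `per_group` is indexed by postal code and would not line up.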

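    # Fill only the missing Lat/Long values with the mean of their postal code group.
    # Select the columns with a list ([[...]]); the tuple-style ["Lat", "Long"]
    # selection after groupby is no longer supported in recent pandas versions.
    df[["Lat", "Long"]] = df[["Lat", "Long"]].fillna(
        df.groupby("Postal Code")[["Lat", "Long"]].transform("mean")
    )
    print(df)

       Data   Lat  Long  Postal Code
    0     0  41.0  32.0         1556
    1     1  32.0  31.0         1023
    2     2  31.0  33.0         1023
    3     3  31.5  32.0         1023
    4     4  33.0  42.0         1775
    5     5  40.0  44.0         1999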
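For a larger dataset, the same idea can be wrapped in a small helper. The following is a sketch under stated assumptions, not a definitive implementation: `fill_with_group_mean` is a hypothetical name, and the sample data is altered so that postal code 01999 has no known coordinates, to show what happens when a group has nothing to average.

```python
import numpy as np
import pandas as pd

def fill_with_group_mean(frame, key, cols):
    """Return a copy of frame where NaNs in cols are replaced by the
    mean of the rows that share the same value in the key column."""
    out = frame.copy()
    # Per-row group means, aligned with the original index.
    group_means = out.groupby(key)[cols].transform("mean")
    # fillna only touches missing cells; existing values are kept.
    out[cols] = out[cols].fillna(group_means)
    return out

# Illustrative data: row 3 can borrow from rows 1 and 2,
# while row 5 is alone in its postal code and has no values.
df = pd.DataFrame({
    "Lat": [41, 32, 31, np.nan, 33, np.nan],
    "Long": [32, 31, 33, np.nan, 42, np.nan],
    "Postal Code": ["01556", "01023", "01023", "01023", "01775", "01999"],
})

filled = fill_with_group_mean(df, "Postal Code", ["Lat", "Long"])
print(filled)
```

Note that a postal code whose rows are all NaN has a NaN mean, so those rows stay unfilled and need a separate fallback if one is wanted.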
