
Updating a dataframe column using a comprehension

I have been updating dataframe columns with list comprehensions for a while without issue. When I filter the dataframe first, however, the column is not updated, even though the comprehension returns the correct values. The example below is contrived, purely to illustrate the issue.

I first update the Town column to be the same as Region, if Region is populated. I then try to find a value for Town in the Address wherever Town is still empty. The issue is that the second update statement does not work.

It's clear my understanding of comprehensions is not adequate, so I would appreciate advice on what I am doing wrong. Thanks!

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import math
import pyodbc

#create dataframe

data = [{'Address': '123 Fake st, someTown, Nebraska', 'Region':'nebraska', 'metric1':50,'Town':''},
    {'Address': '2345 Fake st, someTown, Nebraska', 'Region':'nebraska', 'metric1':50,'Town':''},
    {'Address': '43 Fake st, someTown, Nebraska', 'Region':'nebraska', 'metric1':50,'Town':''},
    {'Address': '1 Fake st, someTown, Nebraska', 'Region':'nebraska', 'metric1':50,'Town':''},
    {'Address': '43 Fake st, someTown, NOBraska', 'Region':'', 'metric1':50,'Town':''},
    {'Address': '6 Fake st, someTown, NOBraska', 'Region':'', 'metric1':50,'Town':''},
    {'Address': '45 Fake st, someTown, NOBraska', 'Region':'', 'metric1':50,'Town':''},]

dataset = pd.DataFrame(data)

#set Town column to the region.
dataset['Town'] = [r for r in dataset['Region']]

#if Town column is still blank, find the region in the Address, correcting for a known bad spelling
dataset[dataset['Town'] =='']['Town']  =  ['Nebraska' if sub.split(",")[2].strip() =='NOBraska' else sub.split(",")[2].strip() for sub in dataset[dataset['Town'] =='']['Address'].astype(str)]  

#RESULT: dataset['Town'] is not updated for the rows where it is empty

The problem here is that by using the df[rows][cols] access method (chained indexing), you are not assigning to the original DataFrame values, but to a copy.

You should indeed receive a warning like:

SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

This situation is described in detail here.
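To make the copy behaviour concrete, here is a minimal sketch on a small hypothetical frame (the names `df`, `mask`, `chained_result` and `loc_result` are made up for illustration, not taken from the question). It assumes pandas' default settings, under which the chained assignment only emits a warning.

```python
import warnings

import pandas as pd

# Hypothetical three-row frame, just large enough to show the effect.
df = pd.DataFrame({'Town': ['', 'Lincoln', ''], 'Region': ['a', 'b', 'c']})
mask = df['Town'] == ''

# Chained indexing: df[mask] builds a new DataFrame, so the assignment
# lands on that temporary object and df itself is left unchanged.
with warnings.catch_warnings():
    warnings.simplefilter('ignore')  # silence the chained-assignment warning for the demo
    df[mask]['Town'] = 'filled'
chained_result = df['Town'].tolist()

# A single .loc call addresses the rows and the column of df directly.
df.loc[mask, 'Town'] = 'filled'
loc_result = df['Town'].tolist()

print(chained_result)
print(loc_result)
```

Only the second assignment should change `df`; the first one is silently lost once the temporary object is discarded.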

As a rule, you should always use .iloc or .loc when assigning to a slice of the DataFrame.
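As a rough illustration of the two accessors, the sketch below assigns once by label with .loc and once by integer position with .iloc. The frame, its index labels and the threshold are hypothetical and chosen only for the demonstration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a non-default index to show label vs position.
df = pd.DataFrame(
    {'Town': ['', 'Lincoln', ''], 'metric1': [50, 60, 70]},
    index=['x', 'y', 'z'],
)

# Label-based: a boolean mask selects the rows, the column is given by name.
df.loc[df['Town'] == '', 'Town'] = 'unknown'

# Position-based: integer row positions plus the integer position of the column.
row_positions = np.flatnonzero((df['metric1'] > 55).to_numpy())
col_position = df.columns.get_loc('metric1')
df.iloc[row_positions, col_position] = 0

print(df)
```

In both cases the rows and the column are passed to one indexer call, so the assignment is applied to `df` itself.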

Here is an example of how you can rewrite your assignment to actually modify the DataFrame:

new_values = ['Nebraska' if sub.split(",")[2].strip() =='NOBraska'
              else sub.split(",")[2].strip()
              for sub in dataset[dataset['Town'] =='']['Address'].astype(str)]

# In this way I am getting the labels of the index, so that I can use .loc
empty_town_rows = dataset.index[dataset['Town'] =='']

dataset.loc[empty_town_rows, 'Town']  =  new_values

Personally, I always prefer to use .loc/.iloc when modifying values of the DataFrame, so I would also rewrite the first assignment. This is not necessary, though, as there is no view-versus-copy issue there.

dataset.loc[:, 'Town'] = [r for r in dataset['Region']]

I would suggest that you use loc to update values in a data frame.

In your case, you should use

dataset.loc[dataset['Town'] =='', 'Town'] = ['Nebraska' if sub.split(",")[2].strip() =='NOBraska' else sub.split(",")[2].strip() for sub in dataset[dataset['Town'] =='']['Address'].astype(str)]

Personally, I would suggest doing it this way:

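# Use Region when populated; otherwise parse the Address, correcting the known misspelling
updateTown = lambda row: row["Region"] if row["Region"] else row["Address"].split(",")[2].strip().replace('NOBraska', 'Nebraska')
dataset['Town'] = dataset.apply(updateTown, axis=1)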

@FLab describes the problem well.

But your code can be improved further to make it more performant / readable:

def replacer(sub):
    x = sub.split(',')[2].strip()
    return 'Nebraska' if x == 'NOBraska' else x

dataset.loc[dataset['Town'] == '', 'Town']  =  \
dataset.loc[dataset['Town'] == '', 'Address'].astype(str).apply(replacer)
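As a possible variation on the same idea, the sketch below swaps the `replacer` function for pandas' string accessor and `Series.replace`. This is not what the answer above uses; it is an alternative under the same assumption that the region is always the third comma-separated field of Address. The two-row `dataset` is a cut-down stand-in for the question's data.

```python
import pandas as pd

# Cut-down stand-in for the question's data: one row already filled, one empty.
dataset = pd.DataFrame({
    'Address': ['123 Fake st, someTown, Nebraska', '43 Fake st, someTown, NOBraska'],
    'Town': ['nebraska', ''],
})

mask = dataset['Town'] == ''

# Third comma-separated field of each selected address, surrounding whitespace removed.
parsed = dataset.loc[mask, 'Address'].astype(str).str.split(',').str[2].str.strip()

# Correct the known misspelling; all other values pass through unchanged.
dataset.loc[mask, 'Town'] = parsed.replace({'NOBraska': 'Nebraska'})

print(dataset['Town'].tolist())
```

Because `parsed` is a Series that keeps the index of the selected rows, the .loc assignment aligns on index labels rather than relying on list order.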
