
Updating dataframe with rows of variable size in Pandas/Python

I have imported an Excel sheet into a Pandas dataframe. Blank values were replaced with the string 'NA'. What I want to do is, for each row, replace the values based on the index of a dictionary or dataframe.

df1 = pd.DataFrame(
    {'c1':['a','a','b','b'], 'c2':['1','2','1','3'], 'c3':['2','NA','3','NA']},index=['first','second','third','last'])

>>> df1
       c1 c2  c3
first  a  1    2
second a  2    NA
third  b  1    3
last   b  3    NA

and I want to replace the values in each row according to the index of another dataframe (or dict).

df2=pd.DataFrame(
    {'val':['v1','v2','v3']},index=['1','2','3'])

>>> df2
   val
1  v1  
2  v2 
3  v3 

Such that the output becomes

>>> out
       c1 c2  c3
first  a  v1  v2
second a  v2  NA
third  b  v1  v3
last   b  v3  NA

How would you do this with Pandas and/or Python? One way would be to search row by row, but perhaps there is an easier way?

Edit: Importantly, performance is an issue in my real case, since the 'df1' I am dealing with is 4653 rows × 1984 columns.

Thank you in advance

One way would be a stack + replace + unstack combo:

df1.stack().replace(df2.val).unstack()
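A runnable sketch of this combo on the example frames from the question:

```python
import pandas as pd

# Example frames from the question
df1 = pd.DataFrame(
    {'c1': ['a', 'a', 'b', 'b'],
     'c2': ['1', '2', '1', '3'],
     'c3': ['2', 'NA', '3', 'NA']},
    index=['first', 'second', 'third', 'last'])
df2 = pd.DataFrame({'val': ['v1', 'v2', 'v3']}, index=['1', '2', '3'])

# stack() flattens df1 into a Series, replace() swaps any value that
# appears in df2.val's index ('1' -> 'v1', ...), and unstack() restores
# the original shape; values like 'a' or 'NA' pass through untouched
out = df1.stack().replace(df2.val).unstack()
print(out)
```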


Original answer

s = df2.squeeze()
df1.replace(s)

replace is very, very slow. For a larger data set like yours, check the following example, which processes over 30 million values (more than your roughly 10 million) in about 20 seconds. The lookup Series contains 900k values drawn from 0 to 1 million.

map is much, much faster. The only issue with map is that it returns missing (NaN) for values it cannot find, so you will have to use fillna with the original DataFrame to restore those values.
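On the small example from the question, that map + fillna pattern would look like this (a sketch using the df1/df2 frames shown earlier):

```python
import pandas as pd

df1 = pd.DataFrame(
    {'c1': ['a', 'a', 'b', 'b'],
     'c2': ['1', '2', '1', '3'],
     'c3': ['2', 'NA', '3', 'NA']},
    index=['first', 'second', 'third', 'last'])
df2 = pd.DataFrame({'val': ['v1', 'v2', 'v3']}, index=['1', '2', '3'])

# map() yields NaN for anything absent from df2.val's index
# ('a', 'b', and the 'NA' strings), so fillna(df1) restores
# those original values afterwards
out = df1.stack().map(df2.val).unstack().fillna(df1)
print(out)
```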

import numpy as np
import pandas as pd

n = 10000000
df = pd.DataFrame({'c1': np.random.choice(list('abcdefghijkl'), n),
                   'c2': np.random.randint(0, 1000000, n),
                   'c3': np.random.randint(0, 1000000, n)})

s = pd.Series(index=np.random.choice(np.arange(1000000), 900000, replace=False), 
              data=np.random.choice(list('adsfjhqwoeriouzxvmn'), 900000, replace=True))

df.stack().map(s).unstack().fillna(df)

You can also do the following, which runs faster on my data, but your data is very wide, so it might be slower:

df.apply(lambda x: x.map(s)).fillna(df)
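Both routes should produce identical results; here is a small sanity check, using a tiny lookup Series in place of the large benchmark one:

```python
import pandas as pd

df = pd.DataFrame({'c1': list('abab'),
                   'c2': ['1', '2', '1', '3'],
                   'c3': ['2', 'NA', '3', 'NA']})
s = pd.Series(['v1', 'v2', 'v3'], index=['1', '2', '3'])

# mapping column by column avoids the stack/unstack reshape entirely
per_col = df.apply(lambda x: x.map(s)).fillna(df)
stacked = df.stack().map(s).unstack().fillna(df)
print(per_col.equals(stacked))
```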

And on a DataFrame similar to yours, it completes in about 6 seconds.

df = pd.DataFrame(np.random.randint(0, 1000000, (5000, 2000)))
df.stack().map(s).unstack().fillna(df)
