Setting values with pandas.DataFrame

Having this DataFrame:

import pandas

dates = pandas.date_range('2016-01-01', periods=5, freq='H')
s = pandas.Series([0, 1, 2, 3, 4], index=dates)
df = pandas.DataFrame([(1, 2, s, 8)], columns=['a', 'b', 'foo', 'bar'])
df.set_index(['a', 'b'], inplace=True)

df

I would like to replace the Series in there with a new one that is simply the old one, but resampled to a day period (i.e. x.resample('D').sum().dropna()).

When I try:

df['foo'][0] = df['foo'][0].resample('D').sum().dropna()

That seems to work well.

However, I get a warning:

SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

The question is, how should I do this instead?

Notes

Things I have tried that do not work (with or without resampling, the assignment raises an exception):

df.iloc[0].loc['foo'] = df.iloc[0].loc['foo']
df.loc[(1, 2), 'foo'] = df.loc[(1, 2), 'foo']
df.loc[df.index[0], 'foo'] = df.loc[df.index[0], 'foo']

A bit more information about the data (in case it is relevant):

  • The real DataFrame has more columns in the multi-index. Not all of them are necessarily integers; more generally they are numerical and categorical. The index is unique (i.e. there is only one row with a given index value).
  • The real DataFrame has, of course, many more rows in it (thousands).
  • There are not necessarily only two columns in the DataFrame, and there may be more than one column containing a Series type. Columns usually contain series, categorical data and numerical data as well. Any single column is always single-typed (either numerical, or categorical, or series).
  • The series contained in each cell usually have a variable length (i.e. two series/cells in the DataFrame do not, unless by pure coincidence, have the same length, and will probably never have the same index anyway, as dates vary between series as well).

Using Python 3.5.1 and Pandas 0.18.1.

This should work:

df.iat[0, df.columns.get_loc('foo')] = df['foo'][0].resample('D').sum().dropna()

Pandas is complaining about chained indexing, but when you avoid chaining it has trouble assigning a whole Series to a single cell. With iat you can force the assignment. I don't think it is a preferable thing to do, but it seems like a working solution.
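For anyone who wants to check the iat approach end to end, here is a minimal self-contained sketch. It rebuilds the frame from the question, assigns the resampled Series through iat, and reads the cell back. The names hour, col and cell are just illustrative, and the frequency is given as a Timedelta (my assumption, to sidestep version-specific aliases such as 'H').

```python
import pandas as pd

# One-hour step as a Timedelta (assumption: avoids version-specific aliases like 'H').
hour = pd.Timedelta(hours=1)
dates = pd.date_range('2016-01-01', periods=5, freq=hour)
s = pd.Series([0, 1, 2, 3, 4], index=dates)

# Same layout as in the question: one row whose 'foo' cell holds a whole Series.
df = pd.DataFrame([(1, 2, s, 8)], columns=['a', 'b', 'foo', 'bar'])
df = df.set_index(['a', 'b'])

# Positional scalar access: row 0, and the integer position of the 'foo' column.
col = df.columns.get_loc('foo')
df.iat[0, col] = df.iat[0, col].resample('D').sum().dropna()

# Read the cell back to confirm it now holds the daily Series.
cell = df.iat[0, col]
print(type(cell).__name__, len(cell))
```

Reading and writing through the same iat call avoids the chained df['foo'][0] lookup that triggers the warning.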

Hierarchical data in pandas

It really seems like you should consider restructuring your data to take advantage of pandas features such as MultiIndex and DatetimeIndex. This will allow you to still operate on an index in the typical way while being able to select on multiple columns across the hierarchical data (a, b, and bar).

Restructured Data

import pandas as pd

# Define Index
dates = pd.date_range('2016-01-01', periods=5, freq='H')
# Define Series
s = pd.Series([0, 1, 2, 3, 4], index=dates)

# Place Series in Hierarchical DataFrame
# from_arrays expects one array-like per level, so each level is a one-element list
heirIndex = pd.MultiIndex.from_arrays([[1], [2], [8]], names=['a', 'b', 'bar'])
df = pd.DataFrame(s, columns=heirIndex)

print(df)

a                    1
b                    2
bar                  8
2016-01-01 00:00:00  0
2016-01-01 01:00:00  1
2016-01-01 02:00:00  2
2016-01-01 03:00:00  3
2016-01-01 04:00:00  4

Resampling

With the data in this format, resampling becomes very simple.

# Simple Direct Resampling
df_resampled = df.resample('D').sum().dropna()

print(df_resampled)

a            1
b            2
bar          8
2016-01-01  10

Update (from data description)

If the data has variable-length Series, each with a different index, and non-numeric categories, that is OK. Let's make an example:

# Define first Series
dates = pd.date_range('2016-01-01', periods=5, freq='H')
s = pd.Series([0, 1, 2, 3, 4], index=dates)

# Define second Series
dates2 = pd.date_range('2016-01-14', periods=6, freq='H')
s2 = pd.Series([-200, 10, 24, 30, 40, 100], index=dates2)

# Define DataFrames (from_arrays expects one array-like per level)
df1 = pd.DataFrame(s, columns=pd.MultiIndex.from_arrays([[1], [2], [8], ['cat1']], names=['a', 'b', 'bar', 'c']))
df2 = pd.DataFrame(s2, columns=pd.MultiIndex.from_arrays([[2], [5], [5], ['cat3']], names=['a', 'b', 'bar', 'c']))

df = pd.concat([df1, df2])
print(df)

a                      1      2
b                      2      5
bar                    8      5
c                   cat1   cat3
2016-01-01 00:00:00  0.0    NaN
2016-01-01 01:00:00  1.0    NaN
2016-01-01 02:00:00  2.0    NaN
2016-01-01 03:00:00  3.0    NaN
2016-01-01 04:00:00  4.0    NaN
2016-01-14 00:00:00  NaN -200.0
2016-01-14 01:00:00  NaN   10.0
2016-01-14 02:00:00  NaN   24.0
2016-01-14 03:00:00  NaN   30.0
2016-01-14 04:00:00  NaN   40.0
2016-01-14 05:00:00  NaN  100.0
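To illustrate selecting across the hierarchical column levels, here is a small sketch using DataFrame.xs on a frame shaped like the one above. The helper names (hour, levels, cat3, a1) are only illustrative, and the frequency is passed as a Timedelta, which is my assumption to keep the snippet independent of version-specific aliases.

```python
import pandas as pd

hour = pd.Timedelta(hours=1)  # assumption: avoids version-specific aliases like 'H'
s1 = pd.Series([0, 1, 2, 3, 4],
               index=pd.date_range('2016-01-01', periods=5, freq=hour))
s2 = pd.Series([-200, 10, 24, 30, 40, 100],
               index=pd.date_range('2016-01-14', periods=6, freq=hour))

levels = ['a', 'b', 'bar', 'c']
df1 = pd.DataFrame(s1.values, index=s1.index,
                   columns=pd.MultiIndex.from_arrays([[1], [2], [8], ['cat1']], names=levels))
df2 = pd.DataFrame(s2.values, index=s2.index,
                   columns=pd.MultiIndex.from_arrays([[2], [5], [5], ['cat3']], names=levels))
df = pd.concat([df1, df2])

# Cross-section on the columns: keep only columns whose level 'c' equals 'cat3'.
cat3 = df.xs('cat3', axis=1, level='c')

# Same idea on level 'a': keep only columns where a == 1.
a1 = df.xs(1, axis=1, level='a')

print(cat3.dropna())
print(a1.dropna())
```

Each selection drops the level it was made on and keeps the remaining levels as column labels.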

The only issue arises after resampling: you will want to pass how='all' when dropping NaN rows, like this:

# Simple Direct Resampling
df_resampled = df.resample('D').sum().dropna(how='all')

print(df_resampled)

a              1    2
b              2    5
bar            8    5
c           cat1 cat3
2016-01-01  10.0  NaN
2016-01-14   NaN  4.0
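One caveat for readers on newer pandas: since version 0.22 the sum of an empty or all-NaN bin is 0 rather than NaN, so days without data would survive dropna(how='all'). A hedged sketch of one way to get the NaN-based output shown above is to pass min_count=1 to sum. The column labels 'first' and 'second' are placeholders of mine, not the MultiIndex from the answer.

```python
import pandas as pd

hour = pd.Timedelta(hours=1)  # assumption: avoids version-specific aliases like 'H'
s1 = pd.Series([0, 1, 2, 3, 4],
               index=pd.date_range('2016-01-01', periods=5, freq=hour))
s2 = pd.Series([-200, 10, 24, 30, 40, 100],
               index=pd.date_range('2016-01-14', periods=6, freq=hour))

# Placeholder column labels; the answer uses MultiIndex columns instead.
df = pd.concat([s1, s2], axis=1, keys=['first', 'second'])

# min_count=1 makes a bin with no valid values sum to NaN instead of 0,
# so dropna(how='all') can remove the days on which neither series has data.
daily = df.resample('D').sum(min_count=1).dropna(how='all')
print(daily)
```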

Just set df.is_copy = False before assigning the new value.
