I've got a DataFrame
which contains stock values.
It looks like this:
>>>Data Open High Low Close Volume Adj Close Date
2013-07-08 76.91 77.81 76.85 77.04 5106200 77.04
When I try to make a conditional new column with the following if statement:
Data['Test'] =Data['Close'] if Data['Close'] > Data['Open'] else Data['Open']
I get the following error:
Traceback (most recent call last):
File "<pyshell#116>", line 1, in <module>
Data[1]['Test'] =Data[1]['Close'] if Data[1]['Close'] > Data[1]['Open'] else Data[1]['Open']
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
I then used a.all()
:
Data[1]['Test'] =Data[1]['Close'] if all(Data[1]['Close'] > Data[1]['Open']) else Data[1]['Open']
The result was that the entire ['Open']
Column was selected. I didn't get the condition that I wanted, which is to select every time the biggest value between the ['Open']
and ['Close']
columns.
Any help is appreciated.
Thanks.
From a DataFrame like:
>>> df
Date Open High Low Close Volume Adj Close
0 2013-07-08 76.91 77.81 76.85 77.04 5106200 77.04
1 2013-07-00 77.04 79.81 71.81 72.87 1920834 77.04
2 2013-07-10 72.87 99.81 64.23 93.23 2934843 77.04
The simplest thing I can think of would be:
>>> df["Test"] = df[["Open", "Close"]].max(axis=1)
>>> df
Date Open High Low Close Volume Adj Close Test
0 2013-07-08 76.91 77.81 76.85 77.04 5106200 77.04 77.04
1 2013-07-00 77.04 79.81 71.81 72.87 1920834 77.04 77.04
2 2013-07-10 72.87 99.81 64.23 93.23 2934843 77.04 93.23
df.ix[:,["Open", "Close"]].max(axis=1)
might be a little faster, but I don't think it's as nice to look at.
Alternatively, you could use .apply
on the rows:
>>> df["Test"] = df.apply(lambda row: max(row["Open"], row["Close"]), axis=1)
>>> df
Date Open High Low Close Volume Adj Close Test
0 2013-07-08 76.91 77.81 76.85 77.04 5106200 77.04 77.04
1 2013-07-00 77.04 79.81 71.81 72.87 1920834 77.04 77.04
2 2013-07-10 72.87 99.81 64.23 93.23 2934843 77.04 93.23
Or fall back to numpy:
>>> df["Test"] = np.maximum(df["Open"], df["Close"])
>>> df
Date Open High Low Close Volume Adj Close Test
0 2013-07-08 76.91 77.81 76.85 77.04 5106200 77.04 77.04
1 2013-07-00 77.04 79.81 71.81 72.87 1920834 77.04 77.04
2 2013-07-10 72.87 99.81 64.23 93.23 2934843 77.04 93.23
The basic problem is that if/else
doesn't play nicely with arrays, because if (something)
always coerces the something
into a single bool
. It's not equivalent to "for every element in the array something, if the condition holds" or anything like that.
In [7]: df = DataFrame(randn(10,2),columns=list('AB'))
In [8]: df
Out[8]:
A B
0 -0.954317 -0.485977
1 0.364845 -0.193453
2 0.020029 -1.839100
3 0.778569 0.706864
4 0.033878 0.437513
5 0.362016 0.171303
6 2.880953 0.856434
7 -0.109541 0.624493
8 1.015952 0.395829
9 -0.337494 1.843267
This is a where conditional, saying give me the value for A if A > B, else give me B
# this syntax is EQUIVALENT to
# df.loc[df['A']>df['B'],'A'] = df['B']
In [9]: df['A'].where(df['A']>df['B'],df['B'])
Out[9]:
0 -0.485977
1 0.364845
2 0.020029
3 0.778569
4 0.437513
5 0.362016
6 2.880953
7 0.624493
8 1.015952
9 1.843267
dtype: float64
In this case max
is equivalent
In [10]: df.max(1)
Out[10]:
0 -0.485977
1 0.364845
2 0.020029
3 0.778569
4 0.437513
5 0.362016
6 2.880953
7 0.624493
8 1.015952
9 1.843267
dtype: float64
The issue is that you're asking python to evaluate a condition ( Data['Close'] > Data['Open']
) which contains more than one boolean value. You do not want to use any
or all
since either, since that will set Data['Test']
to either Data['Open']
or Data['Close']
.
There might be a cleaner method, but one approach is to use a mask (boolean array):
mask = Data['Close'] > Data['Open']
Data['Test'] = pandas.concat([Data['Close'][mask].dropna(), Data['Open'][~mask].dropna()]).reindex_like(Data)
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.