Receiving NaN for a column in pandas DataFrame

This is a piece of code (an exercise) from the O'Reilly book Python for Data Analysis.

from pandas import Series, DataFrame
import pandas.io.data as web

all_data = {}
for ticker in ['AAPL', 'IBM', 'MSFT', 'GOOG']:
    all_data[ticker] = web.get_data_yahoo(ticker)

price = DataFrame({k: v['Adj Close'] for k,v in all_data.items()})

The strange thing is that when I look at the resulting DataFrame, the contents for Google are always NaN:

[Image: DataFrame]

I know that the code is not what you'd call optimal, but these are book exercises and I'm trying to learn from them by experimenting.

If I take only the data relating to Google and make a DataFrame out of that, the actual figures appear:

DataFrame(all_data['GOOG']['Adj Close']).head()

[Image: Google DataFrame]

But when I try to do the same thing for all ticker symbols, it goes wrong again:

DataFrame([all_data['GOOG']['Adj Close'],
         all_data['AAPL']['Adj Close'],
         all_data['IBM']['Adj Close'],
         all_data['MSFT']['Adj Close']],
         index=['GOOG', 'AAPL', 'IBM', 'MSFT']).T.head()

[Image: DataFrame, all tickers]

Any insight as to what might be causing this would be greatly appreciated!

Version info:

  • Python 3.4.2
  • pandas (0.16.2)
  • numpy (1.9.2)

Google now has two classes of publicly traded stock: the class C shares ("GOOG") were issued in 2014, and the original class A shares trade under "GOOGL". Article here with some more info.

So to have the complete history for all 4, just change the ticker from "GOOG" to "GOOGL". This is also a pretty good example of what it means for data to be "missing". If you wanted to filter to the dates common to those original 4 tickers, you could do price = price.dropna()
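As a rough illustration of the dropna() suggestion, here is a minimal sketch on a small hand-made frame. The dates and prices are invented stand-ins for the downloaded data, not real quotes, and only two tickers are shown to keep it short.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the downloaded prices (assumption: values are invented).
dates = pd.date_range("2014-03-25", periods=5, freq="B")
price = pd.DataFrame(
    {
        "AAPL": [77.0, 76.5, 76.8, 76.7, 76.6],
        "GOOG": [np.nan, np.nan, 558.5, 559.9, 556.9],
    },
    index=dates,
)

# dropna() removes every row (date) in which at least one ticker is NaN,
# leaving only the dates that all tickers have in common.
common = price.dropna()
print(common)
```

With the default arguments, dropna() drops a row if any column is missing, which is what "common dates" needs here.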

You are not looking at the full data. Look at the dates in your two rearrangements.

>>> price.GOOG.isnull().sum()
1064

Try tail():

>>> price.GOOG.head()
Date
2010-01-04   NaN
2010-01-05   NaN
2010-01-06   NaN
2010-01-07   NaN
2010-01-08   NaN

>>> price.GOOG.tail()
Date
2015-08-24    589.609985
2015-08-25    582.059998
2015-08-26    628.619995
2015-08-27    637.609985
2015-08-28    630.380005
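Instead of comparing head() and tail() by eye, one could also ask pandas directly where the real values begin. The following is a sketch on an invented series; `goog`, its dates, and its values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Invented series that, like price.GOOG, starts with a run of NaN.
dates = pd.date_range("2014-03-25", periods=5, freq="B")
goog = pd.Series([np.nan, np.nan, 558.5, 559.9, 556.9], index=dates, name="GOOG")

n_missing = goog.isnull().sum()        # how many rows are NaN
first_date = goog.first_valid_index()  # index label of the first non-NaN value

print(n_missing)
print(first_date)
```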

I suspect the underlying reason is a RIC change on the part of Google. They have changed their share structure several times to keep control of voting rights etc. So the stock price is not defined for that stock identifier before a certain date.
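The way a shorter price history turns into leading NaN values can be sketched as follows. The series are invented; the point is only that building a DataFrame from a dict of Series aligns them on the union of their indexes.

```python
import pandas as pd

# Invented data: one series covers five business days, the other only the last three.
long_idx = pd.date_range("2014-03-25", periods=5, freq="B")
short_idx = long_idx[2:]

aapl = pd.Series([77.0, 76.5, 76.8, 76.7, 76.6], index=long_idx)
goog = pd.Series([558.5, 559.9, 556.9], index=short_idx)

# The resulting index is the union of both date indexes, so dates that
# exist only for AAPL get NaN in the GOOG column.
aligned = pd.DataFrame({"AAPL": aapl, "GOOG": goog})
print(aligned)
```

Because the frame is sorted by date, those NaN rows sit at the top, which is why head() on the full frame shows nothing but NaN for the shorter series.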

It might help to use an IDE like Spyder: you can view the full data frame in a MATLAB-like way, which stops this kind of thing from happening.
