
Wrong scale in Yahoo Finance (Python)

I am trying to download historical prices from Yahoo Finance for a few mutual funds and ETFs. I think there is a bug in Yahoo Finance that confuses the scale of the prices (I don't think it is a split issue; see the screenshots below).

In any case, here is an MWE to reproduce the problem:

import yfinance as yf
import pandas as pd
from pandas_datareader import data as pdr
yf.pdr_override()

tickers = ["0P0001FE43.L", "0P00014IJX.L", "SGLN.L"]

start_date = "2019-01-01"
today      = "2021-04-27"

files = []

def getData(ticker):
    data = pdr.get_data_yahoo(ticker, start=start_date, end=today)
    data["yahoo_ticker"] = ticker
    files.append(data)

for tik in tickers:
    getData(tik)

df = pd.concat(files)
df = df[["Adj Close", "yahoo_ticker"]]

A closer look at the adjusted prices shows the problem:

[Screenshots of the adjusted price series for each ticker, showing the sudden scale jumps]

I couldn't think of any systematic way to correct for this problem, so I would appreciate any help.

I checked SGLN.L quickly on Yahoo's finance page and the data aligns. I'm not a market expert, so I can't say what's going on here, but that data seems to match. Also, the up/down volume becomes a lot more volatile during that period, so that could have something to do with it.

[Yahoo Finance screenshot]

To catch and fix one-day events I usually check for very big one-day gaps. Something like the following, which compares each close with the previous one and the next one, should fix the 1/100 or ×100 problems, but nothing prevents you from setting another threshold or using multiple thresholds at once. Comparing against both neighbours avoids applying a correction when the gap in the historical series is due to a badly managed split.

Code:

fixed = [l[0]]

i = 1
while i < len(l) - 1:
    p = l[i] / l[i - 1]   # ratio to the previous close
    n = l[i + 1] / l[i]   # ratio of the next close to this one

    if p < 0.1 and n >= 100:
        # both neighbours agree the value is ~100x too small
        fixed.append(l[i] * 100)
    elif p >= 100 and n < 0.1:
        # both neighbours agree the value is ~100x too big
        fixed.append(round(l[i] / 100, 5))
    else:
        fixed.append(l[i])

    i += 1

# Last value: compare against the (already fixed) previous close
p = fixed[-1] / l[-1]

if p >= 100:
    fixed.append(l[-1] * 100)
elif p < 0.1:
    fixed.append(round(l[-1] / 100, 5))
else:
    fixed.append(l[-1])

Output:

l = [171.062, 1.71945, 172.901, 172.184]
[171.062, 171.945, 172.901, 172.184]

l = [1.71062, 1.71945, 172.901, 1.72184]
[1.71062, 1.71945, 1.72901, 1.72184]

While this works, if you can download all the (adjusted) OHLC fields for a security (I use Bloomberg Professional as a data provider and I don't know whether Yahoo supports it), you should compare them against each other instead of comparing only the close against the previous one and the next one. This is a more robust approach, and at the same time you can also check whether Open and Close fall inside the High-Low range, and catch other more common but subtle errors.
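As a sketch of that idea (the `ohlc_sanity` helper and its tolerance are made up for illustration, assuming a DataFrame with Open/High/Low/Close columns), one could flag any row where a field disagrees with its siblings or where Open/Close fall outside the High-Low range:

```python
import pandas as pd

def ohlc_sanity(df, tol=0.5):
    """Flag rows whose OHLC fields disagree with each other."""
    ohlc = df[["Open", "High", "Low", "Close"]]
    row_median = ohlc.median(axis=1)
    # a field scaled by 100 (or 1/100) sits far from the row median
    scale_error = (ohlc.div(row_median, axis=0) - 1).abs().gt(tol).any(axis=1)
    # Open and Close must lie inside the High-Low range
    range_error = (
        df["Open"].gt(df["High"]) | df["Open"].lt(df["Low"])
        | df["Close"].gt(df["High"]) | df["Close"].lt(df["Low"])
    )
    return scale_error | range_error

# toy data: the third row's Open was reported 100x too small
quotes = pd.DataFrame({
    "Open":  [100.0, 101.0, 1.02],
    "High":  [101.0, 102.0, 103.0],
    "Low":   [ 99.0, 100.0, 101.0],
    "Close": [100.5, 101.5, 102.0],
})
bad = ohlc_sanity(quotes)  # True only for the third row
```

Flagged rows could then be repaired by rescaling the outlying field toward the row median, or dropped and interpolated.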

I found another solution that I think should be more robust than my other answer. This does not look at the whole history but looks for single-day jumps of 50x or more, which should never occur "naturally".

import yfinance as yf
import pandas as pd
from pandas_datareader import data as pdr
yf.pdr_override()

tickers = ["0P0001FE43.L", "0P00014IJX.L", "SGLN.L"]

start_date = "2019-01-01"
today      = "2021-04-27"

def getData(ticker):
    data = pdr.get_data_yahoo(ticker, start=start_date, end=today)
    data["yahoo_ticker"] = ticker
    # fix faulty yahoo data that jumps 100x
    jumps_up   = data['Adj Close'] / data['Adj Close'].shift() >  50
    jumps_down = data['Adj Close'] / data['Adj Close'].shift() < .02
    correction_factor = 100.**(jumps_down.cumsum() - jumps_up.cumsum())
    data['Adj Close'] *= correction_factor
    print(f"Fixed {sum(correction_factor != 1)}/{len(data)} for ticker {ticker}"
          f" (min: {data['Adj Close'].min()}, max: {data['Adj Close'].max()})")
    return data

df = pd.concat([getData(tik) for tik in tickers])
df = df[["Adj Close", "yahoo_ticker"]]
print(df)

Output:

Fixed 5/587 for ticker 0P0001FE43.L (min: 97.52300262451172, max: 174.1580047607422)
Fixed 1/587 for ticker 0P00014IJX.L (min: 169.8939971923828, max: 376.5379943847656)
Fixed 288/586 for ticker SGLN.L (min: 2195.0, max: 3388.9999389648438)
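The cumulative-sum trick can be sanity-checked offline on a toy series (no download needed; the values here are made up):

```python
import pandas as pd

# toy series: the middle two closes were reported 100x too small
s = pd.Series([171.0, 1.72, 1.73, 173.0])

jumps_up   = s / s.shift() > 50    # price jumps back up: a faulty run ends
jumps_down = s / s.shift() < .02   # price drops 50x or more: a faulty run starts
# inside a faulty run, jumps_down.cumsum() is one ahead of jumps_up.cumsum(),
# so the exponent is 1 and the factor is 100; elsewhere it is 100**0 == 1
factor = 100.**(jumps_down.cumsum() - jumps_up.cumsum())
fixed = s * factor  # [171.0, 172.0, 173.0, 173.0]
```

Unlike a single-threshold check, this handles runs of consecutive faulty days, because the factor stays at 100 until an upward 50x jump closes the run.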

One approach is to simply spot differences of almost 100x; sadly, stocks can move a lot, so this might not always work:

import yfinance as yf
import pandas as pd
from pandas_datareader import data as pdr
yf.pdr_override()

tickers = ["0P0001FE43.L", "0P00014IJX.L", "SGLN.L"]

start_date = "2019-01-01"
today      = "2021-04-27"

def getData(ticker):
    data = pdr.get_data_yahoo(ticker, start=start_date, end=today)
    data["yahoo_ticker"] = ticker
    # fix data where it's more than 40 times the min
    mask = data['Adj Close'] < 40 * (data['Adj Close'].min())
    data['Adj Close'] = data['Adj Close'] * (1 + 99 * mask)
    print(f'Fixed {sum(mask)}/{len(mask)} datapoints for ticker {ticker}')
    return data

df = pd.concat([getData(tik) for tik in tickers])
df = df[["Adj Close", "yahoo_ticker"]]
print(df)

Output:

Fixed 5/587 datapoints for ticker 0P0001FE43.L
Fixed 1/587 datapoints for ticker 0P00014IJX.L
Fixed 288/586 datapoints for ticker SGLN.L

According to Yahoo Finance it's actually a mistake in the dataset, not a bug in the package. [Screenshot of Yahoo Finance's reply]

For occurrences where only a few days are affected, I recommend replacing the faulty rows with NaN and then using pandas.DataFrame.interpolate().

import numpy as np

# flag days where the close drops to <= 10% of the previous close
faulty = df['Adj Close'].le(df['Adj Close'].shift() * 0.1)
df.loc[faulty, 'Adj Close'] = np.nan
df['Adj Close'] = df['Adj Close'].interpolate()
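A self-contained toy example of the NaN-and-interpolate fix (the values are made up; note the ratio test only flags the first day of a faulty run, so this suits isolated one-day glitches):

```python
import numpy as np
import pandas as pd

# toy data: the second close was reported 100x too small
df = pd.DataFrame({"Adj Close": [171.0, 1.72, 173.0]})

faulty = df["Adj Close"].le(df["Adj Close"].shift() * 0.1)
df.loc[faulty, "Adj Close"] = np.nan
df["Adj Close"] = df["Adj Close"].interpolate()  # day 2 becomes 172.0
```

By default interpolate() is linear over the positional index; for daily data indexed by date, method="time" would weight by the actual gap between dates.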
