I am trying to download historical prices from Yahoo Finance for a few mutual funds and ETFs. I think there is a bug in Yahoo Finance that confuses the scales of the prices (I don't think it's a split issue; see the pictures below).
In any case, here is an MWE to reproduce the problem:
import yfinance as yf
import pandas as pd
from pandas_datareader import data as pdr
yf.pdr_override()
tickers = ["0P0001FE43.L", "0P00014IJX.L", "SGLN.L"]
start_date = "2019-01-01"
today = "2021-04-27"
files = []

def getData(ticker):
    data = pdr.get_data_yahoo(ticker, start=start_date, end=today)
    data["yahoo_ticker"] = ticker
    files.append(data)

for tik in tickers:
    getData(tik)

df = pd.concat(files)
df = df[["Adj Close", "yahoo_ticker"]]
A closer look at the adjusted prices shows that some days are off by a factor of 100 compared with their neighbours.
I couldn't think of any systematic way to correct for this problem, so I would appreciate any help.
To catch and fix one-day events I usually check for very big one-day gaps. Something like the following, which compares each close with the previous and the next one, should fix the 1/100 or 100x problems, but nothing prevents you from setting another threshold or using multiple thresholds at once. Comparing against both neighbours avoids applying a correction when a gap in the historical series is due to a badly handled split.
Code:
fixed = [l[0]]
i = 1
while i < len(l) - 1:
    p = l[i] / l[i-1]   # ratio of this close to the previous one
    n = l[i+1] / l[i]   # ratio of the next close to this one
    if p < 0.1 and n >= 100:
        fixed.append(l[i] * 100)            # single day reported 100x too small
    elif p >= 100 and n < 0.1:
        fixed.append(round(l[i] / 100, 5))  # single day reported 100x too big
    else:
        fixed.append(l[i])
    i += 1

# Last value: compare it with the last fixed one
p = fixed[-1] / l[-1]
if p >= 100:
    fixed.append(l[i] * 100)
elif p < 0.1:
    fixed.append(round(l[i] / 100, 5))
else:
    fixed.append(l[i])
Output:

l = [171.062, 1.71945, 172.901, 172.184]
[171.062, 171.945, 172.901, 172.184]

l = [1.71062, 1.71945, 172.901, 1.72184]
[1.71062, 1.71945, 1.72901, 1.72184]
While this works, if you can download all the (adjusted) OHLC fields for a security (I use Bloomberg Professional as data provider and I don't know if Yahoo supports it), you should cross-check them against each other instead of comparing only each close with its neighbours. That is a more robust approach, and at the same time you can check whether O and C fall inside the H-L range and catch other more common but subtle errors.
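As a rough sketch of that cross-field idea (the column names `Open`/`High`/`Low`/`Close` and the sample values are hypothetical; adapt them to whatever your provider returns):

```python
import pandas as pd

def check_ohlc(df):
    """Flag rows whose OHLC fields are mutually inconsistent."""
    bad_range = df['High'] < df['Low']
    open_out = (df['Open'] > df['High']) | (df['Open'] < df['Low'])
    close_out = (df['Close'] > df['High']) | (df['Close'] < df['Low'])
    # A 100x scaling error in a single field usually breaks at least one check
    return bad_range | open_out | close_out

rows = pd.DataFrame({
    'Open':  [171.0, 1.719, 172.9],   # second open is on the 1/100 scale
    'High':  [171.5, 172.4, 173.1],
    'Low':   [170.8, 171.6, 172.0],
    'Close': [171.1, 172.0, 172.9],
})
print(check_ohlc(rows).tolist())  # → [False, True, False]
```

Only the row whose open falls outside its own high-low range is flagged, so an ordinary large daily move would pass untouched.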
I found another solution that I think is more robust than my other answer. It does not look at the whole history but instead looks for single-day jumps of 50x or more, which should never occur "naturally".
import yfinance as yf
import pandas as pd
from pandas_datareader import data as pdr
yf.pdr_override()
tickers = ["0P0001FE43.L", "0P00014IJX.L", "SGLN.L"]
start_date = "2019-01-01"
today = "2021-04-27"
def getData(ticker):
    data = pdr.get_data_yahoo(ticker, start=start_date, end=today)
    data["yahoo_ticker"] = ticker
    # fix faulty yahoo data that jumps 100x
    jumps_up = data['Adj Close'] / data['Adj Close'].shift() > 50
    jumps_down = data['Adj Close'] / data['Adj Close'].shift() < .02
    correction_factor = 100.**(jumps_down.cumsum() - jumps_up.cumsum())
    data['Adj Close'] *= correction_factor
    print(f"Fixed {sum(correction_factor != 1)}/{len(data)} for ticker {ticker}"
          f" (min: {data['Adj Close'].min()}, max: {data['Adj Close'].max()})")
    return data

df = pd.concat([getData(tik) for tik in tickers])
df = df[["Adj Close", "yahoo_ticker"]]
print(df)
Fixed 5/587 for ticker 0P0001FE43.L (min: 97.52300262451172, max: 174.1580047607422)
Fixed 1/587 for ticker 0P00014IJX.L (min: 169.8939971923828, max: 376.5379943847656)
Fixed 288/586 for ticker SGLN.L (min: 2195.0, max: 3388.9999389648438)
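To see why the cumulative sum of jump counts works, here is the same trick on a hypothetical toy series in which two consecutive days are reported at 1/100 scale: each downward jump raises the running correction exponent by one, and the matching upward jump lowers it back, so only the faulty stretch gets multiplied by 100.

```python
import pandas as pd

# Toy series: days 3 and 4 are reported at 1/100 of their true scale
s = pd.Series([170.0, 171.0, 1.72, 1.73, 173.0])

jumps_up = s / s.shift() > 50      # return to the correct scale
jumps_down = s / s.shift() < .02   # drop into the faulty 1/100 scale
factor = 100.0 ** (jumps_down.cumsum() - jumps_up.cumsum())
print((s * factor).tolist())       # all values back on the same scale
```

The exponent series here is [0, 0, 1, 1, 0], so the two mis-scaled values are multiplied by 100 and everything else is left alone.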
One approach is simply to spot differences of almost 100x; unfortunately, stocks can move a lot, so this might not always work:
import yfinance as yf
import pandas as pd
from pandas_datareader import data as pdr
yf.pdr_override()
tickers = ["0P0001FE43.L", "0P00014IJX.L", "SGLN.L"]
start_date = "2019-01-01"
today = "2021-04-27"
def getData(ticker):
    data = pdr.get_data_yahoo(ticker, start=start_date, end=today)
    data["yahoo_ticker"] = ticker
    # flag values below 40x the series minimum: these are assumed
    # to be on the faulty 1/100 scale, so multiply them by 100
    mask = data['Adj Close'] < 40 * (data['Adj Close'].min())
    data['Adj Close'] = data['Adj Close'] * (1 + 99 * mask)
    print(f'Fixed {sum(mask)}/{len(mask)} datapoints for ticker {ticker}')
    return data

df = pd.concat([getData(tik) for tik in tickers])
df = df[["Adj Close", "yahoo_ticker"]]
print(df)
Fixed 5/587 datapoints for ticker 0P0001FE43.L
Fixed 1/587 datapoints for ticker 0P00014IJX.L
Fixed 288/586 datapoints for ticker SGLN.L
According to Yahoo Finance, it's actually a mistake in the dataset, not a bug in the package.
For occurrences where only a few days are affected, I recommend replacing the faulty rows with NaN and then using pandas.DataFrame.interpolate():
import numpy as np

# flag closes that fell below 10% of the previous close
faulty = df['Adj Close'].le(df['Adj Close'].shift() * 0.1)
df.loc[faulty, 'Adj Close'] = np.nan
df['Adj Close'] = df['Adj Close'].interpolate()
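On a hypothetical three-day series with one faulty close, the NaN-and-interpolate approach fills the gap linearly from its neighbours:

```python
import numpy as np
import pandas as pd

# Toy series: the middle close is on the faulty 1/100 scale
s = pd.Series([171.0, 1.72, 173.0])

faulty = s.le(s.shift() * 0.1)  # close fell below 10% of the previous close
s[faulty] = np.nan
print(s.interpolate().tolist())  # → [171.0, 172.0, 173.0]
```

Note that this replaces the faulty value with an interpolated estimate rather than rescaling it, which is fine for a handful of isolated days but would distort long faulty stretches like the SGLN.L case above.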