The Dataset I'm using is: https://www.kaggle.com/rohanrao/nifty50-stock-market-data
It contains stock market data from all NIFTY50 Companies since 2000 up to 2020. Each file contains the following columns: ['Date', 'Symbol', 'Series', 'Prev Close', 'Open', 'High', 'Low', 'Last', 'Close', 'VWAP', 'Volume', 'Turnover', 'Trades', 'Deliverable Volume', '%Deliverble']
I need to compile the 'Close' column from all the files into a single dataframe, with the Date as the index and each column named after its source file, i.e.,
Date ADANIPORTS ASIANPAINTS AXISBANK .....
2000-01-01 0 1500 300
2000-02-02 1 1600 400
...
Some of the files only have data from a later date (say 2007-01-01). Missing values of 'Close' should be listed as 0, i.e., 0 until the date when data becomes available.
Currently I'm using this code:
df = pd.DataFrame()
for filename in filenames:
    file = dir + filename + '.csv'
    data = pd.read_csv(file, usecols=lambda x: x in ['Date', 'Close'])
    data.rename(columns={'Close': filename}, inplace=True)
    data.set_index('Date', inplace=True)
    df.join(data, how='outer')
This returns an empty (0, 0) DataFrame for df.
Alternatively, I tried:
# Initialising df with GRASIM.csv, then joining the other dataframes
file01 = dir + "GRASIM" + '.csv'
df = pd.read_csv(file01, usecols=lambda x: x in ['Date', 'Close'])
df.rename(columns={'Close': "GRASIM"}, inplace=True)
df.set_index('Date', inplace=True)
for filename in filenames:
    file = dir + filename + '.csv'
    data = pd.read_csv(file, usecols=lambda x: x in ['Date', 'Close'])
    data.rename(columns={'Close': filename}, inplace=True)
    data.set_index('Date', inplace=True)
    df.join(data, how='outer')
But this returns only the initial dataframe, i.e.,
GRASIM
Date
2000-01-03 438.30
2000-01-04 437.15
... ...
The other columns are not added.
What seems to be the problem in this?
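A likely explanation for both symptoms: `DataFrame.join` returns a new DataFrame rather than modifying the caller in place, so every `df.join(data, how='outer')` inside the loop computes a joined frame and immediately discards it. A minimal demonstration (with small made-up frames standing in for the per-file data):

```python
import pandas as pd

# Tiny stand-in frames; the dates and values are illustrative only
a = pd.DataFrame({'X': [1, 2]}, index=['2000-01-03', '2000-01-04'])
b = pd.DataFrame({'Y': [3]}, index=['2000-01-03'])

# join returns a NEW frame and does not modify `a`,
# so this result is simply thrown away
a.join(b, how='outer')
assert 'Y' not in a.columns

# Reassigning keeps the joined result
a = a.join(b, how='outer')
assert 'Y' in a.columns
```

So changing the loop body to `df = df.join(data, how='outer')` should make both versions accumulate columns.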
One way around this is to use the zipfile module in Python:
from zipfile import ZipFile
import pandas as pd

# initialise an empty list to collect the per-file dataframes
df = []
with ZipFile('nifty50-stock-market-data.zip') as myzip:
    # iterate over the list of files in the zip
    for file in myzip.namelist():
        # open each file in the list
        with myzip.open(file) as myfile:
            # read the file with pandas, tag each frame with its
            # filename (minus the extension), and append it to the list;
            # all columns are read in, since some files
            # do not have Date or Close columns
            df.append(pd.read_csv(myfile)
                      .assign(filename=myfile.name.split('.')[0])
                      )

# concatenate everything and keep the three relevant columns
everything = pd.concat(df).filter(['Date', 'Close', 'filename'])
everything.head()
Date Close filename
0 2007-11-27 962.90 ADANIPORTS
1 2007-11-28 893.90 ADANIPORTS
2 2007-11-29 884.20 ADANIPORTS
3 2007-11-30 921.55 ADANIPORTS
4 2007-12-03 969.30 ADANIPORTS
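From this long format, the wide Date-by-symbol layout the question asks for is one pivot away. A sketch, using a small hypothetical stand-in for the `everything` frame above:

```python
import pandas as pd

# Hypothetical stand-in for the long-format `everything` frame
everything = pd.DataFrame({
    'Date': ['2007-11-27', '2007-11-28', '2007-11-27'],
    'Close': [962.90, 893.90, 1200.00],
    'filename': ['ADANIPORTS', 'ADANIPORTS', 'ASIANPAINT'],
})

# Pivot long -> wide: one column per symbol, Date as the index,
# and 0 wherever a symbol has no data yet
wide = (everything
        .pivot(index='Date', columns='filename', values='Close')
        .fillna(0)
        .sort_index())
```

Here `fillna(0)` handles the files that only start at a later date, as requested.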
I am not clear on what output you are looking for. Anyway, I'll explain what I did. First, I unzipped the files into a Kaggle folder on my C: drive, and then changed that to my current directory with os.chdir().
Then, I created a blank list to which the loop appends each file's dataframe before concatenating them.
In the loop, I read in the required columns and renamed the 'Close' column to the filename without its extension using os.path.splitext. Next, I appended the frame to the list created earlier. After that, everything gets concatenated together and I replaced NaNs with zero. I also included a commented-out line: if you change the column name, you can inspect any given column.
import os
import pandas as pd

os.chdir('C:/Kaggle')
data_list = []
for file in os.listdir():
    data = pd.read_csv(file, usecols=lambda x: x in ['Date', 'Close'])
    data.rename(columns={'Close': os.path.splitext(file)[0]}, inplace=True)
    data_list.append(data)
data = pd.concat(data_list, sort=False)
data = data.fillna(0)
# data = data.loc[data.ASIANPAINT != 0]
data
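Note that this row-wise concat leaves one row per (file, date) pair, so each Date can appear several times. If a single row per Date is wanted, the duplicates can be collapsed with a groupby. A sketch, with two hypothetical per-file frames standing in for the loop output:

```python
import pandas as pd

# Hypothetical per-file frames, mimicking two files read by the loop above
grasim = pd.DataFrame({'Date': ['2000-01-03', '2000-01-04'],
                       'GRASIM': [438.30, 437.15]})
axis   = pd.DataFrame({'Date': ['2000-01-04'],
                       'AXISBANK': [300.0]})

# Row-wise concat leaves one row per (file, date); collapsing the
# duplicates with first() (which skips NaN) gives one row per Date
data = pd.concat([grasim, axis], sort=False)
data = data.groupby('Date').first().fillna(0)
```

This also makes Date the index, matching the layout in the question.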