How to concat/join columns from multiple csv files into 1 DataFrame()?

The Dataset I'm using is: https://www.kaggle.com/rohanrao/nifty50-stock-market-data

It contains stock market data from all NIFTY50 Companies since 2000 up to 2020. Each file contains the following columns: ['Date', 'Symbol', 'Series', 'Prev Close', 'Open', 'High', 'Low', 'Last', 'Close', 'VWAP', 'Volume', 'Turnover', 'Trades', 'Deliverable Volume', '%Deliverble']

I need to compile the 'Close' columns from all the files into a single DataFrame, with Date as the index and the filename as each column name, i.e.,

Date                       ADANIPORTS          ASIANPAINTS       AXISBANK .....
2000-01-01                     0               1500               300
2000-02-02                     1               1600               400
...     

Some of the files only have data from a later date (say 01-01-2007). Where 'Close' is missing it should be listed as 0, i.e., 0 until the date when data becomes available.

Currently I'm using this code.

df=pd.DataFrame()
for filename in filenames:
    file=dir+filename+'.csv'
    data = pd.read_csv(file,usecols=lambda x: x in ['Date', 'Close'])
    data.rename(columns = {'Close':filename}, inplace = True)
    data.set_index('Date',inplace=True)
    df.join(data, how='outer')

This leaves df as an empty (0, 0) DataFrame.

I also tried:

#Initialising df with GRASIM.csv, and then using join for the other dataframes
file01 = dir + "GRASIM" + '.csv'
df=pd.read_csv(file01,usecols=lambda x: x in ['Date', 'Close'])
df.rename(columns = {'Close':"GRASIM"}, inplace = True)
df.set_index('Date',inplace = True)

for filename in filenames:
    file=dir+filename+'.csv'
    data = pd.read_csv(file,usecols=lambda x: x in ['Date', 'Close'])
    data.rename(columns = {'Close':filename}, inplace = True)
    data.set_index('Date',inplace=True)
    df.join(data, how='outer')

But this returns only the initially initialized dataframe, i.e.,

          GRASIM
Date              
2000-01-03  438.30
2000-01-04  437.15
...            ...

The other columns are not added.

What seems to be the problem in this?

One way around this is to use the zipfile module in Python:

import pandas as pd
from zipfile import ZipFile

#initialize an empty list to collect the dataframes
df = []
with ZipFile('nifty50-stock-market-data.zip') as myzip:
    #get the list of files in the zip
    for file in myzip.namelist():
        #read each file in the list
        with myzip.open(file) as myfile:
            #read the file with pandas,
            #add the filename as a column,
            #and append the result to the df list
            #all columns are read in, since some files
            #do not have date or close columns
            df.append(pd.read_csv(myfile)
                      .assign(filename = myfile.name.split('.')[0])
                      )

#concatenate everything and filter for the three relevant columns
everything = pd.concat(df).filter(['Date','Close','filename'])

everything.head()

        Date     Close  filename
0   2007-11-27  962.90  ADANIPORTS
1   2007-11-28  893.90  ADANIPORTS
2   2007-11-29  884.20  ADANIPORTS
3   2007-11-30  921.55  ADANIPORTS
4   2007-12-03  969.30  ADANIPORTS 
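
To get from this long table to the layout asked for in the question (Date as the index, one column per file), the frame can probably be pivoted. This is a minimal sketch, assuming each (Date, filename) pair occurs only once. The small `everything` frame below is made-up sample data standing in for the real concatenated result, and `wide` is just an illustrative name:

```python
import pandas as pd

# Made-up sample standing in for the concatenated `everything` frame
everything = pd.DataFrame({
    "Date": ["2000-01-03", "2000-01-04", "2000-01-04"],
    "Close": [438.30, 437.15, 962.90],
    "filename": ["GRASIM", "GRASIM", "ADANIPORTS"],
})

# One row per Date, one column per filename; dates a file
# does not cover become NaN and are then replaced with 0
wide = (
    everything
    .pivot(index="Date", columns="filename", values="Close")
    .fillna(0)
)
```

If a file contained duplicate dates, `pivot` would raise an error, and `pivot_table` with an explicit aggregation might be a better fit.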

I am not clear on what output you are looking for, but here is what I did. First, I unzipped the files into a Kaggle folder on my C: drive and made that my current directory with os.chdir(). Then I created an empty list to collect the dataframes from the loop so they can be concatenated afterwards.

In the loop, I read in only the required columns and renamed the 'Close' column to the filename without its extension, using os.path.splitext. Next, I appended each dataframe to the list created earlier. After that, everything is concatenated and NaNs are replaced with zero. I also included a commented-out line: if you change the column name in it, you can inspect any given column.

import os
import pandas as pd
os.chdir('C:/Kaggle')
data_list=[]
for file in os.listdir():
    data=pd.read_csv(file, usecols=lambda x: x in ['Date', 'Close'])
    data.rename(columns = {'Close':os.path.splitext(file)[0]}, inplace = True)
    data_list.append(data)
data = pd.concat(data_list, sort=False)
data = data.fillna(0)
# data = data.loc[data.ASIANPAINT !=0]
data
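
As for the loops in the question: `DataFrame.join` returns a new DataFrame rather than modifying `df`, so its result would have to be assigned back (`df = df.join(data, how='outer')`) for the columns to accumulate. An alternative that gives one row per date is to index each file by Date and concatenate along the columns. The sketch below assumes this is the wanted output; the `csv_files` dict holds made-up in-memory stand-ins for two of the CSV files so the example runs without the dataset:

```python
import io
import pandas as pd

# Hypothetical in-memory stand-ins for two of the CSV files
csv_files = {
    "GRASIM": "Date,Close\n2000-01-03,438.30\n2000-01-04,437.15\n",
    "ADANIPORTS": "Date,Close\n2000-01-04,962.90\n",
}

columns = []
for name, text in csv_files.items():
    # Read only Date and Close, with Date as the index
    close = pd.read_csv(io.StringIO(text),
                        usecols=["Date", "Close"],
                        index_col="Date")["Close"]
    # Name the Series after its file so it becomes the column name
    columns.append(close.rename(name))

# axis=1 aligns the Series on the Date index (outer join by default);
# dates missing from a file become NaN and are replaced with 0
df = pd.concat(columns, axis=1).fillna(0).sort_index()
```

With the real files, `io.StringIO(text)` would be replaced by each file's path.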
