简体   繁体   中英

How to merge many CSV files?

I have about 7500 csv that needs to be merged into a single one in order to create an easy-readable table. The files format are as follow:

  1. each file is denominated as a ticker of a stock (example: AA.csv, AAL.csv, AAPL.csv,etc...)

  2. every file itself contains a date and a number in this format

    2018-10-11,1 2018-10-12,3 2018-10-15,2...

Now I want to merge them into a single csv file where the resulting table has in the header the name of the tickers, in the first column the dates and following the numbers (obviously keeping the csv format).

Example:

示例截图

Note that some csv files are empty and some have differents starting dates or dates are just discontinued

You could do something like this:

import pandas as pd
import numpy as np
from glob import glob

dfs_list = []
for csv_file in glob('Tickers List/*.csv'):
    stock_ticker = csv_file.split(".")[0]
    df = pd.read_csv(csv_file,header=None, names=["date","num"])
    if df.shape[0] >0:
        df["date"] = pd.to_datetime(df["date"],format="%Y-%m-%d")
        df["stock_ticker"] = stock_ticker
        dfs_list.append(df)

final_df = pd.concat(dfs_list)

With glob('dir/*.csv') we get all the csv files on a folder.

After this, you will have a DataFrame that looks like this: 在此处输入图像描述

If you want to change to your format, you can do this: (note that the dates are automatically sorted because is used as index).

final_df = pd.pivot_table(final_df, values='num', index=['date'],
               columns=['stock_ticker'], fill_value=np.nan)

And you will have a DataFrame that looks like this:

在此处输入图像描述

Now you can write this DataFrame to a new csv with:

final_df.to_csv("merged.csv")

FULL CODE

import pandas as pd
import numpy as np
from glob import glob

dfs_list = []
for csv_file in glob('Tickers List/*.csv'):
    stock_ticker = csv_file.split(".")[0]
    df = pd.read_csv(csv_file,header=None, names=["date","num"])
    if df.shape[0] >0:
        df["date"] = pd.to_datetime(df["date"],format="%Y-%m-%d")
        df["stock_ticker"] = stock_ticker
        dfs_list.append(df)

final_df = pd.concat(dfs_list)

final_df = pd.pivot_table(final_df, values='num', index=['date'],
               columns=['stock_ticker'], fill_value=np.nan)

final_df.to_csv("merged.csv")

Since you have non-uniform csvs, and some specific requirements:

  • certain format to output the csv
  • some csv files are empty
  • some csv files have different start dates
  • some csv files have discontinues dates

I'd do something like this with the csv module:

import os,csv
csv_dir = "C:\\Users\Diego\\PycharmProjects\\thesis\\Tickers List\\"
csv_out = 'csvout.csv'

all_dates = []
fin_table = dict()

# iterating over all csv files
for csv_file in os.listdir(csv_dir):
    if True:

        csv_name = csv_file #[:-4] # excluding the ".csv" extension 
        table = dict()
        
        print(csv_dir+csv_file)
        with open(csv_dir+csv_file, 'r') as r:
            csvReader = csv.reader(r)

            for i, row in enumerate(csvReader):
                all_dates.append(row[0])
                table[row[0]] = row[1]
        fin_table[csv_name] = table
    

# populating the final csv 
with open(csv_out, 'w+') as w:
    csvWriter = csv.writer(w)

    # adding the header
    input_row = ['Dates']
    for stock, table in iter(sorted(fin_table.items())):
        input_row.append(stock)
    csvWriter.writerow(input_row)

    # Adding the rest of the table by available date 
    # since, some dates may not have a value, as you said   
    for date in sorted(all_dates): 
        
        # adding the date rows and stock values 1 by 1
        input_row = [date]
        for stock, table in iter(sorted(fin_table.items())):
            # NOVALUE will be put in cells  
            # where a value doesn't exist for a date
            input_row.append(table.get(date, 'NOVALUE')) 
        csvWriter.writerow(input_row)

# ************************** end of program code

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM