
How can I efficiently merge these many CSV files (around 130,000) into one large dataset using PySpark?

I posted this question earlier and got some advice to use PySpark instead.

How can I merge this large dataset into one large dataframe efficiently?

The following zip file ( https://fred.stlouisfed.org/categories/32263/downloaddata/INTRNTL_csv_2.zip ) contains a folder called data with around 130,000 CSV files. I want to merge all of them into one single dataframe. I have 16 GB of RAM and I keep running out of RAM when I hit the first few hundred files. The files' total size is only about 300-400 MB of data.

If you open up any of the csv files, you can see that they all have the same format, the first column is for dates, and the second column is for the data series.

So now I am using PySpark instead; however, I have no idea what the most efficient way to combine all the files is. With pandas dataframes I would just concat the list of individual frames like this, because I want them to merge on the dates:

bigframe = pd.concat(listofframes,join='outer', axis=0)

But like I mentioned, this method doesn't work as I run out of RAM really fast.

What would be the best way to do something similar using PySpark?

So far I have this (by the way, the filelist below is just a list of the files which I want to pull out, you can ignore that):


import os
from functools import reduce

import pandas as pd

from pyspark.sql import SparkSession, DataFrame
from pyspark.sql.functions import col

spark = SparkSession.builder.appName('spark-dataframe-demo').getOrCreate()

listdf = []

for subdir, dirs, files in os.walk("/kaggle/input/filelist/"):
    for file in files:
        path = os.path.join(subdir, file)
        print(file)
        # Each Excel file lists (in its "File" column) the CSV files I want to pull out.
        filelist = pd.read_excel("/kaggle/input/filelist/" + file)

        for row in filelist.File.items():
            # row is an (index, filename) tuple; file[:-5] strips the ".xlsx" extension.
            df = spark.read.csv(f"/kaggle/input/master/{file[:-5]}/{file[:-5]}/data/" + row[1], inferSchema=True, header=True)
            # Rename VALUE to the CSV file name (minus ".csv") so each series keeps a distinct column name.
            df = df.select(col("DATE"), col("VALUE").alias(row[1][:-4]))
            df.show(3)
            listdf.append(df)

I stop the code after it appends about 10 frames. But when I try the code below, it just has one column of data; it doesn't merge properly.

bigframe = reduce(DataFrame.join(listdf, ['DATE'], how='full'))

But I am only left with two columns of data: the date and the first item in the list of Spark frames.

How do I merge everything into one frame properly? I want the dates to be the index that the other columns merge on. Meaning, if one frame has:

Date        TimeSeries1
1 Jan 2012  12345
2 Jan 2012  23456

and the other has

Date        TimeSeries2
1 Jan 2012  5678
3 Jan 2012  8910

I want the final product to be

Date        TimeSeries1 TimeSeries2
1 Jan 2012  12345       5678
2 Jan 2012  23456
3 Jan 2012              8910

Also, to identify the columns, the column names have to be changed to the name of the source file.
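
From what I understand, functools.reduce expects a two-argument function that is applied pairwise over the list, so presumably the join would look roughly like this minimal sketch (assuming each frame in listdf has a DATE column plus one value column already renamed to its file name), though I have not been able to confirm this works at scale:

from functools import reduce

# Pairwise full outer join on DATE; every non-DATE column is assumed to already
# have a unique name (the source file name), so nothing gets clobbered.
bigframe = reduce(lambda left, right: left.join(right, on='DATE', how='full'), listdf)
bigframe.show(3)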

Spark can read data from multiple files by default as long as they share the same schema.

To process each timeseries separately, you can group the dataframe by filename and use a pandas UDF to process each group.

import glob as g
import pyspark.sql.functions as F

@F.pandas_udf("date Date, value DECIMAL(38,4)", F.PandasUDFType.GROUPED_MAP)
def transform(pdf):
  # pdf will be a pandas dataframe for each timeseries
  # apply timeseries computations here and return a new dataframe
  # with aggregated values
  return pdf

paths = g.glob("./INTRNTL_csv_2/data/**/*.csv", recursive=True)

df = spark.read.csv(paths, header=False, schema="date DATE, value DECIMAL(38,4)")


res = df.withColumn('name', F.input_file_name())
res = res.groupBy('name').apply(transform)
res.show()
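
If the wide layout from the question (one row per date and one column per file) is the end goal, the long dataframe can also be pivoted on the file name. A minimal sketch, reusing the name column produced by input_file_name() above; note that pivoting over ~130,000 distinct file names is expensive and may require raising spark.sql.pivotMaxValues, so keeping the data in long format is usually preferable:

import pyspark.sql.functions as F

# Long format: one row per (file, date), as read above.
long_df = df.withColumn('name', F.input_file_name())

# Use the last path segment as the series name, then pivot to one column per file.
wide = (long_df
        .withColumn('series', F.element_at(F.split('name', '/'), -1))
        .groupBy('date')
        .pivot('series')
        .agg(F.first('value')))
wide.show(3)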

There is a lot going on here, but if I can distill it down to the need to merge data from 130k CSV files into one single DataFrame and capture the file name for each row, you can do it like this.

from pyspark.sql.functions import input_file_name
from pyspark.sql import SQLContext
from pyspark.sql.types import *

sqlContext = SQLContext(sc)  # assumes an existing SparkContext `sc`

customSchema = StructType([
    StructField("asset_id", StringType(), True),
    StructField("price_date", StringType(), True),
    # ... additional StructFields for the remaining columns ...
    StructField("close_price", StringType(), True),
    StructField("filename", StringType(), True)])

fullPath = 'mnt/INTRNTL_csv_2/data/??/*.csv'

df = spark.read.format("csv") \
    .option("header", "false") \
    .option("sep", "|") \
    .schema(customSchema) \
    .load(fullPath) \
    .withColumn("filename", input_file_name())

Notice: the very first line of code and the very last line of code are used to get the file names. Also, pay attention to the wildcards: '?' matches one single character (a letter or a number) and '*' matches any number of characters (any combination of letters and numbers).
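
As a follow-up, input_file_name() returns the full URI of each source file. If only the base file name is wanted in the filename column, it can be trimmed afterwards; a small sketch, where the regular expression is an assumption about the path layout:

from pyspark.sql.functions import regexp_extract

# Keep only the base name, e.g. 'file:///mnt/.../ABC123.csv' -> 'ABC123'.
df = df.withColumn('filename', regexp_extract('filename', r'([^/]+)\.csv$', 1))
df.select('filename').distinct().show(5, truncate=False)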
