
Merging multiple dataframes that have different columns except for 5 shared keys

I read 1,424 dataframes into a list like so:

import os
df = []
i = 0
for filename in os.listdir(output_path):
    if filename.endswith(".csv"):
        df.append(pd.read_csv(os.path.join(output_path, filename)))
    else:
        continue

I want to merge all of them together. Here is an example I want to emulate using only 2 dfs:

df1 = pd.read_csv('../output/2009/census_data_2009_p1.csv')
df2 = pd.read_csv('../output/2009/census_data_2009_p2.csv')
df1 = df1.merge(df2, how = 'left', on = ['Location+Type', 'Year', 'state', 'Census_tract', 'County_name'])

How would I do the latter but for all of the dataframes in the df list? Specifically, I want to left-join all the dataframes using the keys 'Location+Type', 'Year', 'state', 'Census_tract', 'County_name'.

I am currently getting this error even though I have 64 GB of RAM:

The kernel appears to have died. It will restart automatically.

This occurs when I run either this code:

from functools import reduce
df_merged = reduce(lambda l, r: pd.merge(l, r, 
                                         how='left',
                                         on=['Location+Type', 
                                             'Year',
                                             'state',
                                             'Census_tract',
                                             'County_name']), df)

or this code

for dfi in df:
    dfi.set_index(
        ["Location+Type", "Year", "state", "Census_tract", "County_name"], inplace=True
    )

df[0].join(df[1:], how="left")

Try using set_index and join:

for dfi in df:
    dfi.set_index(
        ["Location+Type", "Year", "state", "Census_tract", "County_name"], inplace=True
    )

df[0].join(df[1:], how="left")
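
Two caveats on this approach: join with a list of frames raises a ValueError if any non-key column name repeats across them (here the files differ in all but the 5 keys, so it should be safe), and the keys end up in the index. A short sketch to restore them as regular columns afterwards:

merged = df[0].join(df[1:], how="left").reset_index()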

First, a little cleanup of the first block of code:

import os
dfs = []
for filename in os.listdir(output_path):
    if filename.endswith(".csv"):
        dfs.append(pd.read_csv(os.path.join(output_path, filename)))

To combine a list of DataFrames into a single DataFrame:

pd.concat(dfs, join='inner')

The join='inner' keeps only the columns common to all DataFrames in the list.

A short demo:

df1 = pd.DataFrame(data=[[1,2,3], [2,3,1]], columns=['a', 'b', 'c'])

    a   b   c
0   1   2   3
1   2   3   1

df2 = pd.DataFrame(data=[[1,2,3], [2,3,1]], columns=['b', 'c', 'd'])

    b   c   d
0   1   2   3
1   2   3   1

pd.concat([df1, df2], join='inner')

    b   c
0   2   3
1   3   1
0   1   2
1   2   3

Note the resulting index. If required, you can use reset_index() to reset the index.
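
For example, continuing the demo above:

pd.concat([df1, df2], join='inner').reset_index(drop=True)

    b   c
0   2   3
1   3   1
2   1   2
3   2   3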

I believe one of the cleanest options is to fold the merge operation over the list with reduce:

from functools import reduce
df_merged = reduce(lambda l, r: pd.merge(l, r, 
                                         how='left',
                                         on=['Location+Type', 
                                             'Year',
                                             'state',
                                             'Census_tract',
                                             'County_name']), df)

This assumes that the dataframes are sorted in the desired way, however.
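
If the order matters, a minimal sketch (assuming the desired order is encoded in the filenames) is to sort the directory listing before reading:

import os
import pandas as pd

# sorted() fixes the merge order; os.listdir returns entries in arbitrary order
df = [
    pd.read_csv(os.path.join(output_path, name))
    for name in sorted(os.listdir(output_path))
    if name.endswith(".csv")
]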

A more memory-efficient way to do this (but arguably less clean) is to simply iterate over the dataframes:

df_merged = df[0].copy()  # we use the initial dataframe to start
del df[0]
while df:
    df_merged = df_merged.merge(df[0],
                                how='left',
                                on=['Location+Type',
                                    'Year',
                                    'state',
                                    'Census_tract',
                                    'County_name'])
    # this frees up the contents of the just-merged dataframe
    del df[0]
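
Deleting each list entry drops the last reference to that frame, so Python can reclaim the memory as the loop advances. If memory is still tight, you can optionally force a collection pass (a sketch, rarely needed):

import gc

gc.collect()  # explicitly reclaim memory from the deleted dataframes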
Try setting the key columns as the index and joining by index:

# The join keys
columns = ['Location+Type', 'Year', 'state', 'Census_tract', 'County_name']

# Set the indexes in order to join by index
# (set_index returns a new frame, so rebuild the list with the results)
df = [my_df.set_index(columns) for my_df in df]

# Join the dataframes (join also returns a new frame, so reassign the result)
res_df = df[0]
for index in range(1, len(df)):
    res_df = res_df.join(df[index], how='outer')

# Or, to simply stack the frames on the common columns
pd.concat(df, join='outer')

First of all, I would use a generator to save your machine's memory. Execution time could be longer, but your machine will process only one file at a time:

import os
import pandas as pd


def census_dataframes():
    for filename in os.listdir(output_path):
        if filename.endswith(".csv"):
            yield pd.read_csv(os.path.join(output_path, filename))


dataframes = census_dataframes()

# get the first dataframe from the generator
df1 = next(dataframes)

for frame in dataframes:
    df1 = df1.merge(frame, how='left', on=['Location+Type', 'Year', 'state', 'Census_tract', 'County_name'])

If the above approach does not bring results, then check the size of your output dataframe. To work effectively, you need at least 2x more memory than the dataframe itself requires.
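
To get a rough estimate of that footprint (a quick sketch using the merged result from above):

# approximate in-memory size of the merged dataframe, in gigabytes
print(df1.memory_usage(deep=True).sum() / 1024**3)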

Further memory savings can be achieved by optimizing the datatypes while reading the csv, e.g.:

yield pd.read_csv(os.path.join(output_path, filename), dtype={'a': np.float32, 'b': np.int32, 'c': ...})

In case you have text entries that repeat frequently in a column (like 'Man', 'Female', 'Not Disclosed', ...), you can convert them to categories and save a significant amount of memory. However, doing this across a large population of files requires prior preparation and upfront definition of the categories.
Please refer to the pandas documentation on the topic 'Categorical Data'.
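
A minimal sketch of reading with a predefined categorical dtype (the column name 'sex' and its category values are hypothetical, for illustration only):

import pandas as pd

# define the categories once so every file is read with the identical dtype
sex_dtype = pd.CategoricalDtype(categories=['Man', 'Female', 'Not Disclosed'])

frame = pd.read_csv('census_data_2009_p1.csv', dtype={'sex': sex_dtype})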
