How can I groupby over multiple files in a folder in Python?

Question

I have a folder with 30 csvs. All of them have unique columns from one another with the exception of a single "UNITID" column. I'm looking to do a groupby function on that UNITID column across all the csvs.

Ultimately I want a single dataframe with all the columns next to each other for each UNITID.

Any thoughts on how I can do that?

Thanks in advance.

Answer 1

Perhaps you could merge the dataframes together, one at a time? Something like this:

import pandas as pd

# get a list of your csv paths somehow
# (get_filenames_of_csvs is a placeholder for however you collect the paths)
list_of_csvs = get_filenames_of_csvs()

# load the first csv file into a DF to start with
big_df = pd.read_csv(list_of_csvs[0])

# merge the other csvs into the first, one at a time
for csv_path in list_of_csvs[1:]:
    df = pd.read_csv(csv_path)
    big_df = big_df.merge(df, how="outer", on="UNITID")

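All the csvs will be merged together on UNITID. Because how="outer" is used, every UNITID that appears in any file is kept, and the result contains the union of all columns, with NaN wherever a file has no row for that UNITID.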
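For completeness, here is a minimal sketch of the same loop with the file listing filled in. It assumes all the csvs sit directly in one folder and end in ".csv"; the function name merge_csvs_on_key and its folder and key parameters are made up for illustration and do not come from the question.

```python
from pathlib import Path

import pandas as pd


def merge_csvs_on_key(folder, key="UNITID"):
    """Outer-merge every CSV in `folder` on the shared `key` column.

    `merge_csvs_on_key`, `folder` and `key` are hypothetical names.
    """
    # Sort the paths so the column order of the result is reproducible.
    csv_paths = sorted(Path(folder).glob("*.csv"))
    if not csv_paths:
        raise ValueError(f"no CSV files found in {folder!r}")

    # Start from the first file, then merge the rest in one at a time.
    big_df = pd.read_csv(csv_paths[0])
    for path in csv_paths[1:]:
        big_df = big_df.merge(pd.read_csv(path), how="outer", on=key)
    return big_df
```

Calling merge_csvs_on_key("path/to/your/folder") should then return one dataframe with a row per UNITID and the columns of every file side by side.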

Answer 2

An alternative one-liner to dustin's solution is to combine functools.reduce with DataFrame.merge(), like so:

from functools import reduce  # standard library, no need to pip install it
from pandas import DataFrame
# make some dfs

df1
   id col_one col_two
0   0       a       d
1   1       b       e
2   2       c       f
df2
   id col_three col_four
0   0         A        D
1   1         B        E
2   2         C        F
df3
   id  col_five  col_six
0   0         1        4
1   1         2        5
2   2         3        6

The one-liner:

reduce(lambda x, y: x.merge(y, on="id"), [df1, df2, df3])

   id col_one col_two col_three col_four  col_five  col_six
0   0       a       d         A        D         1        4
1   1       b       e         B        E         2        5
2   2       c       f         C        F         3        6
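One thing to keep in mind: DataFrame.merge defaults to how="inner", so the one-liner above only keeps ids present in every frame. A rough sketch of the same reduce adapted to the question's UNITID column, with how="outer" so no UNITID is dropped, might look like this. The frames and their values below are made up purely for illustration.

```python
from functools import reduce

import pandas as pd

# Three small frames that share only the UNITID column (made-up values).
df1 = pd.DataFrame({"UNITID": [100, 200, 300], "col_one": ["a", "b", "c"]})
df2 = pd.DataFrame({"UNITID": [100, 200, 400], "col_two": ["A", "B", "D"]})
df3 = pd.DataFrame({"UNITID": [200, 300, 400], "col_three": [1, 2, 3]})

# how="outer" keeps every UNITID that appears in any frame;
# cells with no matching row are filled with NaN.
merged = reduce(
    lambda left, right: left.merge(right, on="UNITID", how="outer"),
    [df1, df2, df3],
)
print(merged)
```

With 30 csvs, the list passed to reduce could be built by calling pd.read_csv on each file path instead of writing the frames out by hand.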

functools.reduce docs

pandas.DataFrame.merge docs
