I have a folder with 30 csvs. All of them have unique columns from one another with the exception of a single "UNITID" column. I'm looking to do a groupby function on that UNITID column across all the csvs.
Ultimately I want a single dataframe with all the columns next to each other for each UNITID.
Any thoughts on how I can do that?
Thanks in advance.
Perhaps you could merge the dataframes together, one at a time? Something like this:
# get a list of your csv paths somehow
list_of_csvs = get_filenames_of_csvs()
# load the first csv file into a DF to start with
big_df = pd.read_csv(list_of_csvs[0])
# merge to other csvs into the first, one at a time
for csv in list_of_csvs[1:]:
df = pd.read_csv(csv)
big_df = big_df.merge(df, how="outer", on="UNITID")
All the csvs will be merged together based on UNITID, preserving the union of all columns.
An alternative one-liner to dustin's solution would be the combination of the functool's reduce function and DataFrame.merge()
like so,
from functools import reduce # standard library, no need to pip it
from pandas import DataFrame
# make some dfs
df1
id col_one col_two
0 0 a d
1 1 b e
2 2 c f
df2
id col_three col_four
0 0 A D
1 1 B E
2 2 C F
df3
id col_five col_six
0 0 1 4
1 1 2 5
2 2 3 6
The one-liner:
reduce(lambda x,y: x.merge(y, on= "id"), [df1, df2, df3])
id col_one col_two col_three col_four col_five col_six
0 0 a d A D 1 4
1 1 b e B E 2 5
2 2 c f C F 3 6
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.