
For loop to merge pandas dataframe with common columns

I have 25 dataframes, each with 7 ascending dates as rows and between 570 and 600 airport names as columns. The big problem is that the dataframes store the number of ascensions each airport has each day, so in weeks when certain airports are inactive the dataframes end up with different sets and numbers of airport columns. The column names appear in alphabetical order within each dataframe, but the absence of just one airport column throws off the alignment of the entire master dataframe.

I have tried merge, concat, join, and update. This problem is really complicated. My end goal is a master dataframe with all existing airports as alphabetically ordered columns, and with rows that keep growing as the dates ascend and time passes.

I think I have to write a for loop to do this, with these requirements:

1. No data can be lost.
2. Dataframes are merged by column: if a column name in the second dataframe matches one in the first, the new data is added below that column without repeating the column name.
3. If a column name in the second dataframe is not in the first, it is added as a new column (hopefully in alphabetical order).
4. If the second dataframe lacks a column that the first one has, that airport shows NaN for those rows.

In sum, the for loop should add data under identical columns (even when the dataframes have their columns in different orders), add columns that weren't previously there, fill in NaN where airports are missing, and make sure the column names appear only once, as the header row. Sorry, it is so hard to explain.

Here are two simple example dataframes that I want the for loop to be able to merge:

import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                   columns=['Airport1', 'Airport3', 'Airport4'])
df1.index = ['11/01', '11/02', '11/03']
df1.index.name = 'Dates'  # name the index after assigning it, or the name is lost
df2 = pd.DataFrame(np.array([[2, 4, 6], [8, 10, 12], [14, 16, 18]]),
                   columns=['Airport1', 'Airport2', 'Airport3'])
df2.index = ['11/04', '11/05', '11/06']
df2.index.name = 'Dates'
display(df1, df2)  # display() is provided by Jupyter/IPython

Dates  Airport1  Airport3  Airport4
11/01         1         2         3
11/02         4         5         6
11/03         7         8         9

Dates  Airport1  Airport2  Airport3
11/04         2         4         6
11/05         8        10        12
11/06        14        16        18

The result I want the for loop to produce is:

Dates  Airport1  Airport2  Airport3  Airport4
11/01         1       NaN         2         3
11/02         4       NaN         5         6
11/03         7       NaN         8         9
11/04         2         4         6       NaN
11/05         8        10        12       NaN
11/06        14        16        18       NaN

One more note: I have 25 dataframes to merge and counting, so I would like the for loop to handle any number of dataframes. Thanks so much in advance!

If I understand correctly, you can use pd.concat together with DataFrame.sort_index:

df = pd.concat([df1, df2]).sort_index(axis=1)

For more than two dataframes, use:

from functools import reduce

dfs = [df1, df2]  # list of all dataframes that need to be combined
df = reduce(lambda d1, d2: pd.concat([d1, d2]), dfs).sort_index(axis=1)
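
As a side note, pd.concat accepts the whole list in one call, so the reduce step looks optional. The sketch below rebuilds the two example frames from the question and checks that both approaches give the same frame; the names df_direct and df_reduce are only illustrative.

```python
from functools import reduce

import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                   columns=['Airport1', 'Airport3', 'Airport4'],
                   index=['11/01', '11/02', '11/03'])
df2 = pd.DataFrame(np.array([[2, 4, 6], [8, 10, 12], [14, 16, 18]]),
                   columns=['Airport1', 'Airport2', 'Airport3'],
                   index=['11/04', '11/05', '11/06'])

dfs = [df1, df2]  # the list can hold any number of dataframes

# One call: rows are stacked, columns are unioned by name, and any
# airport missing from a frame is filled with NaN for that frame's rows.
df_direct = pd.concat(dfs).sort_index(axis=1)

# Pairwise version from the answer, kept for comparison.
df_reduce = reduce(lambda d1, d2: pd.concat([d1, d2]), dfs).sort_index(axis=1)

print(df_direct.equals(df_reduce))
```

With many frames the single call should also avoid re-copying the growing intermediate result at every step. Note that sort_index(axis=1) orders column names as plain strings.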

Result:

# print(df)

       Airport1  Airport2  Airport3  Airport4
11/01         1       NaN         2       3.0
11/02         4       NaN         5       6.0
11/03         7       NaN         8       9.0
11/04         2       4.0         6       NaN
11/05         8      10.0        12       NaN
11/06        14      16.0        18       NaN
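
Airport2 and Airport4 print as floats (4.0, 3.0) in the result above because NaN forces those columns to float64. If whole-number counts are preferred, one option is pandas' nullable integer dtype. This is a minimal sketch, assuming every value is a whole number; the name counts is only illustrative.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                   columns=['Airport1', 'Airport3', 'Airport4'],
                   index=['11/01', '11/02', '11/03'])
df2 = pd.DataFrame(np.array([[2, 4, 6], [8, 10, 12], [14, 16, 18]]),
                   columns=['Airport1', 'Airport2', 'Airport3'],
                   index=['11/04', '11/05', '11/06'])

df = pd.concat([df1, df2]).sort_index(axis=1)
print(df.dtypes)  # columns that contain NaN are float64

# 'Int64' (capital I) is the nullable integer dtype; missing values
# are shown as <NA> instead of NaN and the counts stay integers.
counts = df.astype('Int64')
print(counts)
```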
