
Take columns from two dfs in a specific order and create a new df

I have two large dataframes. One (f2020) contains a set of information from January 2020. The other (f2021) contains the same information for January 2021. The dataframes have the same structure (same number of rows and columns, same key names, etc.), but the values differ.

I have used the fact that they have the same structure to loop over each item in f2021 and subtract the matching item in f2020. The result is added to f2021 as a new column named 'diff_' followed by the original key.

I have created an example, this is before any calculations are done:

f2021 = pd.DataFrame({'C3456R_[Ah]': {0: 2.5,
  1: 4.3, 2: 5.9},
 'C8734_[Ah]': {0: 1.9,
  1: 2.3, 2: 3.9},
 'ts': {0: pd.Timestamp('2020-01-01 02:00:00'),
  1: pd.Timestamp('2020-01-01 03:00:00'),
  2: pd.Timestamp('2020-01-01 04:00:00')}})

Then I do the calculations with values from f2020 and get a resulting f2021 that looks like this:

f2021 = pd.DataFrame({'C3456R_[Ah]': {0: 2.5,
  1: 4.3, 2: 5.9},
 'C8734_[Ah]': {0: 1.9,
  1: 2.3, 2: 3.9},
 'ts': {0: pd.Timestamp('2020-01-01 02:00:00'),
  1: pd.Timestamp('2020-01-01 03:00:00'),
  2: pd.Timestamp('2020-01-01 04:00:00')},
  'diff_C3456R_[Ah]': {0: 0.1,
  1: 0.7, 2: 0.2},
 'diff_C8734_[Ah]': {0: 0.1,
  1: 1.2, 2: 2.2}})

Now I would like to create a new df that, for every key, takes the original column from both f2021 and f2020, adds a suffix (_2021 and _2020), and then takes the 'diff' column for that key. The columns must be ordered like this:

'C3456R_[Ah]_2021', 'C3456R_[Ah]_2020', 'diff_C3456R_[Ah]', 'C8734_[Ah]_2021', 'C8734_[Ah]_2020', 'diff_C8734_[Ah]', ... etc.

and the order of the keys in the new df should follow the order of the original keys in f2021.

I tried solving this by building a list of column names in the order I want, looping over various if statements and appending to lists. My plan was to first give all keys in both frames suffixes and then merge. But this seems like a heavy-handed approach, and harder than it should be.

Is there a smooth way to do this?

Based on your comment, here are what I think are realistic test dataframes with shape (744, 361):

import numpy as np
import pandas as pd

# pandas >= 1.4 uses inclusive='left'; older versions spelled this closed='left'
f2020 = pd.DataFrame( { 'ts' : pd.date_range('2020-01-01','2020-02-01', freq='1H', inclusive='left') })
f2021 = pd.DataFrame( { 'ts' : pd.date_range('2021-01-01','2021-02-01', freq='1H', inclusive='left') })
for i in range(360):
    f2020[f"Col_{i}"] = np.random.random(len(f2020))
    f2021[f"Col_{i}"] = np.random.random(len(f2021))

I'll break things into distinct steps for clarity, but you can remove some of the intermediate steps if you want.

Because you are guaranteeing that the dataframes are exactly the same shape/columns, you can just directly subtract the dataframes and concatenate them.
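As a minimal sketch of why direct subtraction is enough, assuming both frames share the same index and column labels: pandas pairs cells by label, so `sub` subtracts each value from its counterpart. The toy frames `a` and `b` and their values are made up for illustration.

```python
import pandas as pd

# Two tiny stand-in frames with identical index and column labels
# (hypothetical values, not taken from the question).
a = pd.DataFrame({"x": [10, 20, 30], "y": [1, 2, 3]})
b = pd.DataFrame({"x": [1, 2, 3], "y": [1, 1, 1]})

# Subtraction aligns on labels, so check that the labels really match first;
# any unmatched row or column would otherwise come out as NaN.
assert a.columns.equals(b.columns) and a.index.equals(b.index)

# Element-wise difference, cell by cell.
d = a.sub(b)
print(d)
```

If the check fails on real data, something like resetting the index on both frames before subtracting is probably needed.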

First, some manipulation on column names, skipping the ts column for now:

base_cols = [c for c in f2021.columns if c != 'ts']
cols_2020 = [f"{c}_2020" for c in base_cols]
cols_2021 = [f"{c}_2021" for c in base_cols]
cols_diff = [f"{c}_diff" for c in base_cols]

Now make a timestamp-like column to use later. You can handle this however you like, but these would be strings:

ts = f2021['ts'].dt.strftime("%m-%d %H:%M:%S").to_frame('ts')
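To see what those strings look like, here is a small sketch on a few made-up timestamps (`ts_2020`, `ts_2021` and the `labels_*` names are illustrative only). Because the format leaves out the year, the same hour in January 2020 and January 2021 ends up with the same label.

```python
import pandas as pd

fmt = "%m-%d %H:%M:%S"  # month-day and time, no year

# Hypothetical sample timestamps, one pair per year.
ts_2020 = pd.Series(pd.to_datetime(["2020-01-01 02:00:00", "2020-01-31 23:00:00"]))
ts_2021 = pd.Series(pd.to_datetime(["2021-01-01 02:00:00", "2021-01-31 23:00:00"]))

labels_2020 = ts_2020.dt.strftime(fmt)
labels_2021 = ts_2021.dt.strftime(fmt)

print(labels_2021.tolist())             # ['01-01 02:00:00', '01-31 23:00:00']
print(labels_2020.equals(labels_2021))  # True
```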

Do the subtraction, but drop the original timestamps:

tmp2020 = f2020.drop(columns='ts')
tmp2021 = f2021.drop(columns='ts')
diff = tmp2021.sub(tmp2020)

Then worry about the column names:

tmp2020.columns = cols_2020
tmp2021.columns = cols_2021
diff.columns = cols_diff

Use pd.concat to bring them together (with the timestamp-like column from earlier). This is very fast:

result = pd.concat([ts, tmp2021, tmp2020, diff], axis=1)

Finally, reorder your columns:

import itertools
new_cols = list(itertools.chain.from_iterable(zip(cols_2021, cols_2020, cols_diff)))
result = result[['ts'] + new_cols]
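The reordering relies on `zip` grouping the three names that belong to one key and `itertools.chain.from_iterable` flattening those groups back into one list. A small sketch with made-up column names (`names_2021`, `names_2020`, `names_diff` are illustrative, not part of the code above):

```python
import itertools

# Hypothetical name lists, one entry per original key, all in the same order.
names_2021 = ["A_2021", "B_2021"]
names_2020 = ["A_2020", "B_2020"]
names_diff = ["A_diff", "B_diff"]

# zip yields one (2021, 2020, diff) triple per key;
# chain.from_iterable flattens the triples into a single sequence.
interleaved = list(
    itertools.chain.from_iterable(zip(names_2021, names_2020, names_diff))
)
print(interleaved)
# ['A_2021', 'A_2020', 'A_diff', 'B_2021', 'B_2020', 'B_diff']
```

Since the three lists are built from the same `base_cols`, the key order of f2021 is preserved.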

print(result.shape)
(744, 1081)

print(result.columns[:6])
Index(['ts', 'Col_0_2021', 'Col_0_2020', 'Col_0_diff', 'Col_1_2021',
       'Col_1_2020'],
      dtype='object')
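If you want the exact names from the question (a `diff_` prefix rather than a `_diff` suffix), the steps can be folded into one function. This is only a sketch: `compare_years` is a hypothetical helper, it assumes both frames share the same index and column names, and the sample values below are made up.

```python
import itertools
import pandas as pd


def compare_years(new, old, key="ts", new_tag="2021", old_tag="2020"):
    """Interleave new, old and diff columns (hypothetical helper)."""
    base_cols = [c for c in new.columns if c != key]
    new_vals = new[base_cols]
    old_vals = old[base_cols]          # same column order as `new`
    diff = new_vals.sub(old_vals)      # aligned on index and column labels

    new_named = new_vals.add_suffix(f"_{new_tag}")
    old_named = old_vals.add_suffix(f"_{old_tag}")
    diff_named = diff.add_prefix("diff_")

    combined = pd.concat([new[[key]], new_named, old_named, diff_named], axis=1)
    order = itertools.chain.from_iterable(
        zip(new_named.columns, old_named.columns, diff_named.columns)
    )
    return combined[[key, *order]]


# Made-up sample data using the key names from the question.
new = pd.DataFrame({
    "ts": pd.to_datetime(["2021-01-01 02:00", "2021-01-01 03:00"]),
    "C3456R_[Ah]": [2.5, 4.5],
    "C8734_[Ah]": [2.0, 2.5],
})
old = pd.DataFrame({
    "ts": pd.to_datetime(["2020-01-01 02:00", "2020-01-01 03:00"]),
    "C3456R_[Ah]": [2.0, 4.0],
    "C8734_[Ah]": [1.5, 1.0],
})

out = compare_years(new, old)
print(list(out.columns))
```

Here the `ts` column of the newer frame is kept unchanged; swap in the string version from earlier if you prefer year-free labels.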
