
Stack dataframes in Pandas vertically and horizontally

I have a dataframe that looks like this:

  country region  region_id  year  doy  variable_a  num_pixels
0     USA   Iowa      12345  2022    1        32.2         100
1     USA   Iowa      12345  2022    2        12.2         100
2     USA   Iowa      12345  2022    3        22.2         100
3     USA   Iowa      12345  2022    4       112.2         100
4     USA   Iowa      12345  2022    5        52.2         100

The year in the dataframe above is 2022. I have more dataframes for other years, starting from 2010 onwards. I also have dataframes for other variables: variable_b, variable_c.

I want to combine all these dataframes into a single dataframe such that

  1. The years are listed vertically, one below the other
  2. The data for the different variables is listed horizontally. The output should look like this:
  country region  region_id  year  doy  variable_a  variable_b  variable_c
0     USA   Iowa      12345  2010    1        32.2          44         101
1     USA   Iowa      12345  2010    2        12.2          76        2332
..........................................................................
n-1   USA   Iowa      12345  2022    1       321.2         444         501
n     USA   Iowa      12345  2022    2       122.2         756          32

What is the most efficient way to achieve this? Please note that there will be overlap in years in the other dataframes, so the solution needs to take that into account and not leave NaN values.

IIUC, this should work for you:

import pandas as pd

data1 = {
    'country': {0: 'USA', 1: 'USA', 2: 'USA', 3: 'USA', 4: 'USA'},
    'region': {0: ' Iowa', 1: ' Iowa', 2: ' Iowa', 3: ' Iowa', 4: ' Iowa'},
    'region_id': {0: 12345, 1: 12345, 2: 12345, 3: 12345, 4: 12345},
    'year': {0: 2022, 1: 2022, 2: 2022, 3: 2022, 4: 2022},
    'doy': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5},
    'variable_a': {0: 32.2, 1: 12.2, 2: 22.2, 3: 112.2, 4: 52.2},
    'num_pixels': {0: 100, 1: 100, 2: 100, 3: 100, 4: 100}
}

data2 = {
    'country': {0: 'USB', 1: 'USB', 2: 'USB', 3: 'USB', 4: 'USB'},
    'region': {0: ' Iowb', 1: ' Iowb', 2: ' Iowb', 3: ' Iowb', 4: ' Iowb'},
    'region_id': {0: 12345, 1: 12345, 2: 12345, 3: 12345, 4: 12345},
    'year': {0: 2021, 1: 2021, 2: 2021, 3: 2021, 4: 2021},
    'doy': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5},
    'variable_b': {0: 32.2, 1: 12.2, 2: 22.2, 3: 112.2, 4: 52.2},
    'num_pixels': {0: 100, 1: 100, 2: 100, 3: 100, 4: 100}
}

data3 = {
    'country': {0: 'USC', 1: 'USC', 2: 'USC', 3: 'USC', 4: 'USC'},
    'region': {0: ' Iowc', 1: ' Iowc', 2: ' Iowc', 3: ' Iowc', 4: ' Iowc'},
    'region_id': {0: 12345, 1: 12345, 2: 12345, 3: 12345, 4: 12345},
    'year': {0: 2020, 1: 2020, 2: 2020, 3: 2020, 4: 2020},
    'doy': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5},
    'variable_c1': {0: 32.2, 1: 12.2, 2: 22.2, 3: 112.2, 4: 52.2},
    'variable_c2': {0: 32.2, 1: 12.2, 2: 22.2, 3: 112.2, 4: 52.2},
    'num_pixels': {0: 100, 1: 100, 2: 100, 3: 100, 4: 100}
}

df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
df3 = pd.DataFrame(data3)

dfn = [df1, df2, df3]

pd.concat(dfn, axis=0).sort_values(['year', 'country', 'region']).reset_index(drop=True)

Output:

[image: the concatenated dataframe]
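Since the three input frames are fully specified above, the shape and columns of the concatenated result can be sanity-checked directly; a minimal sketch (columns missing from a given frame simply come out as NaN):

result = pd.concat(dfn, axis=0).sort_values(['year', 'country', 'region']).reset_index(drop=True)

# The result carries the union of all columns across df1, df2 and df3.
print(result.shape)              # expected: (15, 10)
print(result.columns.tolist())
print(result[['year', 'variable_a', 'variable_b', 'variable_c1']].head())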

I'm not sure people are addressing the later parts of your question:

the data for the different variables is listed horizontally.

and

there will be overlap in years in the other dataframes so the solution needs to take that into account and not leave NaN values.

I think I understand, and this is my solution.

We start by creating a baby dataset of two years, five days each, with two variables.

import pandas as pd

# Baseline dummy data
data = {
    'country': {0: 'USA', 1: 'USA', 2: 'USA', 3: 'USA', 4: 'USA'},
    'region': {0: ' Iowa', 1: ' Iowa', 2: ' Iowa', 3: ' Iowa', 4: ' Iowa'},
    'region_id': {0: 12345, 1: 12345, 2: 12345, 3: 12345, 4: 12345},
    'year': {0: 2022, 1: 2022, 2: 2022, 3: 2022, 4: 2022},
    'doy': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5},
    'num_pixels': {0: 100, 1: 100, 2: 100, 3: 100, 4: 100}
}

# 2022 data with "a" data
df_2022_a = pd.DataFrame(data)
df_2022_a["variable_a"] = range(5)

# 2022 data with "b" data
df_2022_b = pd.DataFrame(data)
df_2022_b["variable_b"] = range(5, 10)

# 2021 data with "a" data
df_2021_a = pd.DataFrame(data)
df_2021_a["variable_a"] = range(10, 15)
df_2021_a["year"] = 2021

# 2021 data with "b" data
df_2021_b = pd.DataFrame(data)
df_2021_b["variable_b"] = range(15, 20)
df_2021_b["year"] = 2021

frames = [df_2022_a, df_2022_b, df_2021_a, df_2021_b]

# Get the columns that they all share. This is what we'll group by.
# You can hard-code this if you want
common_cols = list(set.intersection(*(set(df.columns) for df in frames)))

# Yes, go ahead and concatenate them together... but there's one more step!
df = pd.concat(frames)
df

[image: the concatenated dataframe, with duplicated days and NaN values]

Here, you're left with a lot of duplicate days and a lot of NaNs. Collapse your dataframe by doing something like the following:

output_df = (
    df
    .groupby(by=common_cols)  # Only keep distinct values for the common cols
    .max()                    # Max will prefer non-nan values over nans
    .reset_index()            # Collapse the multi-index
    .sort_values(common_cols) # Sort by all these to get it nice and orderly
    .reset_index(drop=True)   # Tidy up the dataframe index
)
output_df

[image: the collapsed output_df]

I believe this is the type of output that OP is asking for.

As for there being no NaNs in the final product, that will really depend on the data coverage over all variables for all years and days.
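Whether any NaNs actually remain can be checked directly on the result; a minimal sketch, assuming output_df from above:

# Count remaining NaN cells per column; all zeros means full coverage.
print(output_df.isna().sum())

# Or as a single yes/no answer:
print(output_df.isna().any().any())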

Use the pd.concat method to do this efficiently. It stacks all the dataframes vertically and also creates new columns for all the new variables.

Here is an example I created with duplicate data to show how pd.concat works.

CODE

import pandas as pd

df1 = pd.DataFrame({"country": ["USA", "USA", "USA"], "region": ["Iowa", "Iowa", "Iowa"],
                    "region_id": [12345, 12345, 12345], "year": [2022, 2022, 2022], "doy": [1, 2, 3],
                    "variable_a": [32.2, 12.2, 22.2], "num_pixles": [100, 100, 100]})

df2 = pd.DataFrame({"country": ["USA", "USA", "USA"], "region": ["Iowa", "Iowa", "Iowa"],
                    "region_id": [12345, 12345, 12345], "year": [2020, 2020, 2020], "doy": [1, 2, 3],
                    "variable_b": [54.2, 62.2, 2.2], "num_pixles": [100, 100, 100]})

df_list = [df1, df2]  # list of dataframes

res = pd.concat(df_list) # concat the list of dataframes
res = res.sort_values(by="year").reset_index(drop=True)  # To make sure that the rows are sorted based on year
print(res)

OUTPUT

      country region  region_id  year  doy  variable_a  num_pixels  variable_b
0     USA   Iowa      12345  2020    1         NaN         100        54.2
1     USA   Iowa      12345  2020    2         NaN         100        62.2
2     USA   Iowa      12345  2020    3         NaN         100         2.2
3     USA   Iowa      12345  2022    1        32.2         100         NaN
4     USA   Iowa      12345  2022    2        12.2         100         NaN
5     USA   Iowa      12345  2022    3        22.2         100         NaN
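If the no-NaN requirement for overlapping years matters, one possible follow-up step, along the lines of the groupby approach shown earlier, is a sketch like this (key columns assumed from the example above; note that in this toy data the years do not overlap, so NaNs for genuinely missing variables will remain):

key_cols = ['country', 'region', 'region_id', 'year', 'doy', 'num_pixels']

res_collapsed = (
    res.groupby(key_cols)   # one row per distinct key combination
       .max()               # max() prefers non-NaN values where rows overlap
       .reset_index()
       .sort_values(key_cols)
       .reset_index(drop=True)
)
print(res_collapsed)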

All methods

@René

res = pd.concat(dfn, axis=0).sort_values(['year', 'country', 'region']).reset_index(drop=True)

Timing (average of 1000 runs):

0.003601184606552124

@vignesh kanakavalli

res = pd.concat(df_list)
res = res.sort_values(by="year").reset_index(drop=True)

Timing (average of 1000 runs):

0.002223911762237549

@DialFrost

from itertools import chain
import numpy as np

def fast_flatten(input_list, df):
    # Chain the column values from every frame that has the column, then pad
    # with NaN so each output column has the same length (the * 3 assumes
    # three input frames of equal length).
    r = list(chain.from_iterable(input_list))
    r += [np.nan] * (len(df.index) * 3 - len(r))
    return list(r)

def combine_lists(frames):
    # Build the union of all column names, then assemble each column directly
    # instead of calling pd.concat.
    COLUMN_NAMES = [frames[i].columns for i in range(len(frames))]
    COL_NAMES = list(set(list(chain(*COLUMN_NAMES))))
    df_dict = dict.fromkeys(COL_NAMES, [])
    for col in COL_NAMES:
        extracted = (frame[col] for frame in frames if col in frame.columns.tolist())
        df_dict[col] = fast_flatten(extracted, frames[0])
    return pd.DataFrame.from_dict(df_dict)[COL_NAMES]

res = combine_lists(dfn)
res = res.sort_values(by="year").reset_index(drop=True)

Timing (average of 1000 runs):

0.0021250741481781007

Explanation of my code:

Here I used a trick. Instead of using pd.concat, I went for appending the columns directly, using a slightly modified version of the approach found on GitHub. Especially for larger dataframes, this is enough to beat pd.concat and win on efficiency.


All tests use the same dataframes:

import pandas as pd

data1 = {
    'country': {0: 'USA', 1: 'USA', 2: 'USA', 3: 'USA', 4: 'USA'},
    'region': {0: ' Iowa', 1: ' Iowa', 2: ' Iowa', 3: ' Iowa', 4: ' Iowa'},
    'region_id': {0: 12345, 1: 12345, 2: 12345, 3: 12345, 4: 12345},
    'year': {0: 2022, 1: 2022, 2: 2022, 3: 2022, 4: 2022},
    'doy': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5},
    'variable_a': {0: 32.2, 1: 12.2, 2: 22.2, 3: 112.2, 4: 52.2},
    'num_pixels': {0: 100, 1: 100, 2: 100, 3: 100, 4: 100}
}

data2 = {
    'country': {0: 'USB', 1: 'USB', 2: 'USB', 3: 'USB', 4: 'USB'},
    'region': {0: ' Iowb', 1: ' Iowb', 2: ' Iowb', 3: ' Iowb', 4: ' Iowb'},
    'region_id': {0: 12345, 1: 12345, 2: 12345, 3: 12345, 4: 12345},
    'year': {0: 2021, 1: 2021, 2: 2021, 3: 2021, 4: 2021},
    'doy': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5},
    'variable_b': {0: 32.2, 1: 12.2, 2: 22.2, 3: 112.2, 4: 52.2},
    'num_pixels': {0: 100, 1: 100, 2: 100, 3: 100, 4: 100}
}

data3 = {
    'country': {0: 'USC', 1: 'USC', 2: 'USC', 3: 'USC', 4: 'USC'},
    'region': {0: ' Iowc', 1: ' Iowc', 2: ' Iowc', 3: ' Iowc', 4: ' Iowc'},
    'region_id': {0: 12345, 1: 12345, 2: 12345, 3: 12345, 4: 12345},
    'year': {0: 2020, 1: 2020, 2: 2020, 3: 2020, 4: 2020},
    'doy': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5},
    'variable_c1': {0: 32.2, 1: 12.2, 2: 22.2, 3: 112.2, 4: 52.2},
    'variable_c2': {0: 32.2, 1: 12.2, 2: 22.2, 3: 112.2, 4: 52.2},
    'num_pixels': {0: 100, 1: 100, 2: 100, 3: 100, 4: 100}
}

df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
df3 = pd.DataFrame(data3)

dfn = [df1, df2, df3]

All tests used this timing control:

import time

lst = []
for x in range(1000):
    start = time.time()
    .
    .
    .
    end = time.time()
    lst.append(end - start)
print(sum(lst)/len(lst))
print(res)
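For what it's worth, the standard library's timeit module is a less manual alternative for the same measurement; a minimal sketch, using the first approach as the statement under test:

import timeit

def run_concat():
    return pd.concat(dfn, axis=0).sort_values(['year', 'country', 'region']).reset_index(drop=True)

# Average seconds per run over 1000 runs.
print(timeit.timeit(run_concat, number=1000) / 1000)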
