Stack dataframes in Pandas vertically and horizontally
I have a dataframe that looks like this:
country region region_id year doy variable_a num_pixels
0 USA Iowa 12345 2022 1 32.2 100
1 USA Iowa 12345 2022 2 12.2 100
2 USA Iowa 12345 2022 3 22.2 100
3 USA Iowa 12345 2022 4 112.2 100
4 USA Iowa 12345 2022 5 52.2 100
The year in the dataframe above is 2022. I have more dataframes for other years, from 2010 onwards. I also have dataframes for other variables: variable_b, variable_c.
I want to combine all these dataframes into a single dataframe such that
country region region_id year doy variable_a variable_b variable_c
0 USA Iowa 12345 2010 1 32.2 44 101
1 USA Iowa 12345 2010 2 12.2 76 2332
..........................................................................
n-1 USA Iowa 12345 2022 1 321.2 444 501
n USA Iowa 12345 2022 2 122.2 756 32
What is the most efficient way to achieve this? Please note that there will be overlap in years in the other dataframes, so the solution needs to take that into account and not leave NaN values.
IIUC, this should work for you:
import pandas as pd

data1 = {
'country': {0: 'USA', 1: 'USA', 2: 'USA', 3: 'USA', 4: 'USA'},
'region': {0: ' Iowa', 1: ' Iowa', 2: ' Iowa', 3: ' Iowa', 4: ' Iowa'},
'region_id': {0: 12345, 1: 12345, 2: 12345, 3: 12345, 4: 12345},
'year': {0: 2022, 1: 2022, 2: 2022, 3: 2022, 4: 2022},
'doy': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5},
'variable_a': {0: 32.2, 1: 12.2, 2: 22.2, 3: 112.2, 4: 52.2},
'num_pixels': {0: 100, 1: 100, 2: 100, 3: 100, 4: 100}
}
data2 = {
'country': {0: 'USB', 1: 'USB', 2: 'USB', 3: 'USB', 4: 'USB'},
'region': {0: ' Iowb', 1: ' Iowb', 2: ' Iowb', 3: ' Iowb', 4: ' Iowb'},
'region_id': {0: 12345, 1: 12345, 2: 12345, 3: 12345, 4: 12345},
'year': {0: 2021, 1: 2021, 2: 2021, 3: 2021, 4: 2021},
'doy': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5},
'variable_b': {0: 32.2, 1: 12.2, 2: 22.2, 3: 112.2, 4: 52.2},
'num_pixels': {0: 100, 1: 100, 2: 100, 3: 100, 4: 100}
}
data3 = {
'country': {0: 'USC', 1: 'USC', 2: 'USC', 3: 'USC', 4: 'USC'},
'region': {0: ' Iowc', 1: ' Iowc', 2: ' Iowc', 3: ' Iowc', 4: ' Iowc'},
'region_id': {0: 12345, 1: 12345, 2: 12345, 3: 12345, 4: 12345},
'year': {0: 2020, 1: 2020, 2: 2020, 3: 2020, 4: 2020},
'doy': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5},
'variable_c1': {0: 32.2, 1: 12.2, 2: 22.2, 3: 112.2, 4: 52.2},
'variable_c2': {0: 32.2, 1: 12.2, 2: 22.2, 3: 112.2, 4: 52.2},
'num_pixels': {0: 100, 1: 100, 2: 100, 3: 100, 4: 100}
}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
df3 = pd.DataFrame(data3)
dfn = [df1, df2, df3]
pd.concat(dfn, axis=0).sort_values(['year', 'country', 'region']).reset_index(drop=True)
Output:
I'm not sure people are hearing the second parts of your question:
the data for the different variables is listed horizontally.
and
there will be overlap in years in the other dataframes so the solution needs to take that into account and not leave NaN values.
I think I understand, and this is my solution.
We start by creating a baby dataset of two years, five days each, with two variables.
import pandas as pd
# Baseline dummy data
data = {
'country': {0: 'USA', 1: 'USA', 2: 'USA', 3: 'USA', 4: 'USA'},
'region': {0: ' Iowa', 1: ' Iowa', 2: ' Iowa', 3: ' Iowa', 4: ' Iowa'},
'region_id': {0: 12345, 1: 12345, 2: 12345, 3: 12345, 4: 12345},
'year': {0: 2022, 1: 2022, 2: 2022, 3: 2022, 4: 2022},
'doy': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5},
'num_pixels': {0: 100, 1: 100, 2: 100, 3: 100, 4: 100}
}
# 2022 data with "a" data
df_2022_a = pd.DataFrame(data)
df_2022_a["variable_a"] = range(5)
# 2022 data with "b" data
df_2022_b = pd.DataFrame(data)
df_2022_b["variable_b"] = range(5, 10)
# 2021 data with "a" data
df_2021_a = pd.DataFrame(data)
df_2021_a["variable_a"] = range(10, 15)
df_2021_a["year"] = 2021
# 2021 data with "b" data
df_2021_b = pd.DataFrame(data)
df_2021_b["variable_b"] = range(15, 20)
df_2021_b["year"] = 2021
frames = [df_2022_a, df_2022_b, df_2021_a, df_2021_b]
# Get the columns that they all share. This is what we'll group by.
# You can hard-code this if you want
common_cols = list(set.intersection(*(set(df.columns) for df in frames)))
# Yes, go ahead and concatenate them together... but there's one more step!
df = pd.concat(frames)
df
Here, you're left with a lot of duplicate days and a lot of NaNs. Collapse your dataframe by doing something like the following:
output_df = (
df
.groupby(by=common_cols) # Only keep distinct values for the common cols
.max() # Max will prefer non-nan values over nans
.reset_index() # Collapse the multi-index
.sort_values(common_cols) # Sort by all these to get it nice and orderly
.reset_index(drop=True) # Tidy up the dataframe index
)
output_df
I believe this is the type of output that OP is asking for.
As for there being no NaNs in the final product, that'll really depend on the data coverage over all variables for all years and days.
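An alternative to collapsing after the concat, sketched under the assumption that each frame holds one variable for one year and that country, region, region_id, year and doy together identify a row: stack each variable's yearly frames vertically, then join the per-variable results horizontally on those keys. The names `KEYS`, `make_frame` and `combine` are illustrative and not from the question, and num_pixels is left out because the desired output does not show it.

```python
from functools import reduce

import pandas as pd

KEYS = ["country", "region", "region_id", "year", "doy"]  # assumed row identifiers


def make_frame(year, variable, values):
    """Build a small single-variable, single-year frame of dummy data."""
    n = len(values)
    return pd.DataFrame({
        "country": ["USA"] * n,
        "region": ["Iowa"] * n,
        "region_id": [12345] * n,
        "year": [year] * n,
        "doy": list(range(1, n + 1)),
        variable: values,
    })


def combine(frames_by_variable):
    """Concat each variable's yearly frames, then outer-merge them on KEYS."""
    stacked = [
        pd.concat(frames, ignore_index=True)
        for frames in frames_by_variable.values()
    ]
    merged = reduce(
        lambda left, right: left.merge(right, on=KEYS, how="outer"), stacked
    )
    return merged.sort_values(KEYS).reset_index(drop=True)


frames_by_variable = {
    "variable_a": [
        make_frame(2021, "variable_a", [1.0, 2.0]),
        make_frame(2022, "variable_a", [3.0, 4.0]),
    ],
    "variable_b": [
        make_frame(2021, "variable_b", [5.0, 6.0]),
        make_frame(2022, "variable_b", [7.0, 8.0]),
    ],
}

result = combine(frames_by_variable)
print(result)
```

With an outer merge, a year or day that is missing for one variable still shows up as NaN in that variable's column, so the coverage point above applies here too.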
Use the pd.concat method to do this efficiently. It stacks all the data frames vertically and creates new columns for any variables that appear in only some of them, filling the missing entries with NaN.
Here is an example of how pd.concat works, which I created with duplicate data.
CODE
import pandas as pd
df1 = pd.DataFrame({"country": ["USA", "USA", "USA"], "region": ["Iowa", "Iowa", "Iowa"],
"region_id": [12345, 12345, 12345], "year": [2022, 2022, 2022], "doy": [1, 2, 3],
"variable_a": [32.2, 12.2, 22.2], "num_pixels": [100, 100, 100]})
df2 = pd.DataFrame({"country": ["USA", "USA", "USA"], "region": ["Iowa", "Iowa", "Iowa"],
"region_id": [12345, 12345, 12345], "year": [2020, 2020, 2020], "doy": [1, 2, 3],
"variable_b": [54.2, 62.2, 2.2], "num_pixels": [100, 100, 100]})
df_list = [df1, df2] # list of dataframes
res = pd.concat(df_list) # concat the list of dataframes
res = res.sort_values(by="year").reset_index(drop=True) # To make sure that the rows are sorted based on year
print(res)
OUTPUT
country region region_id year doy variable_a num_pixels variable_b
0 USA Iowa 12345 2020 1 NaN 100 54.2
1 USA Iowa 12345 2020 2 NaN 100 62.2
2 USA Iowa 12345 2020 3 NaN 100 2.2
3 USA Iowa 12345 2022 1 32.2 100 NaN
4 USA Iowa 12345 2022 2 12.2 100 NaN
5 USA Iowa 12345 2022 3 22.2 100 NaN
res = pd.concat(dfn, axis=0).sort_values(['year', 'country', 'region']).reset_index(drop=True)
Timing (average of 1000 runs):
0.003601184606552124
res = pd.concat(df_list)
res = res.sort_values(by="year").reset_index(drop=True)
Timing (average of 1000 runs):
0.002223911762237549
from itertools import chain

import numpy as np
import pandas as pd

def fast_flatten(input_list, n_rows):
    # Chain the column's values from every frame that has it, then pad with NaN.
    # The padding goes at the end, so values are not aligned to their source rows.
    r = list(chain.from_iterable(input_list))
    r += [np.nan] * (n_rows - len(r))
    return r

def combine_lists(frames):
    col_names = list(set(chain.from_iterable(frame.columns for frame in frames)))
    n_rows = sum(len(frame.index) for frame in frames)  # total rows after stacking
    df_dict = {}
    for col in col_names:
        extracted = (frame[col] for frame in frames if col in frame.columns)
        df_dict[col] = fast_flatten(extracted, n_rows)
    return pd.DataFrame.from_dict(df_dict)[col_names]

res = combine_lists(dfn)
res = res.sort_values(by="year").reset_index(drop=True)
Timing (1000 runs):
0.0021250741481781007
Explanation of my code:
Here I used a trick. Instead of using pd.concat, I decided to go for appending. Especially for larger dataframes, use the appending method found on GitHub here (I used a slightly modified version of the code). In the timings above it narrowly beats pd.concat.
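One caveat of padding each column at its end is that values from a later frame do not stay on that frame's rows. A minimal sketch of a row-aligned variant follows, assuming plain Python lists are acceptable as the intermediate form. `combine_aligned` is a hypothetical name, and this variant has not been timed against the numbers above.

```python
from itertools import chain

import numpy as np
import pandas as pd


def combine_aligned(frames):
    """Stack frames vertically via lists, padding missing columns per frame."""
    # Union of column names, in first-seen order
    col_names = list(dict.fromkeys(chain.from_iterable(f.columns for f in frames)))
    data = {}
    for col in col_names:
        # For each frame, take its values or a NaN block of the same length,
        # so every column keeps one entry per source row.
        pieces = (
            f[col].tolist() if col in f.columns else [np.nan] * len(f)
            for f in frames
        )
        data[col] = list(chain.from_iterable(pieces))
    return pd.DataFrame(data, columns=col_names)


df_a = pd.DataFrame({"year": [2022, 2022], "doy": [1, 2], "variable_a": [32.2, 12.2]})
df_b = pd.DataFrame({"year": [2021, 2021], "doy": [1, 2], "variable_b": [54.2, 62.2]})

res = combine_aligned([df_a, df_b])
print(res)
```

Here variable_b is NaN on the two 2022 rows and filled on the two 2021 rows, the same layout pd.concat produces for these frames.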
All tests use the same dataframes:
import pandas as pd
data1 = {
'country': {0: 'USA', 1: 'USA', 2: 'USA', 3: 'USA', 4: 'USA'},
'region': {0: ' Iowa', 1: ' Iowa', 2: ' Iowa', 3: ' Iowa', 4: ' Iowa'},
'region_id': {0: 12345, 1: 12345, 2: 12345, 3: 12345, 4: 12345},
'year': {0: 2022, 1: 2022, 2: 2022, 3: 2022, 4: 2022},
'doy': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5},
'variable_a': {0: 32.2, 1: 12.2, 2: 22.2, 3: 112.2, 4: 52.2},
'num_pixels': {0: 100, 1: 100, 2: 100, 3: 100, 4: 100}
}
data2 = {
'country': {0: 'USB', 1: 'USB', 2: 'USB', 3: 'USB', 4: 'USB'},
'region': {0: ' Iowb', 1: ' Iowb', 2: ' Iowb', 3: ' Iowb', 4: ' Iowb'},
'region_id': {0: 12345, 1: 12345, 2: 12345, 3: 12345, 4: 12345},
'year': {0: 2021, 1: 2021, 2: 2021, 3: 2021, 4: 2021},
'doy': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5},
'variable_b': {0: 32.2, 1: 12.2, 2: 22.2, 3: 112.2, 4: 52.2},
'num_pixels': {0: 100, 1: 100, 2: 100, 3: 100, 4: 100}
}
data3 = {
'country': {0: 'USC', 1: 'USC', 2: 'USC', 3: 'USC', 4: 'USC'},
'region': {0: ' Iowc', 1: ' Iowc', 2: ' Iowc', 3: ' Iowc', 4: ' Iowc'},
'region_id': {0: 12345, 1: 12345, 2: 12345, 3: 12345, 4: 12345},
'year': {0: 2020, 1: 2020, 2: 2020, 3: 2020, 4: 2020},
'doy': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5},
'variable_c1': {0: 32.2, 1: 12.2, 2: 22.2, 3: 112.2, 4: 52.2},
'variable_c2': {0: 32.2, 1: 12.2, 2: 22.2, 3: 112.2, 4: 52.2},
'num_pixels': {0: 100, 1: 100, 2: 100, 3: 100, 4: 100}
}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
df3 = pd.DataFrame(data3)
dfn = [df1, df2, df3]
All tests used this timing control:
import time

lst = []
for x in range(1000):
    start = time.time()
    ...  # code under test
    end = time.time()
    lst.append(end - start)
print(sum(lst) / len(lst))
print(res)
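For runs this short, the standard library's time.perf_counter is the clock intended for measuring short durations. A sketch of the same averaging loop built on it follows, assuming the code under test is wrapped in a function. `average_runtime` and `build` are hypothetical names, and the two small frames are stand-ins for the test data.

```python
import time

import pandas as pd


def average_runtime(func, runs=1000):
    """Return the mean wall-clock time of func() over `runs` calls, in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        func()
        durations.append(time.perf_counter() - start)
    return sum(durations) / len(durations)


df_a = pd.DataFrame({"year": [2022, 2022], "doy": [1, 2], "variable_a": [32.2, 12.2]})
df_b = pd.DataFrame({"year": [2021, 2021], "doy": [1, 2], "variable_b": [54.2, 62.2]})


def build():
    # The statement being timed: concat, sort by year, and rebuild the index
    return pd.concat([df_a, df_b]).sort_values(by="year").reset_index(drop=True)


avg = average_runtime(build, runs=100)
print(avg)
```

Differences of a few tenths of a millisecond on frames this small can fall within run-to-run noise, so repeating the measurement on realistically sized data gives a more useful comparison.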