
How can I combine multiple Pandas dataframes that contain time series data for different locations, into a single X-array?

I started with a downloaded .csv file whose columns are 'Series Name', then 'Country Name' (with the country rows repeated for each series), followed by one column per annual timestep. I have split this into a set of Pandas dataframes, each indexed by Country Name and holding the data for a single Series Name (variable) by timestep. Each dataframe is saved as a Pickle file named after its Series Name.

For example:

¦ Country name ¦ 1960 ¦ 1961 ¦ 1962 ¦ ... ¦ 2021 ¦
¦ Albania      ¦ 1000 ¦ 1001 ¦ 1002 ¦ ... ¦ 1061 ¦
¦ Andorra      ¦ 2000 ¦ 2001 ¦ 2001 ¦ ... ¦ 2061 ¦
etc.

[Image: example dataframe]

I would like to build an xarray object from these, using 'Country Name' as one coordinate and 'Year' as another (time) coordinate, and then add the data from each series as a separate variable.

The Country Name index column and the 'Year' column headers are the same in every dataframe. However, the number of countries differs from the number of years.

I am stuck on what I think should be the first step: putting a single-series Pandas dataframe into xarray.

I have tried the code below, both with a 'times' coordinate generated from a list and with one taken from the current header row of the dataframe. However, the code seems to want the two coordinates to be the same length, which they aren't: these are not x, y pairs for a single point. Instead, (country1, time1) has a value, (country2, time2) has a value, (country1, time2) has a value, and so on.

import xarray as xr

year = list(range(1960, 2020))
data = series1  # single-series dataframe loaded from its Pickle file
locs = series1['Country Name']
times = year
array = xr.DataArray(data, coords=[times, locs], dims=["time", "space"])

This gives the error:

ValueError: conflicting sizes for dimension 'time': length 217 on the data but length 60 on coordinate 'time'

[Image: full error traceback]

The ultimate aim is to combine and compare this data with spatial time series data from a NetCDF file that I have imported, hence wanting to use xarray.

I think the issue is that data in your example above is a one-dimensional array, since it comes from a Series object. Based on the DataFrame example you provided, I think something like the following would achieve what you want (perhaps modulo some data cleaning to convert the string year labels to integers, etc.):

melted = df.melt(
    id_vars=["Country Name"],
    value_vars=[f"{year} [YR{year}]" for year in range(1960, 1965)],
    var_name="year",
    value_name="population"
)
ds = melted.set_index(["Country Name", "year"]).to_xarray()
  • df.melt stacks the individual population columns for each year into a single column, creates a new year column to represent the year coordinate, and repeats the country column so that each population value in the stacked column is labelled with its country.
  • melted.set_index then creates a MultiIndex out of the "Country Name" and "year" columns. This allows pandas and xarray to recognize, when to_xarray is called, that this DataFrame is really a Dataset with a 2D data variable.
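As a minimal, self-contained sketch of the melt, set_index and to_xarray steps above, the following uses a made-up two-country table. The values, the "population" variable name and the "1960 [YR1960]" column label format are assumptions for illustration (the label format is borrowed from the snippet above), so adjust them to match the real dataframe. It also shows one possible way to do the data cleaning mentioned above, by keeping the first four characters of each label and converting them to an integer.

```python
import pandas as pd

# Toy stand-in for one wide, single-series table (values are invented).
df = pd.DataFrame(
    {
        "Country Name": ["Albania", "Andorra"],
        "1960 [YR1960]": [1000, 2000],
        "1961 [YR1961]": [1001, 2001],
        "1962 [YR1962]": [1002, 2002],
    }
)

# Stack the year columns into long form: one row per (country, year) pair.
melted = df.melt(
    id_vars=["Country Name"],
    var_name="year",
    value_name="population",
)

# Data cleaning: turn labels such as "1960 [YR1960]" into the integer 1960.
melted["year"] = melted["year"].str.slice(0, 4).astype(int)

# The two-level index becomes the two dimensions of the resulting Dataset.
ds = melted.set_index(["Country Name", "year"]).to_xarray()
print(ds)
```

The two dimensions end up with different lengths (2 countries, 3 years here), which is the situation the question describes. A single value can then be looked up with something like ds["population"].sel({"Country Name": "Andorra", "year": 1961}).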

As Spencer Clark writes, the function df.melt will help you with exactly this task.

I think you should accept his answer, or clarify where further issues arise.
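For the second half of the question (adding each series as a separate variable), one possible extension of the df.melt approach is sketched below. It assumes every per-series dataframe is indexed by 'Country Name' and has string year labels as columns. The frames dictionary, the to_long helper, and the "population" and "gdp" names and values are hypothetical stand-ins for the dataframes loaded from the Pickle files.

```python
import pandas as pd

years = ["1960", "1961", "1962"]
countries = pd.Index(["Albania", "Andorra"], name="Country Name")

# Hypothetical stand-ins for the per-series dataframes (values are invented).
frames = {
    "population": pd.DataFrame(
        [[1000, 1001, 1002], [2000, 2001, 2002]], index=countries, columns=years
    ),
    "gdp": pd.DataFrame(
        [[1.0, 1.5, 2.0], [3.0, 3.5, 4.0]], index=countries, columns=years
    ),
}


def to_long(frame, name):
    """Melt one wide frame into a Series indexed by (Country Name, year)."""
    long = frame.reset_index().melt(
        id_vars=["Country Name"], var_name="year", value_name=name
    )
    long["year"] = long["year"].astype(int)  # string year labels -> integers
    return long.set_index(["Country Name", "year"])[name]


# Align the long-form series on their shared index, one column per variable,
# then convert the result into a Dataset with one data variable per series.
combined = pd.concat(
    [to_long(frame, name) for name, frame in frames.items()], axis=1
).to_xarray()
print(combined)
```

Because all series share the same (Country Name, year) index, each one becomes its own 2D data variable in the resulting Dataset.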

