How to massage data from multiple columns and multiple files into a single data frame?

Question

I have the following data frame:

  sp_id         sp_dt          v1      v1      v3

x1|x2|x30|x40   2018-10-07     100     200     300 
x1|x2|x30|x40   2018-10-14     80       80      90  
x1|x2|x30|x40   2018-10-21     34       35      36 
x1|x2|x31|x41   2018-10-07     100     200     300 
x1|x2|x31|x41   2018-10-14     80       80      90  
x1|x2|x31|x41   2018-10-21     34       35      36   
....
x1|x2|x39|x49   2018-10-21     340      350     36

and an Excel file that has the following data (each sheet in the workbook may contain multiple variables such as v4 and v5, as shown below, and possibly v6 in another sheet):

Variable      sp_partid1  sp_partid2    2018-10-07  ... 2018-10-21
  v4            x30         x40              160     ...   154
  v4            x31         x41              59      ...   75
  ....
  v4            x39         x49              75      ...   44
  v5            x30         x40              16      ...   24
  v5            x31         x41              59      ...   79
  ....
  v5            x39         x49              75      ...   34

sp_partid1 and sp_partid2 are optional columns. Each one holds a part of the sp_id column in the top data frame. The file can have none of these columns or, in this specific example, up to 4 of them, each holding one part of the sp_id column in the data frame at the top.

The final output should look like:

  sp_id         sp_dt          v1      v1      v3     v4    v5
x1|x2|x30|x40   2018-10-07     100     200     300    160   16  
x1|x2|x30|x40   2018-10-14     80       80      90    ...   ...
x1|x2|x30|x40   2018-10-21     34       35      36    154   24
x1|x2|x31|x41   2018-10-07     100     200     300    59    59
x1|x2|x31|x41   2018-10-14     80       80      90    ...   ...
x1|x2|x31|x41   2018-10-21     34       35      36    75    79
....
x1|x2|x39|x49   2018-10-21     340      350     36    44    34

Edit1 starts: How is the output generated?

get a list of variables
check if the variable(say v4 in this case) exists in any sheet
if it does:
  does it have any "part of sp_id" 
  #In the example shown sp_partid1 and sp_partid2 of excel sheets 
  #are part of sp_id of dataframe.
  if yes:
  #it means the part of sp_id is common for all values. (x1|x2) in this case. 
      add a new column to dataframe, v4, which has sp_id, sp_dt and,
      the value of that date 
  if no:
  #it means the whole sp_id is common for all values. (x1|x2|x3|x4) in this case and not shown in example.
      add a new column to dataframe, v4, and copy the value under the appropriate dates in excel sheet into corresponding v4 values and sp_dt

As an example, 160 is the value under 2018-10-07 for v4, x30, x40, so v4 in the final output shows 160 in the first row.

Edit1 ends

I started my code with:

import pandas as pd

# df is the top data frame, which I have not gotten around to using yet
# var_value gets values in a loop, like 'v4', 'v5', ...
# sheets is assumed to be a list of the sheet names in the workbook

sheets_dict = {name: pd.read_excel('excel_file.xlsx', sheet_name=name, parse_dates=True) for name in sheets}

for key, value in sheets_dict.items():
    if 'Variable' in value.columns:
        # 'Variable' column exists in this sheet
        if var_value in value['Variable'].values:
            # var_value exists in 'Variable' column (say, v4)
            for column in value.columns:
                # str() guards against date headers that are not strings
                if str(column).startswith('sp_'):
                    # Do something with column values, then map the values etc.
                    pass

Answer 1

Assuming one of your Excel sheets has the data below,

  Variable sp_partid1 sp_partid2  2018-10-07  2018-10-08  2018-10-21
0       v4        x30        x40         160        10.0         154
1       v4        x31        x41          59         NaN          75
2       v4        x32        x42          75        10.0          44
3       v5        x30        x40          16        10.0          24
4       v5        x31        x41          59        10.0          79
5       v5        x32        x42          75        10.0          34

you can use a combination of the pandas melt and pivot_table functions to get the desired result.

import pandas as pd
book= pd.read_excel('del.xlsx',sheet_name=None)
for df in book.values():
    df=df.melt(id_vars=['Variable','sp_partid1','sp_partid2'], var_name="Date", value_name="Value")
    # concatenate strings of two columns separated by a '|'
    df['sp_id'] = df['sp_partid1'] +'|'+ df['sp_partid2']
    df = df.loc[:,['Variable', 'sp_id','Date','Value']]
    df = df.pivot_table('Value', ['sp_id','Date'], 'Variable').reset_index( drop=False )
    print(df)  

>> output
Variable    sp_id        Date     v4    v5
0         x30|x40  2018-10-07  160.0  16.0
1         x30|x40  2018-10-08   10.0  10.0
2         x30|x40  2018-10-21  154.0  24.0
3         x31|x41  2018-10-07   59.0  59.0
4         x31|x41  2018-10-08    NaN  10.0
5         x31|x41  2018-10-21   75.0  79.0
6         x32|x42  2018-10-07   75.0  75.0
7         x32|x42  2018-10-08   10.0  10.0
8         x32|x42  2018-10-21   44.0  34.0

Reading an Excel workbook with sheet_name=None gives a dictionary with the worksheet name as key and a data frame as value.
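Since the question says the sp_partid columns are optional, the reshaping step above could detect them instead of hard-coding them. The following is only a sketch under assumptions: reshape_sheet is a hypothetical helper name, every column that is not 'Variable' or an sp_partid column is assumed to be a date column, and a small in-memory frame stands in for one worksheet. The date column is named sp_dt here only to match the top data frame.

```python
import pandas as pd


def reshape_sheet(sheet: pd.DataFrame) -> pd.DataFrame:
    """Turn one wide sheet (one column per date) into one row per id/date.

    reshape_sheet is a hypothetical helper, not part of pandas.
    """
    # Detect the optional id-part columns instead of hard-coding them.
    part_cols = [c for c in sheet.columns if str(c).startswith("sp_partid")]
    # Assumption: every remaining column except 'Variable' is a date column.
    long = sheet.melt(id_vars=["Variable", *part_cols],
                      var_name="sp_dt", value_name="Value")
    wide = long.pivot_table(index=[*part_cols, "sp_dt"],
                            columns="Variable", values="Value").reset_index()
    wide.columns.name = None  # drop the leftover "Variable" axis label
    return wide


# Small in-memory stand-in for one worksheet (values from the example above).
sheet = pd.DataFrame({
    "Variable": ["v4", "v4", "v5", "v5"],
    "sp_partid1": ["x30", "x31", "x30", "x31"],
    "sp_partid2": ["x40", "x41", "x40", "x41"],
    "2018-10-07": [160, 59, 16, 59],
    "2018-10-21": [154, 75, 24, 79],
})
wide = reshape_sheet(sheet)
print(wide)
```

When a sheet has no sp_partid columns at all, the same function pivots on the date alone, which corresponds to the "whole sp_id is common" branch in the question.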

Answer 2

What you are trying to do makes sense, but it is quite a long sequence of operations, so it is normal that you have some trouble implementing it. I think you should step back to the higher level of abstraction of relational databases, and use the high-level dataframe operations offered by pandas.

Let's summarize what you want to do, in terms of high-level operations:

Change the format of the sheets_dict dataframes so that they hold the same data, presented differently:

   id3           id4        date            v4         v5       
   x30           x40        2018-10-07      160        154
   x31           x41        2018-10-08      30         10

Split the ids of the original dataframe into several columns.
Join the resulting dataframes with the original one on id and date.

I can't give you a precise implementation, as your specification is still quite vague, even though the global goal is clear. Also, I don't have a reference to point you to on relational databases, but I highly recommend that you read up on them; it will save you a lot of time, especially if you often have to perform such tasks.
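As a rough illustration only, the split and join steps listed above might look like the sketch below. It rests on assumptions that are not in the question: the sheet data has already been reshaped into one row per id and date, the column names id3 and id4 are placeholders for the third and fourth parts of sp_id, and the date column has the same dtype in both frames (plain strings here).

```python
import pandas as pd

# Stand-in for the original data frame from the question.
df = pd.DataFrame({
    "sp_id": ["x1|x2|x30|x40", "x1|x2|x30|x40", "x1|x2|x31|x41"],
    "sp_dt": ["2018-10-07", "2018-10-21", "2018-10-07"],
    "v1": [100, 34, 100],
})

# Stand-in for the reshaped sheet data (id3/id4 are hypothetical names).
extra = pd.DataFrame({
    "id3": ["x30", "x30", "x31"],
    "id4": ["x40", "x40", "x41"],
    "sp_dt": ["2018-10-07", "2018-10-21", "2018-10-07"],
    "v4": [160, 154, 59],
    "v5": [16, 24, 59],
})

# Split sp_id on '|' into separate columns (0, 1, 2, 3).
parts = df["sp_id"].str.split("|", expand=True)
keyed = df.assign(id3=parts[2], id4=parts[3])

# Left join keeps every row of the original frame, matched or not.
merged = (
    keyed.merge(extra, on=["id3", "id4", "sp_dt"], how="left")
         .drop(columns=["id3", "id4"])
)
print(merged)
```

A left join is used so that rows of the original frame without a match in the sheet stay in the result with missing values rather than being dropped.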

Answer 1
0 2019-08-16 08:57:23

Answer 2
0 2019-08-16 09:10:44
