How do I massage data from multiple columns and multiple files into a single data frame?
I have the following data frame:
sp_id sp_dt v1 v1 v3
x1|x2|x30|x40 2018-10-07 100 200 300
x1|x2|x30|x40 2018-10-14 80 80 90
x1|x2|x30|x40 2018-10-21 34 35 36
x1|x2|x31|x41 2018-10-07 100 200 300
x1|x2|x31|x41 2018-10-14 80 80 90
x1|x2|x31|x41 2018-10-21 34 35 36
....
x1|x2|x39|x49 2018-10-21 340 350 36
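and an Excel file with the following data (each sheet in the Excel file may contain several variables, such as v4 and v5 as shown below, and another sheet may contain v6):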
Variable sp_partid1 sp_partid2 2018-10-07 ... 2018-10-21
v4 x30 x40 160 ... 154
v4 x31 x41 59 ... 75
....
v4 x39 x49 75 ... 44
v5 x30 x40 16 ... 24
v5 x31 x41 59 ... 79
....
v5 x39 x49 75 ... 34
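sp_partid1 and sp_partid2 are optional columns. They are "part of sp_id" columns, that is, pieces of the sp_id column in the top data frame. A sheet may have no such columns at all or, in this particular example, up to 4 of them, each one being a part of the sp_id column in the top data frame.
The final output should look like this: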
sp_id sp_dt v1 v1 v3 v4 v5
x1|x2|x30|x40 2018-10-07 100 200 300 160 16
x1|x2|x30|x40 2018-10-14 80 80 90 ... ...
x1|x2|x30|x40 2018-10-21 34 35 36 154 24
x1|x2|x31|x41 2018-10-07 100 200 300 59 59
x1|x2|x31|x41 2018-10-14 80 80 90 ... ...
x1|x2|x31|x41 2018-10-21 34 35 36 75 79
....
x1|x2|x39|x49 2018-10-21 340 350 36 44 34
Edit 1 start: how is the output generated?
get a list of variables
check if the variable (say v4 in this case) exists in any sheet
if it does:
    does it have any "part of sp_id" column?
    # In the example shown, sp_partid1 and sp_partid2 in the Excel sheets
    # are parts of the sp_id of the data frame.
    if yes:
        # it means part of the sp_id is common to all values (x1|x2 in this case).
        add a new column, v4, to the data frame, matched on sp_id and sp_dt,
        holding the value for that date
    if no:
        # it means the whole sp_id is common to all values (x1|x2|x3|x4 in this case; not shown in the example).
        add a new column, v4, to the data frame and copy the value under each date in the Excel sheet into the v4 value for the corresponding sp_dt
For example, 160 is the value for v4, x30, x40 under 2018-10-07, so v4 in the first row of the final output shows 160.
Edit 1 end.
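The steps listed in Edit 1 could be sketched roughly as below. This is only one possible reading of them, not code from the question: the helper name add_variable and the temporary _key column are hypothetical, the part columns are assumed to be the trailing components of sp_id, and the sheet's date headers are assumed to be strings in the same format as sp_dt.

```python
import pandas as pd

def add_variable(df, sheet, var):
    """Attach variable `var` from a wide sheet to `df` as a new column."""
    rows = sheet[sheet["Variable"] == var]
    # "part of sp_id" columns are assumed to be the ones starting with 'sp_'
    part_cols = [c for c in rows.columns if str(c).startswith("sp_")]
    # wide to long: one row per (part columns..., sp_dt)
    long = rows.drop(columns="Variable").melt(
        id_vars=part_cols, var_name="sp_dt", value_name=var
    )
    out = df.copy()
    keys = ["sp_dt"]
    if part_cols:
        # Assumption: the part columns are the last n pieces of sp_id.
        n = len(part_cols)
        long["_key"] = long[part_cols].apply("|".join, axis=1)
        out["_key"] = out["sp_id"].str.split("|").str[-n:].str.join("|")
        long = long.drop(columns=part_cols)
        keys.append("_key")
    # With no part columns, the value for a date applies to every sp_id.
    out = out.merge(long, how="left", on=keys)
    return out.drop(columns="_key", errors="ignore")

df = pd.DataFrame({
    "sp_id": ["x1|x2|x30|x40", "x1|x2|x31|x41"],
    "sp_dt": ["2018-10-07", "2018-10-07"],
    "v1": [100, 100],
})
sheet = pd.DataFrame({
    "Variable": ["v4", "v4", "v5"],
    "sp_partid1": ["x30", "x31", "x30"],
    "sp_partid2": ["x40", "x41", "x40"],
    "2018-10-07": [160, 59, 16],
})
# A sheet without part columns; v7 and its value are made up for illustration.
sheet_no_parts = pd.DataFrame({"Variable": ["v7"], "2018-10-07": [5]})

out = add_variable(df, sheet, "v4")
out_all = add_variable(df, sheet_no_parts, "v7")
print(out)
print(out_all)
```

If the Excel headers are read as datetime objects, they would need to be converted to the same type as sp_dt before the merge.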
I started with the following code:
import pandas as pd

df  # is the top data frame, which I have not gotten around to using yet
var_value  # gets values in a loop, like 'v4', 'v5', ...
sheets_dict = {name: pd.read_excel('excel_file.xlsx', sheet_name=name, parse_dates=True) for name in sheets}
for key, value in sheets_dict.items():
    if 'Variable' in value.columns:
        # 'Variable' column exists in this sheet
        if var_value in value['Variable'].values:
            # var_value exists in 'Variable' column (say, v4)
            for column in value.columns:
                if column.startswith('sp_'):
                    # Do something with column values, then map the values etc.
                    pass
Suppose one of your Excel sheets contains the following data:
Variable sp_partid1 sp_partid2 2018-10-07 2018-10-08 2018-10-21
0 v4 x30 x40 160 10.0 154
1 v4 x31 x41 59 NaN 75
2 v4 x32 x42 75 10.0 44
3 v5 x30 x40 16 10.0 24
4 v5 x31 x41 59 10.0 79
5 v5 x32 x42 75 10.0 34
You can use a combination of the pandas melt and pivot_table functions to get the desired result.
import pandas as pd

# sheet_name=None reads every sheet of the workbook
book = pd.read_excel('del.xlsx', sheet_name=None)
for df in book.values():
    # wide to long: one row per (Variable, sp_partid1, sp_partid2, Date)
    df = df.melt(id_vars=['Variable', 'sp_partid1', 'sp_partid2'], var_name='Date', value_name='Value')
    # concatenate strings of two columns separated by a '|'
    df['sp_id'] = df['sp_partid1'] + '|' + df['sp_partid2']
    df = df.loc[:, ['Variable', 'sp_id', 'Date', 'Value']]
    # long to wide again: one column per variable
    df = df.pivot_table('Value', ['sp_id', 'Date'], 'Variable').reset_index(drop=False)
    print(df)
>> output
Variable sp_id Date v4 v5
0 x30|x40 2018-10-07 160.0 16.0
1 x30|x40 2018-10-08 10.0 10.0
2 x30|x40 2018-10-21 154.0 24.0
3 x31|x41 2018-10-07 59.0 59.0
4 x31|x41 2018-10-08 NaN 10.0
5 x31|x41 2018-10-21 75.0 79.0
6 x32|x42 2018-10-07 75.0 75.0
7 x32|x42 2018-10-08 10.0 10.0
8 x32|x42 2018-10-21 44.0 34.0
Reading an Excel workbook with sheet_name=None returns a dictionary with the worksheet name as key and the data frame as value.
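Since variables can be spread over several sheets (v4 and v5 in one, v6 in another), it may help to stack all sheets before pivoting. The sketch below assumes every sheet has the same id columns; the dictionary stands in for what pd.read_excel(..., sheet_name=None) would return, and the sheet names and v6 values are made up for illustration.

```python
import pandas as pd

# Stand-in for pd.read_excel(path, sheet_name=None): {sheet name: DataFrame}.
# Sheet names and the v6 values are invented for this example.
book = {
    "Sheet1": pd.DataFrame({
        "Variable": ["v4", "v5"],
        "sp_partid1": ["x30", "x30"],
        "sp_partid2": ["x40", "x40"],
        "2018-10-07": [160, 16],
        "2018-10-21": [154, 24],
    }),
    "Sheet2": pd.DataFrame({
        "Variable": ["v6"],
        "sp_partid1": ["x30"],
        "sp_partid2": ["x40"],
        "2018-10-07": [1],
        "2018-10-21": [2],
    }),
}

id_cols = ["Variable", "sp_partid1", "sp_partid2"]

# Melt each sheet to long format, then stack all sheets into one frame.
long = pd.concat(
    [sheet.melt(id_vars=id_cols, var_name="Date", value_name="Value")
     for sheet in book.values()],
    ignore_index=True,
)
long["sp_id"] = long["sp_partid1"] + "|" + long["sp_partid2"]

# A single pivot over the combined data gives one column per variable.
wide = long.pivot_table(
    values="Value", index=["sp_id", "Date"], columns="Variable"
).reset_index()
wide.columns.name = None  # drop the leftover 'Variable' label on the columns
print(wide)
```

Note that pivot_table aggregates with the mean by default, so duplicate (sp_id, Date, Variable) rows across sheets would be averaged rather than reported.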
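What you are trying to do makes sense, but the sequence of operations is long, so it is normal to run into some trouble implementing it. I think you should step back to the higher level of abstraction of relational databases and use the high-level data frame operations that pandas provides.
Let's summarize the high-level operations you want to perform:
Reshape the data in sheet_dicts so that it holds the same data but is presented differently:
id3 id4 date v4 v5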
x30 x40 2018-10-07 160 154
x31 x41 2018-10-08 30 10
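Once the sheet data is in that shape, the rest can be treated as a relational join. The following is a minimal sketch under assumptions: the column names id1 to id4 are hypothetical labels for the four pieces of sp_id, and the sample rows are taken from the question rather than from the table above.

```python
import pandas as pd

# A few rows of the top data frame from the question.
df = pd.DataFrame({
    "sp_id": ["x1|x2|x30|x40", "x1|x2|x30|x40", "x1|x2|x31|x41"],
    "sp_dt": ["2018-10-07", "2018-10-21", "2018-10-07"],
    "v1": [100, 34, 100],
})

# Sheet data already reshaped to one row per (id3, id4, date).
sheet = pd.DataFrame({
    "id3": ["x30", "x30", "x31"],
    "id4": ["x40", "x40", "x41"],
    "date": ["2018-10-07", "2018-10-21", "2018-10-07"],
    "v4": [160, 154, 59],
    "v5": [16, 24, 59],
})

# Split sp_id into its four components (id1..id4 are hypothetical names).
parts = df["sp_id"].str.split("|", expand=True)
parts.columns = ["id1", "id2", "id3", "id4"]
keyed = pd.concat([df, parts], axis=1)

# Relational-style left join on the shared key columns.
merged = keyed.merge(
    sheet.rename(columns={"date": "sp_dt"}),
    how="left",
    on=["id3", "id4", "sp_dt"],
)

# Drop the helper columns to get back to the original layout plus v4 and v5.
result = merged.drop(columns=["id1", "id2", "id3", "id4"])
print(result)
```

A left join keeps every row of the top data frame; rows with no match in the sheet would get NaN in v4 and v5.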
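Even though the overall goal is clear, I cannot give you a precise implementation, because your specification is still rather vague. I also have no references at hand to point you to on relational databases, but I strongly recommend reading up on them; it will save you a lot of time, especially if you have to do this kind of task often.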