I have the following data frame:
sp_id sp_dt v1 v2 v3
x1|x2|x30|x40 2018-10-07 100 200 300
x1|x2|x30|x40 2018-10-14 80 80 90
x1|x2|x30|x40 2018-10-21 34 35 36
x1|x2|x31|x41 2018-10-07 100 200 300
x1|x2|x31|x41 2018-10-14 80 80 90
x1|x2|x31|x41 2018-10-21 34 35 36
....
x1|x2|x39|x49 2018-10-21 340 350 36
and an Excel file that has the following data (each sheet in the Excel file may contain multiple variables, like v4 and v5 as shown below, and possibly v6 in another sheet):
Variable sp_partid1 sp_partid2 2018-10-07 ... 2018-10-21
v4 x30 x40 160 ... 154
v4 x31 x41 59 ... 75
....
v4 x39 x49 75 ... 44
v5 x30 x40 16 ... 24
v5 x31 x41 59 ... 79
....
v5 x39 x49 75 ... 34
sp_partid1 and sp_partid2 are optional columns. Each one is a part of the sp_id column in the top data frame. The file can have none or, in this specific example, up to 4 such columns, each a part of the sp_id column in the top data frame.
The final output should look like:
sp_id sp_dt v1 v2 v3 v4 v5
x1|x2|x30|x40 2018-10-07 100 200 300 160 16
x1|x2|x30|x40 2018-10-14 80 80 90 ... ...
x1|x2|x30|x40 2018-10-21 34 35 36 154 24
x1|x2|x31|x41 2018-10-07 100 200 300 59 59
x1|x2|x31|x41 2018-10-14 80 80 90 ... ...
x1|x2|x31|x41 2018-10-21 34 35 36 75 79
....
x1|x2|x39|x49 2018-10-21 340 350 36 44 34
Edit1 starts: How is the output generated?
get a list of variables
check if the variable (say v4 in this case) exists in any sheet
if it does:
    does it have any "part of sp_id"
    # In the example shown, sp_partid1 and sp_partid2 of the excel sheets
    # are part of sp_id of the dataframe.
    if yes:
        # it means the part of sp_id is common for all values. (x1|x2) in this case.
        add a new column to dataframe, v4, which has sp_id, sp_dt and,
        the value of that date
    if no:
        # it means the whole sp_id is common for all values. (x1|x2|x3|x4) in this case and not shown in example.
        add a new column to dataframe, v4, and copy the value under the appropriate dates in excel sheet into corresponding v4 values and sp_dt
As an example, 160 is the value under 2018-10-07 for v4, x30, x40, so v4 in the final output shows 160 in the first row.
Edit1 ends:
I started my code with:
import pandas as pd

# df is the top data frame, which I have not gotten around to using yet
# var_value gets values in a loop, like 'v4', 'v5', ...
# sheets is the list of sheet names to read
sheets_dict = {name: pd.read_excel('excel_file.xlsx', sheet_name=name, parse_dates=True) for name in sheets}
for key, value in sheets_dict.items():
    if 'Variable' in value.columns:
        # 'Variable' column exists in this sheet
        if var_value in value['Variable'].values:
            # var_value exists in 'Variable' column (say, v4)
            for column in value.columns:
                # str() guards against date headers that are not plain strings
                if str(column).startswith('sp_'):
                    # Do something with column values, then map the values etc
                    pass
Assuming one of your Excel sheets has the data below,
Variable sp_partid1 sp_partid2 2018-10-07 2018-10-08 2018-10-21
0 v4 x30 x40 160 10.0 154
1 v4 x31 x41 59 NaN 75
2 v4 x32 x42 75 10.0 44
3 v5 x30 x40 16 10.0 24
4 v5 x31 x41 59 10.0 79
5 v5 x32 x42 75 10.0 34
you can use a combination of the pandas melt and pivot_table functions to get the desired result.
import pandas as pd

book = pd.read_excel('del.xlsx', sheet_name=None)
for df in book.values():
    df = df.melt(id_vars=['Variable', 'sp_partid1', 'sp_partid2'], var_name="Date", value_name="Value")
    # concatenate strings of two columns separated by a '|'
    df['sp_id'] = df['sp_partid1'] + '|' + df['sp_partid2']
    df = df.loc[:, ['Variable', 'sp_id', 'Date', 'Value']]
    df = df.pivot_table(values='Value', index=['sp_id', 'Date'], columns='Variable').reset_index(drop=False)
    print(df)
>> output
Variable sp_id Date v4 v5
0 x30|x40 2018-10-07 160.0 16.0
1 x30|x40 2018-10-08 10.0 10.0
2 x30|x40 2018-10-21 154.0 24.0
3 x31|x41 2018-10-07 59.0 59.0
4 x31|x41 2018-10-08 NaN 10.0
5 x31|x41 2018-10-21 75.0 79.0
6 x32|x42 2018-10-07 75.0 75.0
7 x32|x42 2018-10-08 10.0 10.0
8 x32|x42 2018-10-21 44.0 34.0
Reading an Excel workbook with sheet_name=None gives a dictionary with the worksheet name as key and a data frame as value.
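The pivoted result still has to be attached to the top data frame. A minimal sketch of that last step follows, using small in-memory stand-ins instead of the real files. The names top, wide and part_key are hypothetical, and it assumes that the sheet key always matches the last two '|'-separated parts of sp_id and that sp_dt and Date are both real dates.

```python
import pandas as pd

# Hypothetical stand-in for the top data frame from the question.
top = pd.DataFrame({
    "sp_id": ["x1|x2|x30|x40", "x1|x2|x30|x40", "x1|x2|x31|x41"],
    "sp_dt": pd.to_datetime(["2018-10-07", "2018-10-21", "2018-10-07"]),
    "v1": [100, 34, 100],
})

# Hypothetical stand-in for the pivoted sheet printed above.
wide = pd.DataFrame({
    "sp_id": ["x30|x40", "x30|x40", "x31|x41"],
    "Date": pd.to_datetime(["2018-10-07", "2018-10-21", "2018-10-07"]),
    "v4": [160.0, 154.0, 59.0],
    "v5": [16.0, 24.0, 59.0],
})

# Assumption: the sheet key is the last two '|'-separated parts of sp_id.
top["part_key"] = top["sp_id"].str.split("|").str[-2:].str.join("|")

# Left join keeps every row of the top data frame, even without a match.
merged = (
    top.merge(
        wide.rename(columns={"sp_id": "part_key", "Date": "sp_dt"}),
        on=["part_key", "sp_dt"],
        how="left",
    )
    .drop(columns="part_key")
)
print(merged)
```

If the date headers in the sheet come back as strings, converting them with pd.to_datetime before the merge keeps the two date columns comparable.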
What you are trying to do makes sense, but it is quite a long sequence of operations, so it is normal that you have some trouble implementing it. I think you should step back to the higher level of abstraction of relational databases, and use the high-level dataframe operations offered by pandas.
Let's summarize what you want to do, in terms of high-level operations:
Transform the sheets_dict dataframes so that they hold the same data, but presented differently:
id3 id4 date v4 v5
x30 x40 2018-10-07 160 154
x31 x41 2018-10-08 30 10
I can't give you a precise implementation, as your specification is still quite vague, even though the global goal is clear. Also, I don't have a reference to guide you on relational databases, but I highly recommend that you read up on them; it will save you a lot of time, especially if you often have to perform such tasks.
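To make the relational view a bit more concrete, here is a rough sketch under stated assumptions, not a precise implementation of the specification. The function name join_sheet and the id1..idN column names are hypothetical. It assumes the sheet's sp_ columns line up, in order, with the trailing parts of sp_id, that those parts are strings, and that every other sheet column is a date. A sheet with no sp_ columns is joined on the date alone.

```python
import pandas as pd

def join_sheet(top, sheet):
    """Left-join the variables of one wide sheet onto the top data frame."""
    part_cols = [c for c in sheet.columns if str(c).startswith("sp_")]

    # Wide to long: one row per (variable, parts, date).
    tidy = sheet.melt(id_vars=["Variable"] + part_cols,
                      var_name="sp_dt", value_name="value")
    tidy["sp_dt"] = pd.to_datetime(tidy["sp_dt"])

    # Long to one column per variable (v4, v5, ...).
    wide = tidy.pivot_table(index=part_cols + ["sp_dt"],
                            columns="Variable", values="value").reset_index()
    wide.columns.name = None

    # Split sp_id into id1..idN so the parts become ordinary join keys.
    ids = top["sp_id"].str.split("|", expand=True)
    ids.columns = [f"id{i + 1}" for i in range(ids.shape[1])]

    # Assumption: the sheet's part columns are the trailing parts of sp_id.
    keys = list(ids.columns[len(ids.columns) - len(part_cols):])
    wide = wide.rename(columns=dict(zip(part_cols, keys)))

    out = pd.concat([top, ids], axis=1).merge(
        wide, on=keys + ["sp_dt"], how="left")
    return out.drop(columns=list(ids.columns))

# Small hypothetical inputs shaped like the ones in the question.
top = pd.DataFrame({
    "sp_id": ["x1|x2|x30|x40", "x1|x2|x30|x40", "x1|x2|x31|x41"],
    "sp_dt": pd.to_datetime(["2018-10-07", "2018-10-21", "2018-10-07"]),
    "v1": [100, 34, 100],
})
sheet = pd.DataFrame({
    "Variable": ["v4", "v4", "v5", "v5"],
    "sp_partid1": ["x30", "x31", "x30", "x31"],
    "sp_partid2": ["x40", "x41", "x40", "x41"],
    "2018-10-07": [160, 59, 16, 59],
    "2018-10-21": [154, 75, 24, 79],
})
result = join_sheet(top, sheet)
print(result)
```

Calling join_sheet once per sheet and feeding each result into the next call would add the variables of every sheet in turn, as long as no variable name appears in more than one sheet.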