This is about using Python and pandas to read an Excel file; I have not been able to find a working example.
My file's name is something like:
2018 Historical Banking Record For Branch 12345.xlsx
The Excel file has content like the sample below (sorry, I don't know how to attach the file to this post):
2 CD ABC PRODUCT
MA RI NH CT VT CA CR DE PHI NJ ON FL WA DX HW AK MI IL
01/01/18 1.01 1.61 1.80 1.46 1.69 1.73 1.64 1.64 1.74 1.71 1.68 1.74 1.68 1.87 1.77 2.04 2.05 1.76
01/08/18 2.01 2.61 2.80 2.46 2.69 2.73 2.64 2.64 2.74 2.71 2.68 1.73 1.67 1.84 1.74 2.06 2.04 1.76
01/15/18 3.01 3.61 3.80 3.46 3.69 3.73 3.64 3.64 3.74 3.71 3.68 1.74 1.68 1.86 1.75 2.06 2.04 1.76
01/22/18 4.01 4.61 4.80 4.46 4.69 4.73 4.64 4.64 4.74 4.71 4.68 1.76 1.74 1.73 1.66 1.93 1.84 1.87
01/29/18 5.01 5.61 5.80 5.46 2.01 5.73 1.82 5.64 5.74 5.71 5.68 1.74 1.72 1.71 1.62 1.91 1.82 1.85
My code is something like this:
import pandas as pd
# note: xlrd 2.0+ dropped .xlsx support, so engine='openpyxl' may be needed
xl = pd.ExcelFile("../data/sample.xlsx", engine='xlrd')
I am able to get the cell values of the first row with
xl.book._sharedstrings[0] through xl.book._sharedstrings[18]
What I need is to loop over all the rows and get every cell's value.
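For reference, looping over rows is much easier on the dataframe that read_excel returns than through the xlrd book internals. A minimal sketch, using a small in-memory stand-in for the sheet above (only 3 of the 18 regions shown):

```python
import pandas as pd

# Stand-in for the sheet above; in practice this frame would come from
#   df = pd.read_excel("../data/sample.xlsx", header=1)
df = pd.DataFrame(
    {"MA": [1.01, 2.01], "RI": [1.61, 2.61], "NH": [1.80, 2.80]},
    index=["01/01/18", "01/08/18"],
)

# itertuples yields one named tuple per row; .Index carries the date
rows = [(row.Index, row.MA, row.RI, row.NH) for row in df.itertuples()]
for p_date, ma, ri, nh in rows:
    print(p_date, ma, ri, nh)
```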
Eventually I need to generate a new dataframe with the structure like below:
product p_date region p_value c_date eom
CD ABC PRODUCT 01/01/18 MA 1.01 18/10/24 18/10/31
The fields are explained as follows:
p_date: should be from the first column:
01/01/18 01/08/18 01/15/18 01/22/18 01/29/18
region:
MA RI NH CT ....
p_value: the decimal under each region, e.g. 1.01
There are 18 regions in this sheet, so 18 records will be created in the new dataframe for each p_date.
I am able to get all the cells except the first column, which holds p_date:
01/01/18
01/08/18
01/15/18
01/22/18
01/29/18
The dates seem to come from a "Series", but I don't know how to retrieve the values from it.
I can use list(df["MA"]) to convert the Series df["MA"] to a list, but I still cannot get the p_date values.
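When the dates end up in the index (as they do if the first column is read as the index), they can be pulled out just like a column. A short sketch on a stand-in frame:

```python
import pandas as pd

# Stand-in frame with the dates as the index, as index_col=0 would give
df = pd.DataFrame(
    {"MA": [1.01, 2.01, 3.01]},
    index=["01/01/18", "01/08/18", "01/15/18"],
)

# The dates live in the index, not in any column
p_dates = df.index.tolist()

# Alternatively, promote the index to an ordinary column first
df2 = df.reset_index().rename(columns={"index": "p_date"})
p_dates_too = list(df2["p_date"])
```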
Ideally I need to loop over each row while generating/appending to the dataframe:
cur_row = [wampproduct, wamp_date, wampregion, rsp, wamp, date_pull, eom]
df_row = pd.DataFrame(data=[cur_row], columns=cols)  # wrap in a list so cur_row is one row
df = pd.concat([df, df_row], ignore_index=True)  # DataFrame.append was removed in pandas 2.0
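Growing a dataframe one row at a time is slow; a common pattern is to collect plain Python rows first and build the frame once at the end. A sketch with placeholder values standing in for whatever the loop reads from the sheet:

```python
import pandas as pd

cols = ["product", "p_date", "region", "p_value", "c_date", "eom"]

# Collect rows as lists; the literals here are placeholders
rows = []
for region, value in [("MA", 1.01), ("RI", 1.61)]:
    rows.append(["CD ABC PRODUCT", "01/01/18", region, value,
                 "18/10/24", "18/10/31"])

# One DataFrame construction instead of repeated appends
df = pd.DataFrame(rows, columns=cols)
```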
Thank you very much.
This type of operation is called a melt; it is essentially the inverse of pivoting a dataframe. Also, as Mathew pointed out in a comment, using read_excel is a bit simpler, since it returns a dataframe directly. The following code block performs the melt.
fname = '../data/sample.xlsx'
date_pull = pd.to_datetime('2018-10-18')
eom = pd.to_datetime('2018-10-31')
# get product name out of excel file
product = pd.read_excel(fname, nrows=1, header=None, usecols=[1])
product = product.iloc[0, 0]  # positional lookup; with usecols=[1] the column label is 1, not 0
product
# load data from the excel file
df = pd.read_excel(fname, header=1)
# rename index to p_date and make a column
df.index.rename('p_date', inplace=True)
df = df.reset_index()
# add product to df
df['product'] = product
# melt
df = pd.melt(df, id_vars=['product', 'p_date'], var_name='region', value_name='p_value')
# add c_date and eom to data frame
df['c_date'] = date_pull
df['eom'] = eom
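The melt step above can be exercised end to end on a small in-memory stand-in for the wide frame (a sketch; the real data would come from read_excel as shown):

```python
import pandas as pd

# Stand-in for the wide frame after the reset_index step
df = pd.DataFrame({
    "p_date": ["01/01/18", "01/08/18"],
    "MA": [1.01, 2.01],
    "RI": [1.61, 2.61],
})
df["product"] = "CD ABC PRODUCT"

# Melt: one output row per (p_date, region) pair
long = pd.melt(df, id_vars=["product", "p_date"],
               var_name="region", value_name="p_value")
long["c_date"] = pd.to_datetime("2018-10-18")
long["eom"] = pd.to_datetime("2018-10-31")
```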
With @alexdor's code plus my own, I am now able to generate the needed result, like below:
,product,p_date,region,p_value,c_date,eom
0,CD Short-Term WAMP,2010-01-01,MA,0.8763918845487475,201812,2018-12-31
1,CD Short-Term WAMP,2010-01-08,MA,0.8600652449166932,201812,2018-12-31
2,CD Short-Term WAMP,2010-01-15,MA,0.8593079486202981,201812,2018-12-31
To remove the sequence number (the row index), which will cause issues later, set index=False as below:
df_csv.to_csv(physical_file, index=False)