How to retrieve value from series created in pandas dataframe in python

This is about using Python and pandas to read an Excel file; I have not been able to find a working example.

My file's name is something like:

2018 Historical Banking Record For Branch 12345.xlsx

The Excel sheet has content like the following (sorry, I don't know how to attach the file to this post):

 2  CD ABC PRODUCT                                                                  
    MA  RI  NH  CT  VT  CA  CR  DE  PHI NJ  ON  FL  WA  DX  HW  AK  MI  IL
01/01/18    1.01    1.61    1.80    1.46    1.69    1.73    1.64    1.64    1.74    1.71    1.68    1.74    1.68    1.87    1.77    2.04    2.05    1.76
01/08/18    2.01    2.61    2.80    2.46    2.69    2.73    2.64    2.64    2.74    2.71    2.68    1.73    1.67    1.84    1.74    2.06    2.04    1.76
01/15/18    3.01    3.61    3.80    3.46    3.69    3.73    3.64    3.64    3.74    3.71    3.68    1.74    1.68    1.86    1.75    2.06    2.04    1.76
01/22/18    4.01    4.61    4.80    4.46    4.69    4.73    4.64    4.64    4.74    4.71    4.68    1.76    1.74    1.73    1.66    1.93    1.84    1.87
01/29/18    5.01    5.61    5.80    5.46    2.01    5.73    1.82    5.64    5.74    5.71    5.68    1.74    1.72    1.71    1.62    1.91    1.82    1.85

My code is something like below:

import pandas as pd
xl = pd.ExcelFile("../data/sample.xlsx", engine='xlrd')

I am able to get the values of the first row's cells with

xl.book._sharedstrings[0] ~ xl.book._sharedstrings[18]

How do I loop over all the rows and get every cell's value?

Eventually I need to generate a new dataframe with the structure like below:

product p_date region p_value c_date eom
CD ABC PRODUCT 01/01/18 MA 1.01 18/10/24 18/10/31

All the fields are explained below:

  1. product: for this sheet, it is always the same: CD ABC PRODUCT
  2. p_date: taken from the first column: 01/01/18 01/08/18 01/15/18 01/22/18 01/29/18
  3. region: MA RI NH CT ....
  4. p_value: the decimal under each region, e.g. 1.01
  5. c_date: today's date, 18/10/24
  6. eom: the last date of this month, 18/10/31

There are 18 regions in this sheet, meaning 18 records will be created in the new dataframe for each date row.
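For the c_date and eom fields described above, one possible way to compute them with pandas is sketched here. The helper name stamp_dates and the yy/mm/dd output format are assumptions made for illustration, chosen to match the 18/10/24 and 18/10/31 examples in the question.

```python
import pandas as pd

def stamp_dates(today=None):
    """Return (c_date, eom) as yy/mm/dd strings for the given day (default: today)."""
    today = pd.Timestamp.today().normalize() if today is None else pd.Timestamp(today)
    # MonthEnd(0) rolls forward to the last day of the month,
    # or leaves the date unchanged if it is already the last day
    eom = today + pd.offsets.MonthEnd(0)
    return today.strftime("%y/%m/%d"), eom.strftime("%y/%m/%d")

c_date, eom = stamp_dates("2018-10-24")
print(c_date, eom)
```

Calling stamp_dates() with no argument uses the current date instead of a fixed one.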

I am able to get all the cells except those in the first column, which holds p_date:

01/01/18
01/08/18
01/15/18
01/22/18
01/29/18

It seems to come from a "Series", but I don't know how to retrieve values from it.

I can use list(df["MA"]) to convert Series df["MA"] to a list, but I still cannot get the p_date.

Ideally I need to loop over each row when generating/appending to the dataframe:

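cur_row = [wampproduct, wamp_date, wampregion, rsp, wamp, date_pull, eom]
# wrap the row in a list so it becomes one row rather than one column
df_row = pd.DataFrame(columns=cols, data=[cur_row])
# DataFrame.append was removed in pandas 2.0; pd.concat replaces it
df = pd.concat([df, df_row], ignore_index=True)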
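If an explicit loop is wanted, a minimal sketch might look like the following. It assumes the dates ended up in the dataframe's index (which would explain why they do not show up as a regular column), and it uses a small made-up table named wide in place of the real sheet. Records are collected in a list and the dataframe is built once at the end, rather than appended to row by row.

```python
import pandas as pd

# Toy wide table standing in for the sheet: dates in the index, one column per region
wide = pd.DataFrame(
    {"MA": [1.01, 2.01], "RI": [1.61, 2.61]},
    index=["01/01/18", "01/08/18"],
)

records = []
for p_date, row in wide.iterrows():      # p_date is the index label of the row
    for region, p_value in row.items():  # row is a Series keyed by column name
        records.append({"product": "CD ABC PRODUCT", "p_date": p_date,
                        "region": region, "p_value": p_value})

# Build the dataframe once instead of appending row by row
long_df = pd.DataFrame(records)
print(long_df)
```

The index values themselves can also be read directly with list(wide.index), in the same way list(df["MA"]) reads a column.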

Thank you very much.

This type of operation is called a melt; it is essentially the inverse of pivoting a dataframe. Also, as Mathew pointed out in a comment, using read_excel is a bit simpler since it directly returns a dataframe. The following code block performs the melt.

import pandas as pd

fname = '../data/sample.xlsx'
date_pull = pd.to_datetime('2018-10-18')
eom = pd.to_datetime('2018-10-31')

# get product name out of excel file (second cell of the first row)
product = pd.read_excel(fname, nrows=1, header=None, usecols=[1])
# positional lookup, since the single column kept by usecols is not labeled 0
product = product.iloc[0, 0]
product

# load data from excel file; the second row holds the region headers,
# and index_col=0 makes the first column (the dates) the index
df = pd.read_excel(fname, header=1, index_col=0)

# rename index to p_date and make a column
df.index.rename('p_date', inplace=True)
df = df.reset_index()

# add product to df
df['product'] = product

# melt: one row per (p_date, region) pair
df = pd.melt(df, id_vars=['product', 'p_date'], var_name='region', value_name='p_value')

# add c_date and eom to data frame
df['c_date'] = date_pull
df['eom'] = eom
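To see what the melt step does without needing the Excel file, here is a small self-contained sketch on made-up data. The frame wide is hypothetical and only mimics the sheet's layout, with dates in the index and one column per region.

```python
import pandas as pd

# Hypothetical stand-in for the sheet after read_excel: dates as index, regions as columns
wide = pd.DataFrame(
    {"MA": [1.01, 2.01], "RI": [1.61, 2.61], "NH": [1.80, 2.80]},
    index=pd.Index(["01/01/18", "01/08/18"], name="p_date"),
)

tidy = wide.reset_index()            # move p_date from the index into a column
tidy["product"] = "CD ABC PRODUCT"   # constant column, as in the question
# every region column becomes a (region, p_value) pair of rows
tidy = pd.melt(tidy, id_vars=["product", "p_date"],
               var_name="region", value_name="p_value")
print(tidy)
```

Two dates and three regions give six rows, grouped by region in the order of the original columns.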

With @alexdor's code plus my own code, I am now able to generate the needed result, shown below:

,product,p_date,region,p_value,c_date,eom
0,CD Short-Term WAMP,2010-01-01,MA,0.8763918845487475,201812,2018-12-31
1,CD Short-Term WAMP,2010-01-08,MA,0.8600652449166932,201812,2018-12-31
2,CD Short-Term WAMP,2010-01-15,MA,0.8593079486202981,201812,2018-12-31

To drop the sequence number (the index column), which would cause issues later, pass index=False as below:

df_csv.to_csv(physical_file, index=False)
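As a quick illustration of the difference, the sketch below writes a one-row frame to a CSV string with and without the index. The sample row is made up, and to_csv is called without a path so that it returns the text instead of writing a file.

```python
import pandas as pd

df_csv = pd.DataFrame(
    {"product": ["CD ABC PRODUCT"], "region": ["MA"], "p_value": [1.01]}
)

with_index = df_csv.to_csv()                # header starts with an empty index column
without_index = df_csv.to_csv(index=False)  # header and rows contain only the data columns

print(with_index)
print(without_index)
```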


 