
Cleaning an Excel File with Python so it can be parsed with Pandas

I am trying to access a file located at https://www.cmegroup.com/CmeWS/exp/voiProductDetailsViewExport.ctl?media=xls&tradeDate=20180709&reportType=P&productId=425 and I am experiencing some difficulty. At first, my progress was stymied by some incorrect request headers, but now that I'm sending "User-Agent": "Mozilla/5.0" I'm getting a proper response.

When the file is downloaded (as an .xls), I notice that the same logo is pasted over and over again in the top-left corner (spanning roughly rows 1 through 3). As far as I understand, Pandas cannot parse a file that has an image in it. I have searched far and wide and have yet to find an example showing how to delete all instances of an image from an Excel file and leave only the text.

My plan was to find the image objects in the specific worksheet and delete them until only text data is left, but this is proving more difficult than expected. The code below currently raises TypeError: unsupported operand type(s) for <<: 'str' and 'int'. Any help or guidance would be greatly appreciated.

def get_sheet(self):
        # Accesses CME direct URL (at the moment...will add functionality for ICE later)
        # Gets sheet and puts it in dataframe
        #Returns dataframe sheet

        sheet_url = "http://www.cmegroup.com/CmeWS/exp/voiProductDetailsViewExport.ctl?media=xls&tradeDate="+str(self.date_of_report)+"&reportType="\
        + str(self.report_type)+"&productId=" + str(self.product)

        header = {
            "User-Agent": "Mozilla/5.0"
        }

        req = requests.get(url = sheet_url, headers = header)

        file_obj = io.StringIO(req.content.decode('ISO-8859-1'))

        data_sheet = pd.read_excel(file_obj)

        return data_sheet

EDIT: Please see the full stack trace below.

Traceback (most recent call last):
  File "OI_driver.py", line 16, in <module>
    OI_driver()
  File "OI_driver.py", line 10, in OI_driver
    front_month = mgd.Month_Data(product_dict["LO"], "06/27/2018", "P")
  File "D:\Open Interest Report Dev\month_graph_data.py", line 12, in __init__
    self.data_sheet = self.get_sheet()
  File "D:\Open Interest Report Dev\month_graph_data.py", line 30, in get_sheet
    data_sheet = pd.read_excel(file_obj)
  File "C:\Users\Tyler\Anaconda3\lib\site-packages\pandas\util\_decorators.py", line 177, in wrapper
    return func(*args, **kwargs)
  File "C:\Users\Tyler\Anaconda3\lib\site-packages\pandas\util\_decorators.py", line 177, in wrapper
    return func(*args, **kwargs)
  File "C:\Users\Tyler\Anaconda3\lib\site-packages\pandas\io\excel.py", line 307, in read_excel
    io = ExcelFile(io, engine=engine)
  File "C:\Users\Tyler\Anaconda3\lib\site-packages\pandas\io\excel.py", line 392, in __init__
    self.book = xlrd.open_workbook(file_contents=data)
  File "C:\Users\Tyler\Anaconda3\lib\site-packages\xlrd\__init__.py", line 162, in open_workbook
    ragged_rows=ragged_rows,
  File "C:\Users\Tyler\Anaconda3\lib\site-packages\xlrd\book.py", line 91, in open_workbook_xls
    biff_version = bk.getbof(XL_WORKBOOK_GLOBALS)
  File "C:\Users\Tyler\Anaconda3\lib\site-packages\xlrd\book.py", line 1267, in getbof
    opcode = self.get2bytes()
  File "C:\Users\Tyler\Anaconda3\lib\site-packages\xlrd\book.py", line 672, in get2bytes
    return (BYTES_ORD(hi) << 8) | BYTES_ORD(lo)
TypeError: unsupported operand type(s) for <<: 'str' and 'int'

How about saving the content to a local file first?

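import requests
import pandas as pd

url = "https://www.cmegroup.com/CmeWS/exp/voiProductDetailsViewExport.ctl?media=xls&tradeDate=20180709&reportType=P&productId=425"
# The question notes that a proper response only comes back when a User-Agent header is sent.
req = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
xls_file = "tmp.xls"

# req.content is bytes, so the file has to be opened in binary mode ("wb").
with open(xls_file, "wb") as f:
    f.write(req.content)

ds = pd.read_excel(xls_file)
print(ds)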
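
If leaving tmp.xls in the working directory is a concern, one possible variation is to write the bytes into a temporary directory instead. This is only a sketch: save_payload is a hypothetical helper name, and the sample bytes stand in for req.content so that the snippet runs without a network connection.

```python
import tempfile
from pathlib import Path


def save_payload(content: bytes, directory: str, name: str = "report.xls") -> Path:
    """Write raw response bytes to directory/name and return the path (hypothetical helper)."""
    path = Path(directory) / name
    path.write_bytes(content)  # binary write, so the bytes land on disk unchanged
    return path


# Arbitrary sample bytes standing in for req.content (an assumption for illustration).
sample = b"\xd0\xcf\x11\xe0 sample bytes standing in for req.content"

with tempfile.TemporaryDirectory() as tmp_dir:
    saved = save_payload(sample, tmp_dir)
    round_trip = saved.read_bytes()
    # In real use, pd.read_excel(saved) would be called here,
    # before the temporary directory is removed.

print(round_trip == sample)  # True
```

The file has to be parsed inside the with block, because the directory and everything in it are deleted when the block exits.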

This works for me:

import requests
import io
import pandas as pd

url = '......'
response = requests.get(url, stream=True, headers={'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3970.5 Safari/537.36'})
file_obj = io.BytesIO(response.content)
df = pd.read_excel(file_obj)
print(df)
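
For what it's worth, the traceback in the question ends in xlrd's get2bytes, which suggests the failure comes from how the data is handed to the parser rather than from the embedded logos. A minimal sketch of what seems to be going on, with no network access needed: indexing a bytes object yields integers, while indexing a str yields one-character strings, so an expression like (hi << 8) | lo only works on bytes. The sample payload and the combine helper below are illustrative assumptions, not xlrd's actual code.

```python
import io

# Arbitrary sample bytes standing in for the downloaded file (an assumption).
payload = b"\xd0\xcf\x11\xe0"

# What the question does: decode to text and wrap it in StringIO.
text_data = io.StringIO(payload.decode("ISO-8859-1")).read()
# What the answer above does: keep the raw bytes in BytesIO.
byte_data = io.BytesIO(payload).read()


def combine(hi, lo):
    """Combine two byte values into a 16-bit integer, like the failing line in the traceback."""
    return (hi << 8) | lo


# bytes indexing gives ints, so the shift works.
print(combine(byte_data[1], byte_data[0]))  # 53200

# str indexing gives 1-character strings, so the shift fails.
try:
    combine(text_data[1], text_data[0])
except TypeError as exc:
    print(exc)  # unsupported operand type(s) for <<: 'str' and 'int'
```

So passing response.content through io.BytesIO, or writing it to disk in binary mode, keeps the data as bytes all the way into the Excel reader.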


 