How can I extract data from an Excel sheet embedded in HTML using Python and BeautifulSoup?

I want to extract data from a table on a webpage so that I can average it, plot it, and work with it. I've tried using Python with BeautifulSoup to get the data, but the output still starts with Excel formatting code that looks like this:

<!--table
    {mso-displayed-decimal-separator:"\.";
    mso-displayed-thousand-separator:"\,";}
@page
    {margin:1.0in .75in 1.0in .75in;
    mso-header-margin:.51in;
    mso-footer-margin:.51in;}
.style0
    {mso-number-format:General;
    text-align:general;
    vertical-align:bottom;
    white-space:nowrap;
    mso-rotate:0;
    mso-background-source:auto;
...(more of the same)
...

-->

I've looked at the source code of the page and it includes:

<meta name=ProgId content=Excel.Sheet>
<meta name=Generator content="Microsoft Excel 14">

How can I extract the data in a meaningful way that preserves it and allows it to be manipulated? Thank you for your time.

My current script uses curl to download the HTML file, then opens the file, calls BeautifulSoup's get_text() on it, and saves the result to a text file.

Are you doing something like this?

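 from bs4 import BeautifulSoup  # BeautifulSoup 4; the bare `import BeautifulSoup` is version 3

 # `html` is the page source as a string; "mytableid" is a placeholder id
 s = BeautifulSoup(html, "html.parser")
 table = s.find("table", {"id": "mytableid"})
 if table is not None:  # find() returns None when no table matches
     rows = table.find_all('tr')
     for tr in rows:
         cols = tr.find_all('td')
         for td in cols:
             val = td.text  # text content of one cell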

I can't give you a better answer until you improve your question.
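
In the meantime, here is a rough sketch of one way this could go, assuming the Excel CSS sits inside a <style> block (or an HTML comment) and the data sits in the first <table> on the page. The sample HTML and the helper names extract_rows and column_mean are made up for illustration; they are not from your page. The idea is to remove the style and comment nodes before reading any text, then collect the cells row by row instead of calling get_text() on the whole document.

```python
from statistics import mean

from bs4 import BeautifulSoup, Comment

# Hypothetical stand-in for the Excel-exported page (not the asker's real data).
SAMPLE_HTML = """
<html><head>
<meta name=ProgId content=Excel.Sheet>
<style>
<!--table
.style0
    {mso-number-format:General;
    white-space:nowrap;}
-->
</style>
</head><body>
<table>
 <tr><td>Item</td><td>Value</td></tr>
 <tr><td>a</td><td>1.5</td></tr>
 <tr><td>b</td><td>2.5</td></tr>
</table>
</body></html>
"""


def extract_rows(html):
    """Return the first table as a list of rows, each a list of cell strings."""
    soup = BeautifulSoup(html, "html.parser")
    # Remove <style>/<script> blocks, which is where the mso-* CSS usually lives.
    for tag in soup(["style", "script"]):
        tag.decompose()
    # Remove any HTML comments left elsewhere in the document.
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        comment.extract()
    table = soup.find("table")
    if table is None:
        return []
    rows = []
    for tr in table.find_all("tr"):
        cells = [c.get_text(strip=True) for c in tr.find_all(["td", "th"])]
        if cells:
            rows.append(cells)
    return rows


def column_mean(rows, index, header=True):
    """Average one column, skipping the first row when it is a header."""
    data = rows[1:] if header else rows
    return mean(float(row[index]) for row in data)


rows = extract_rows(SAMPLE_HTML)
print(rows)
print(column_mean(rows, 1))
```

Once the rows are plain lists you can write them out with the csv module or average and plot them. Note that float() will fail on cells that are empty or that use thousands separators such as "1,234", so real data may need cleaning first.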
