
Python: parsing Beautiful Soup data to CSV

I have written code in Python 3 to parse an HTML/CSS table. I have a few issues with it:

  1. The header row of my CSV output file is not generated by my code from the HTML (tag: td, class: t1) on the first run, when the output file is created.
  2. If the incoming HTML table has a few additional fields (tag: td, class: t1), my code cannot currently capture them and create additional headers in the CSV output file.
  3. The data is not written to the output CSV file until ALL the IDs (A001, A002, A003...) from my input file have been processed. I want to write to the output CSV file as soon as each ID from my input file has been processed (i.e. A001 should be written to the CSV before A002 is processed).
  4. Whenever I rerun the code, the data does not begin on the next line of the output CSV.

Being a noob, I am sure my code is very rudimentary and that there is a better way to do this. I would like to learn to write it better and to fix the above as well.

I need advice and guidance; please help. Thank you.

My Code:

import csv
import requests
from bs4 import BeautifulSoup

## SIDs.csv contains ids in col2 based on which the 'url' variable pulls the respective data
SIDFile = open('SIDs.csv')
SIDReader = csv.reader(SIDFile)
SID = list(SIDReader)

SqID_data = []

#create and open output file
with open('output.csv','a', newline='') as csv_h:
    fields = \
    [
        "ID",
        "Financial Year",
        "Total Income",
        "Total Expenses",
        "Tax Expense",
        "Net Profit"
    ]

    for row in SID:
        col1,col2 = row
        SID ="%s" % (col2)

        url = requests.get("http://.......")
        soup = BeautifulSoup(url.text, "lxml")

        fy = soup.findAll('td',{'class':'tablehead'})
        titles = soup.findAll('td',{'class':'t1'})
        values = soup.findAll('td',{'class':'t0'})

        if titles:
            data = {}
            for title in titles:
                name = title.find("td", class_ = "t1")
            data["ID"] = SID
            data["Financial Year"] = fy[0].string.strip()
            data["Total Income"] = values[0].string.strip()
            data["Total Expenses"] = values[1].string.strip()
            data["Tax Expense"] = values[2].string.strip()
            data["Net Profit"] = values[3].string.strip()
            SqID_data.append(data)

    #Prepare CSV writer.
    writer = csv.DictWriter\
    (
        csv_h,
        fields,
        quoting        = csv.QUOTE_ALL,
        extrasaction   = "ignore",
        dialect        = "excel",
        lineterminator = "\n",
    )
    writer.writeheader()
    writer.writerows(SqID_data)
    print("write rows complete")

Excerpt of HTML being processed:

<p>
<TABLE border=0 cellspacing=1 cellpadding=6 align=center class="vTable">
   <TR>
    <TD class=tablehead>Financial Year</TD>
    <TD class=t1>01-Apr-2015 To 31-Mar-2016</TD>
   </TR>
</TABLE>
</p>

<p>
<br>
<table cellpadding=3 cellspacing=1 class=vTable>
<TR>
    <TD class=t1><b>Total income from operations (net) ( a + b)</b></td>
    <TD class=t0 nowrap>675529.00</td>
</tr>
<TR>
    <TD class=t1><b>Total expenses</b></td>
    <TD class=t0 nowrap>446577.00</td>
</tr>
<TR>
    <TD class=t1>Tax expense</td>
    <TD class=t0 nowrap>71708.00</td>
</tr>
<TR>
    <TD class=t1><b>Net Profit / (Loss)</b></td>
    <TD class=t0 nowrap>157621</td>
</tr>
</table>
</p>

SIDs.csv (no header row)

1,A0001
2,A0002
3,A0003

Expected Output: output.csv (create header row)

ID,Financial Year,Total Income,Total Expenses,Tax Expense,Net Profit,OtherFieldsAsAndWhenFound
A001,01-Apr-2015 To 31-Mar-2016,675529.00,446577.00,71708.00,157621.00
A002,....
A003,....

I would recommend looking at pandas.read_html for parsing your web data. On your sample data (held in the string s below), this gives you:

import pandas as pd
# s holds the HTML excerpt from the question as a string
tables = pd.read_html(s, index_col=0)
tables[0]
Out[11]: 
                                         1
0                                         
Financial Year  01-Apr-2015 To 31-Mar-2016

tables[1]
                                                  1
0                                                  
Total income from operations (net) ( a + b)  675529
Total expenses                               446577
Tax expense                                   71708
Net Profit / (Loss)                          157621
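
As a rough sketch of how these two tables could be flattened into a single record per ID, with the column names taken from the labels in the HTML rather than hard-coded, something like the following might work. The names SAMPLE_HTML and tables_to_record are made up for illustration, the HTML is a trimmed copy of the excerpt in the question, and the code assumes every table on the page is a two-column label/value table.

```python
from io import StringIO

import pandas as pd

# Trimmed copy of the HTML excerpt from the question (illustrative only).
SAMPLE_HTML = """
<table class="vTable">
  <tr><td class="tablehead">Financial Year</td>
      <td class="t1">01-Apr-2015 To 31-Mar-2016</td></tr>
</table>
<table class="vTable">
  <tr><td class="t1"><b>Total expenses</b></td><td class="t0">446577.00</td></tr>
  <tr><td class="t1">Tax expense</td><td class="t0">71708.00</td></tr>
</table>
"""


def tables_to_record(html, sid):
    """Flatten every two-column label/value table into one dict keyed by label."""
    record = {"ID": sid}
    # Wrapping the markup in StringIO avoids the warning that newer pandas
    # versions emit when a literal HTML string is passed to read_html.
    for table in pd.read_html(StringIO(html), index_col=0):
        # With index_col=0 the labels are the index and the values are
        # the single remaining column.
        for label, value in table.iloc[:, 0].items():
            record[str(label).strip()] = value
    return record


record = tables_to_record(SAMPLE_HTML, "A0001")
print(record)
```

Because the keys come from whatever labels the page contains, an extra row in the HTML simply becomes an extra key in the record. Note that pandas converts the numeric column to floats, so the values lose their original string formatting.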

You can then do whatever data manipulation you need (adding IDs, etc.) using pandas functions, and then export with DataFrame.to_csv.
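
For the export step, the per-ID writing and rerun behaviour the question asks about can be sketched with the standard csv module instead of DataFrame.to_csv (a deliberate swap, to keep the example dependency-free). The helper names read_header and append_record are hypothetical. The sketch assumes each record is a plain dict, writes the header only when the file is new or empty, and rewrites the file with a wider header if a record brings fields that were not seen before.

```python
import csv
import os


def read_header(path):
    """Return the header row of an existing CSV file, or [] if there is none."""
    if not os.path.exists(path) or os.path.getsize(path) == 0:
        return []
    with open(path, newline="") as fh:
        return next(csv.reader(fh), [])


def append_record(path, record):
    """Write one record immediately, creating or widening the header as needed."""
    fieldnames = read_header(path)
    new_fields = [key for key in record if key not in fieldnames]

    if not new_fields:
        # The header already covers this record: a plain append is enough.
        with open(path, "a", newline="") as fh:
            csv.DictWriter(fh, fieldnames=fieldnames, restval="").writerow(record)
        return

    # The file is new, or this record has unseen fields: load any existing
    # rows and rewrite everything under the extended header.
    existing_rows = []
    if fieldnames:
        with open(path, newline="") as fh:
            existing_rows = list(csv.DictReader(fh))
    fieldnames = fieldnames + new_fields
    with open(path, "w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=fieldnames, restval="")
        writer.writeheader()
        writer.writerows(existing_rows)
        writer.writerow(record)
```

Calling append_record("output.csv", record) once per ID inside the loop means each ID is on disk before the next one is fetched, and a rerun continues on the next line without repeating the header. Rewriting the whole file when new fields appear is simple but slow for large outputs, so it is only a starting point.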

