
Python: parsing Beautiful Soup data to CSV

I have written code in Python 3 to parse an HTML table. I have a few issues with it:

  1. My CSV output file headers are not generated from the HTML (tag: td, class: t1) by my code on the first run, when the output file is created.
  2. If the incoming HTML table has a few additional fields (tag: td, class: t1), my code currently cannot capture them and create additional headers in the CSV output file.
  3. The data is not written to the output CSV file until ALL the IDs (A001, A002, A003...) from my input file are processed. I want to write to the output CSV file as soon as each ID from my input file has been processed (i.e. A001 should be written to the CSV before A002 is processed).
  4. Whenever I rerun the code, the data does not begin on the next line of the output CSV.

Being a noob, I am sure my code is very rudimentary and that there is a better way to do this. I would like to learn to write it better and fix the issues above as well.

I need advice and guidance; please help. Thank you.

My Code:

import csv
import requests
from bs4 import BeautifulSoup

## SIDs.csv contains ids in col2 based on which the 'url' variable pulls the respective data
SIDFile = open('SIDs.csv')
SIDReader = csv.reader(SIDFile)
SID = list(SIDReader)

SqID_data = []

#create and open output file
with open('output.csv','a', newline='') as csv_h:
    fields = \
    [
        "ID",
        "Financial Year",
        "Total Income",
        "Total Expenses",
        "Tax Expense",
        "Net Profit"
    ]

    for row in SID:
        col1,col2 = row
        SID ="%s" % (col2)

        url = requests.get("http://.......")
        soup = BeautifulSoup(url.text, "lxml")

        fy = soup.findAll('td',{'class':'tablehead'})
        titles = soup.findAll('td',{'class':'t1'})
        values = soup.findAll('td',{'class':'t0'})

        if titles:
            data = {}
            for title in titles:
                name = title.find("td", class_ = "t1")
            data["ID"] = SID
            data["Financial Year"] = fy[0].string.strip()
            data["Total Income"] = values[0].string.strip()
            data["Total Expenses"] = values[1].string.strip()
            data["Tax Expense"] = values[2].string.strip()
            data["Net Profit"] = values[3].string.strip()
            SqID_data.append(data)

    #Prepare CSV writer.
    writer = csv.DictWriter\
    (
        csv_h,
        fields,
        quoting        = csv.QUOTE_ALL,
        extrasaction   = "ignore",
        dialect        = "excel",
        lineterminator = "\n",
    )
    writer.writeheader()
    writer.writerows(SqID_data)
    print("write rows complete")

Excerpt of HTML being processed:

<p>
<TABLE border=0 cellspacing=1 cellpadding=6 align=center class="vTable">
   <TR>
    <TD class=tablehead>Financial Year</TD>
    <TD class=t1>01-Apr-2015 To 31-Mar-2016</TD>
   </TR>
</TABLE>
</p>

<p>
<br>
<table cellpadding=3 cellspacing=1 class=vTable>
<TR>
    <TD class=t1><b>Total income from operations (net) ( a + b)</b></td>
    <TD class=t0 nowrap>675529.00</td>
</tr>
<TR>
    <TD class=t1><b>Total expenses</b></td>
    <TD class=t0 nowrap>446577.00</td>
</tr>
<TR>
    <TD class=t1>Tax expense</td>
    <TD class=t0 nowrap>71708.00</td>
</tr>
<TR>
    <TD class=t1><b>Net Profit / (Loss)</b></td>
    <TD class=t0 nowrap>157621</td>
</tr>
</table>
</p>

SIDs.csv (no header row):

1,A0001
2,A0002
3,A0003

Expected Output: output.csv (create header row)

ID,Financial Year,Total Income,Total Expenses,Tax Expense,Net Profit,OtherFieldsAsAndWhenFound
A001,01-Apr-2015 To 31-Mar-2016,675529.00,446577.00,71708.00,157621.00
A002,....
A003,....
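
The four issues listed above mostly concern how the CSV is written rather than how the HTML is parsed. A minimal sketch using only the standard csv module, assuming each page has already been reduced to a dict of title/value pairs; `read_header` and `append_record` are hypothetical helper names, not part of the original code:

```python
import csv
import os


def read_header(path):
    """Return the header row of an existing CSV file, or [] if there is none."""
    if not os.path.exists(path) or os.path.getsize(path) == 0:
        return []
    with open(path, newline="") as fh:
        return next(csv.reader(fh), [])


def append_record(path, record):
    """Append one dict to the CSV, creating or widening the header as needed."""
    header = read_header(path)
    missing = [key for key in record if key not in header]

    if header and not missing:
        # Header already covers this record: plain append below existing rows.
        with open(path, "a", newline="") as fh:
            csv.DictWriter(fh, fieldnames=header).writerow(record)
        return

    # First run, or the record has titles not seen before:
    # rewrite the file with the wider header, keeping the old rows.
    old_rows = []
    if header:
        with open(path, newline="") as fh:
            old_rows = list(csv.DictReader(fh))
    with open(path, "w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=header + missing, restval="")
        writer.writeheader()
        writer.writerows(old_rows)
        writer.writerow(record)
```

Calling `append_record("output.csv", record)` inside the per-ID loop writes each ID as soon as it is processed. Reruns append below the existing rows, and the file is rewritten only when a page introduces a title that is not yet in the header.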

I would recommend looking at pandas.read_html for parsing your web data; on your sample data this gives you:

import pandas as pd
# s is assumed to be a string holding the HTML excerpt above
tables = pd.read_html(s, index_col=0)
tables[0]
Out[11]: 
                                         1
0                                         
Financial Year  01-Apr-2015 To 31-Mar-2016

tables[1]
                                                  1
0                                                  
Total income from operations (net) ( a + b)  675529
Total expenses                               446577
Tax expense                                   71708
Net Profit / (Loss)                          157621

You can then do whatever data manipulation you need (adding IDs etc.) using pandas functions, and then export with DataFrame.to_csv.
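
The steps above could be combined roughly as follows. This is a sketch under assumptions: every table on the page has a label column and a single value column, as in the excerpt; `tables_to_row` and `append_to_csv` are illustrative names rather than pandas functions; and `read_html` needs an HTML parser such as lxml installed.

```python
import io
import os

import pandas as pd


def tables_to_row(html, sid):
    """Flatten all two-column tables in `html` into a single-row DataFrame."""
    # index_col=0 turns the label cells into the index, as in the session above.
    # The string is wrapped in StringIO because recent pandas versions
    # deprecate passing literal HTML directly.
    tables = pd.read_html(io.StringIO(html), index_col=0)
    record = {"ID": sid}
    for table in tables:
        # The one remaining column holds the values: map label -> value.
        record.update(table.iloc[:, 0].to_dict())
    return pd.DataFrame([record])


def append_to_csv(row, path="output.csv"):
    """Append `row`, writing the header only when the file is first created."""
    is_new = not os.path.exists(path)
    row.to_csv(path, mode="a", header=is_new, index=False)
```

Calling `append_to_csv(tables_to_row(page_html, sid))` once per ID writes each row as soon as that ID is processed. Appending this way assumes every page yields the same labels in the same order; pages with extra fields would need the header reconciled first.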


 