How to parse multiple tables using BeautifulSoup and save them to a csv file

I've got a program that is almost finished; it is only missing this last part, where I'm struggling. I need to scrape the tables in the contentHolder div from a lot of webpages (to see an example, go to http://www.pa.org.mt/page.aspx?n=63C70E73&CaseType=PA, fill in the case number with 03732 and the case year with 16, then click the first submit button) and write them to a csv file, to get something like this: Case Status, Status Available, Case Number, PA/03732/16, Location of development: 40 ... Something like this for all the tables on a webpage, and for a lot of webpages. I wrote some code trying to do this, but it is not working. When I run it, it produces this output in the csv file: https://gyazo.com/6557ac08ad5613a24b5432bfd9e4f2e6 and it doesn't even finish all the pages, because it returns an error in the middle:

Traceback (most recent call last):
  File "C:\PROJECT\pdfs\converterpluspa.py", line 93, in <module>
    csv.writer(f).writerow(answer)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' in position 0: ordinal not in range(128)
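For context, u'\u201c' is a left curly quotation mark coming from the scraped page, and Python 2's default ASCII codec cannot represent any character above U+007F, so the bare encode already reproduces the error:

>>> u'\u201c'.encode('ascii')
UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' in position 0: ordinal not in range(128)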

Here's my program's entire code so far:

import shlex
import subprocess
import os
import platform
from bs4 import BeautifulSoup
import re
import csv
import pickle
import requests
from robobrowser import RoboBrowser
import codecs

def rename_files():
    file_list = os.listdir(r"C:\PROJECT\pdfs")
    print(file_list)
    saved_path = os.getcwd()
    print('Current working directory is ' + saved_path)
    os.chdir(r'C:\PROJECT\pdfs')
    for file_name in file_list:
        # strip spaces from the file name (Python 2 str.translate)
        os.rename(file_name, file_name.translate(None, " "))
    os.chdir(saved_path)
rename_files()

def run(command):
    if platform.system() != 'Windows':
        args = shlex.split(command)
    else:
        args = command
    s = subprocess.Popen(args,
                         stdout=subprocess.PIPE,
                         stderr=subprocess.PIPE)
    output, errors = s.communicate()
    return s.returncode == 0, output, errors

# Change this to your PDF file base directory
base_directory = 'C:\\PROJECT\\pdfs'
if not os.path.isdir(base_directory):
    print "%s is not a directory" % base_directory
    exit(1)
# Change this to your pdf2txt.py executable location
bin_path = 'C:\\Python27\\pdfminer-20140328\\tools\\pdf2txt.py'
if not os.path.isfile(bin_path):
    print "Could not find %s" % bin_path
    exit(1)
for dir_path, dir_name_list, file_name_list in os.walk(base_directory):
    for file_name in file_name_list:
        # If this is not a PDF file
        if not file_name.endswith('.pdf'):
            # Skip it
            continue
        file_path = os.path.join(dir_path, file_name)
        # Convert your PDF to HTML here
        args = (bin_path, file_name, file_path)
        success, output, errors = run("python %s -o %s.html %s " %args)
        if not success:
            print "Could not convert %s to HTML" % file_path
            print "%s" % errors
htmls_path = 'C:\\PROJECT'
with open ('score.csv', 'w') as f:
    writer = csv.writer(f)
    for dir_path, dir_name_list, file_name_list in os.walk(htmls_path):
        for file_name in file_name_list:
            if not file_name.endswith('.html'):
                continue
            with open(file_name) as markup:
                soup = BeautifulSoup(markup.read())
                text = soup.get_text()
                match = re.findall("PA/(\S*)", text)  # to remove the names that appear, drop the last (\S*); to add them back, re-add it (there was originally a \s* before it)
                print(match)
                writer.writerow(match)
                for item in match:
                    data = item.split('/')
                    case_number = data[0]
                    case_year = data[1]

                browser = RoboBrowser()
                browser.open('http://www.pa.org.mt/page.aspx?n=63C70E73&CaseType=PA')
                form = browser.get_forms()[0]  # Get the first form on the page
                form['ctl00$PageContent$ContentControl$ctl00$txtCaseNo'].value = case_number
                form['ctl00$PageContent$ContentControl$ctl00$txtCaseYear'].value = case_year

                browser.submit_form(form, submit=form['ctl00$PageContent$ContentControl$ctl00$btnSubmit'])

                # Use BeautifulSoup to parse this data
                answer = browser.response.text
                print(answer)
                soup = BeautifulSoup(answer)
                #print soup.prettify()
                status = soup.select('#Table1')
                print (status)
                with codecs.open('file_output.csv', 'a', encoding='utf-8') as f:
                    for tag in soup.select("#Table1"):
                        csv.writer(f).writerow(answer)

EDIT: I tried changing the last line to csv.writer(f).writerow(answer.encode("utf-8")), but it didn't work; it printed another error message:

Traceback (most recent call last):
  File "C:\PROJECT\pdfs\converterpluspa.py", line 93, in <module>
    csv.writer(f).writerow(answer.encode("utf-8"))
  File "C:\Python27\lib\codecs.py", line 706, in write
    return self.writer.write(data)
  File "C:\Python27\lib\codecs.py", line 369, in write
    data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 25496: ordinal not in range(128)

And nothing changed in the final csv file.
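For reference, this second error comes from the codecs layer rather than the csv module: a file opened with codecs.open(..., encoding='utf-8') expects unicode objects, so handing it a UTF-8-encoded byte string makes Python 2 first decode those bytes back with the default ASCII codec, which fails on the first non-ASCII byte (0xe2). A minimal repro, assuming a scratch file name:

import codecs
f = codecs.open('scratch.csv', 'w', encoding='utf-8')
f.write(u'\u201c'.encode('utf-8'))  # UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2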

You need to encode your output using UTF-8. Change the last line to:

csv.writer(f, encoding="utf-8").writerow(answer.encode("utf-8"))

Also change the import from import csv to import unicodecsv as csv.
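Putting it together, here is a minimal sketch of how the end of the script could then look. It assumes unicodecsv is installed (pip install unicodecsv), and the per-row cell extraction is a guess at the intended output, not part of the original code. The file is opened with the built-in open in binary mode instead of codecs.open, because unicodecsv writes encoded bytes itself:

import unicodecsv as csv  # drop-in replacement for the stdlib csv module on Python 2

with open('file_output.csv', 'ab') as out:  # binary append: unicodecsv handles the encoding
    writer = csv.writer(out, encoding='utf-8')
    for table in soup.select('#Table1'):
        for row in table.find_all('tr'):
            # one csv row per <tr>, one cell per <td>/<th> (assumed table layout)
            writer.writerow([cell.get_text(strip=True) for cell in row.find_all(['td', 'th'])])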

