BeautifulSoup HTML table parsing, table within a table

I am using BeautifulSoup to parse data out of an HTML table. Usually the HTML runs like this:

        <tr><td width="35%"style="font-style: italic;">Owner:</td><td>MS 'CEMSOL' Schifffahrtsgesellschaft mbH & Co. KG</td></tr>
        <tr><td width="35%"style="font-style: italic;">Connecting District:</td><td>HAMBURG (HBR)</td></tr>
        <tr><td width="35%"style="font-style: italic;">Flag:</td><td>CYPRUS</td></tr>
        <tr><td width="35%"style="font-style: italic;">Port of Registry:</td><td>LIMASSOL</td></tr>
    </tbody></table>

But then there are a few sections like this:

<table class="table1"><thead><tr><th style="width: 140px" class="veristarTableUHeader">Classification</th><th class="top"><a href="#top">Top</a></th></tr></thead><tbody><tr><td width="35%" valign="top"style="font-style: italic;">Main Class Symbols:</td><td>I<img src='/asms2-portlet/images/particulars/croixsoul.gif'/> Hull&nbsp;&nbsp;&nbsp;<img src='/asms2-portlet/images/particulars/croixsoul.gif'/>Mach</td></tr>
    <tr><td width="35%" valign="top"style="font-style: italic;">Service Notations:</td><td valign="top">
        <table class="empty">
            <tr>

                <td>General cargo ship /cement carrier</td>
            </tr>
        </table>
    </td></tr>
    <tr><td width="35%" valign="top"style="font-style: italic;">Navigation Notations:</td><td>Unrestricted navigation<br></td></tr>
    <tr><td width="35%" valign="top"style="font-style: italic;">Additional Class Notation(s):</td><td><img src='/asms2-portlet/images/particulars/croixsoul.gif'/> AUT-UMS , <img src='/asms2-portlet/images/particulars/croixsoul.gif'/> ICE CLASS IA</td></tr>
    <tr><td width="35%" valign="top"style="font-style: italic;">Machinery:</td><td valign="top">
        <table class="empty">
            <tr>
                <td width="20"><img src='/asms2-portlet/images/particulars/croixsoul.gif'/></td>
                <td>MACH</td>
            </tr>
        </table>
    </td></tr>

</tbody></table>

Source .txt: ShipData

The problem is that there is an extra <tr> tag which messes up the result by doubling one table column: "General cargo ship /cement carrier" is added to the Values list twice, because there is a new table inside the tr.

['12536D', '9180401', 'CEMSOL', 'C4WH2', 'General cargo ship', "MS 'CEMSOL' Schifffahrtsgesellschaft mbH & Co. KG", 'HAMBURG (HBR)', 'CYPRUS', 'LIMASSOL', 'I Hull\xc2\xa0\xc2\xa0\xc2\xa0Mach', 'General cargo ship /cement carrier', 'General cargo ship /cement carrier', 'Unrestricted navigation', ' AUT-UMS , ICE CLASS IA', 'MACH']
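
The duplication is easy to reproduce in isolation: find_all('td') on a row also descends into the nested <table class="empty"> and returns its inner cell, whereas passing recursive=False limits the search to the row's direct children. A minimal sketch (a standalone snippet, not part of the script below; the HTML fragment is trimmed from the page above):

from bs4 import BeautifulSoup

html = """<tr><td style="font-style: italic;">Service Notations:</td><td>
    <table class="empty"><tr><td>General cargo ship /cement carrier</td></tr></table>
</td></tr>"""

tr = BeautifulSoup(html, "html.parser").tr

# find_all('td') also returns the <td> inside the nested table -> 3 cells,
# so "General cargo ship /cement carrier" shows up twice for this row
print(len(tr.find_all('td')))

# recursive=False only looks at the row's direct children -> 2 cells
print(len(tr.find_all('td', recursive=False)))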

My code is below:

# -*- coding: utf-8 -*-

import csv
# from urllib import urlopen
import urllib2
from bs4 import BeautifulSoup
import re
import time
import socket
fm = open('ShipURL.txt', 'r')
Shiplinks = fm.readlines()  # Reads all URL lines from the file.
check = ""
columnHeaders = ""
with open('ShipData.csv', 'wb') as f:  # Creates an empty CSV file to write the values into.
    writer = csv.writer(f)
    for line in Shiplinks:
        time.sleep(2)
        website = re.findall(r'(https?://\S+)', line)
        website = "".join(str(x) for x in website)
        print website
        if check != website:
            if website != "":
                check = website
                shipUrl = website
                while True:
                    try:
                        shipPage = urllib2.urlopen(shipUrl, timeout=1.5)
                    except urllib2.URLError:
                        continue
                    except socket.timeout:
                        print "socket timeout!"
                    break
                soup = BeautifulSoup(shipPage, "html.parser")  # Read the web page HTML
                table = soup.find_all("table", {"class": "table1"})  # Finds table with class table1
                List = []
                columnRow = ""
                valueRow = ""
                Values = []
                for mytable in table:                                   #Loops tables with class table1
                    table_body = mytable.find('tbody')                  #Finds tbody section in table
                    try:                                                #If tbody exists
                        rows = table_body.find_all('tr')                #Finds all rows
                        for tr in rows:                                 #Loops rows
                            cols = tr.find_all('td')                    #Finds the columns
                            i = 1                                       #Variable to control the lines
                            for td in cols:                             #Loops the columns
            ##                  print td.text                           #Displays the output
                                co = td.text                            #Saves the column to a variable
            ##                    writer.writerow([co])                 Writes the variable in CSV file row
                                if i == 1:                              #Checks the control variable, if it equals to 1
                                    if td.text[-1] == ":":              # - : adds a ,
                                        columnRow += td.text.strip(":") + " , "  # One string
                                        List.append(td.text.encode("utf-8"))                #.. takes the column value and assigns it to a list called 'List' and..
                                        i += 1                                #..Increments i by one

                                else:
                                    # Strips the newlines and appends the value plus a comma to the string
                                    valueRow += td.text.strip("\n") + " , "
                                    Values.append(td.text.strip("\n").encode("utf-8"))              #Takes the second columns value and assigns it to a list called Values
                                #print List                             #Checking stuff
                                #print Values                           #Checking stuff
                    except:
                        print "no tbody"
                # Prints headings and values with row change.
                print columnRow.strip(",")
                print "\n"
                print valueRow.strip(",")
                # Encoding was causing trouble again
                # Writes the column headers as the first row and the values as the second
                if not columnHeaders:
                    writer.writerow(List)
                columnHeaders = columnRow
                writer.writerow(Values)
                #
fm.close()

I know this does not exactly help you avoid having to parse the extra table tag, but it sounds like you just need a solution so that you stop getting the value twice in your list!

If I were you, I would use a set, which only keeps one copy of each value.

A set in Python 2.7 and later can be written with curly braces, but an empty set always has to be created with set():

List = set()
Values = set()

Note: the same set() call also works on versions before 2.7:

List = set([])
Values = set([])

I would probably rename them to make the intent clearer:

Set        = set()
Set_values = set()

After that, you can go ahead and change the last part of your code to fix the problem. Where you currently have:

if td.text[-1] == ":":
    columnRow += td.text.strip(":") + " , "
    List.append(td.text.encode("utf-8"))
    i += 1
else:
    valueRow += td.text.strip("\n") + " , "
    Values.append(td.text.strip("\n").encode("utf-8"))

I would use this instead:

if td.text[-1] == ":":
    columnRow += td.text.strip(":") + " , "
    Set.add(td.text.encode("utf-8"))   # <--- Here is the change
    i += 1
else:
    valueRow += td.text.strip("\n") + " , "
    Set_values.add(td.text.strip("\n").encode("utf-8"))  # <--- Here is another change

Doing this means you will end up with only one copy of each value when you write to the CSV.

If the CSV writer does not like sets and prefers lists, you can turn the sets back into lists at the end by doing:

my_list  = list(Set)
my_list2 = list(Set_values)
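
A quick standalone sketch of how that behaves (reusing the Set_values name from above). One thing to keep in mind is that a plain set does not preserve insertion order, so the alignment between the header row and the value row may need extra care when the rows are finally written:

Set_values = set()

Set_values.add("General cargo ship /cement carrier")
Set_values.add("General cargo ship /cement carrier")  # the duplicate is silently dropped
Set_values.add("Unrestricted navigation")

my_list2 = list(Set_values)   # csv.writer expects a sequence, not a set
print(len(my_list2))          # 2 -- only one copy of the duplicated value survives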

Solved it with an if clause: t = 0 is set before the column loop and td.text is appended to the list only while t < 1, which guarantees that only one value element is added per row.

# -*- coding: utf-8 -*-

import csv
# from urllib import urlopen
import urllib2
from bs4 import BeautifulSoup
import re
import time
import socket
fm = open('ShipURL.txt', 'r')
Shiplinks = fm.readlines()  # Reads all URL lines from the file.
check = ""
columnHeaders = ""
with open('ShipData.csv', 'wb') as f:  # Creates an empty CSV file to write the values into.
    writer = csv.writer(f)
    for line in Shiplinks:
        time.sleep(2)
        website = re.findall(r'(https?://\S+)', line)
        website = "".join(str(x) for x in website)
        print website
        if check != website:
            if website != "":
                check = website
                shipUrl = website
                while True:
                    try:
                        shipPage = urllib2.urlopen(shipUrl, timeout=1.5)
                    except urllib2.URLError:
                        continue
                    except socket.timeout:
                        print "socket timeout!"
                    break
                soup = BeautifulSoup(shipPage, "html.parser")  # Read the web page HTML
                table = soup.find_all("table", {"class": "table1"})  # Finds table with class table1
                List = []
                columnRow = ""
                valueRow = ""
                Values = []
                for mytable in table:                                   #Loops tables with class table1
                    table_body = mytable.find('tbody')                  #Finds tbody section in table
                    try:                                                #If tbody exists
                        rows = table_body.find_all('tr')                #Finds all rows
                        for tr in rows:                                 #Loops rows
                            cols = tr.find_all('td')                    #Finds the columns
                            i = 1                                       #Variable to control the lines
                            t = 0
                            for td in cols:                             #Loops the columns
            ##                  print td.text                           #Displays the output
                                co = td.text                           #Saves the column to a variable
            ##                    writer.writerow([co])                 Writes the variable in CSV file row
                                if i == 1:                              #Checks the control variable, if it equals to 1
                                    if td.text[-1] == ":":              # - : adds a ,
                                        columnRow += td.text.strip(":") + " , "  # One string
                                        List.append(td.text.encode("utf-8"))                #.. takes the column value and assigns it to a list called 'List' and..
                                        i += 1                                #..Increments i by one

                                else:
                                    # Strips the newlines and appends the value plus a comma to the string
                                    if t<1:
                                        valueRow += td.text.strip("\n") + " , "
                                        Values.append(td.text.strip("\n").encode("utf-8"))              #Takes the second columns value and assigns it to a list called Values
                                        t += 1
                                #print List                             #Checking stuff
                                #print Values                           #Checking stuff
                    except:
                        print "no tbody"
                # Prints headings and values with row change.
                print columnRow.strip(",")
                print "\n"
                print valueRow.strip(",")
                # Encoding was causing trouble again
                # Writes the column headers as the first row and the values as the second
                if not columnHeaders:
                    writer.writerow(List)
                columnHeaders = columnRow
                writer.writerow(Values)
                #
fm.close()
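
For reference, another way to get the same effect without the t counter would be to change only the column lookup so that cells of nested tables are never collected in the first place. This is untested against the full page, but it follows from the recursive=False behaviour shown in the snippet near the top:

cols = tr.find_all('td', recursive=False)   # instead of tr.find_all('td')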
