Parsing KML with Beautiful Soup

I am having trouble parsing a KML file (XML) using Beautiful Soup. This snippet of code should iterate a non-zero number of times at every level. On my sample, the lxml parser returns 2 tables and the xml parser returns 0, while the number should be 3.

from bs4 import BeautifulSoup

url = "sample.kml"

with open(url,'r') as page:

    soup = BeautifulSoup(page, "lxml")

    tables = soup.find_all('table')
    print(len(tables))

    for table in tables:    
        rows = table.find_all('tr')

        for row in rows:    
            cols = row.find_all('td')

This first sample script returns 2 tables instead of 3 using lxml, and 0 with the xml parser.

    soup = BeautifulSoup(page, "xml")

    placemark = soup.find_all('Placemark')
    print(len(placemark))

    for place in placemark:

        tables = place.find_all('table')
        print(len(tables))

        for table in tables:    
            rows = table.find_all('tr')

            for row in rows:    
                cols = row.find_all('td')

Traversing the tree, I originally started by searching for tables, but len(tables) returned 2, which I know to be wrong; it should be about 92,000. So I found another tag to start stepping through the tree from (it returned the correct count) and then tried to find the rows and columns within each of those tags. They all returned zero, which surprised me. I played around with different parsers and eventually determined that xml was the appropriate one, but I still could not find the correct number of tables, even though I can find them using re.search or by searching in Sublime Text. This led me to check whether the tables might be encapsulated in some way, but to no avail. I am quite stuck and can't seem to find a way to access the 92,000 tables using the find_all("TAG") method. Any suggestions?

Sample KML

<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2" xmlns:gx="http://www.google.com/kml/ext/2.2" xmlns:kml="http://www.opengis.net/kml/2.2" xmlns:atom="http://www.w3.org/2005/Atom">
<Document id="laaSECS" xsi:schemaLocation="http://www.opengis.net/kml/2.2 http://schemas.opengis.net/kml/2.2.0/ogckml22.xsd http://www.google.com/kml/ext/2.2 http://code.google.com/apis/kml/schema/kml22gx.xsd">
    <name>laaSECS</name>
    <Snippet maxLines="0"></Snippet>
    <Style id="PolyStyle00">
        <LabelStyle>
            <color>00000000</color>
            <scale>0</scale>
        </LabelStyle>
        <LineStyle>
            <color>ff7f5555</color>
            <width>0.2</width>
        </LineStyle>
        <PolyStyle>
            <color>ffc5d9fa</color>
            <fill>0</fill>
        </PolyStyle>
    </Style>
    <Style id="PolyStyle000">
        <LabelStyle>
            <color>00000000</color>
            <scale>0</scale>
        </LabelStyle>
        <LineStyle>
            <color>ff7f5555</color>
            <width>0.2</width>
        </LineStyle>
        <PolyStyle>
            <color>ffc5d9fa</color>
            <fill>0</fill>
        </PolyStyle>
    </Style>
    <StyleMap id="PolyStyle001">
        <Pair>
            <key>normal</key>
            <styleUrl>#PolyStyle00</styleUrl>
        </Pair>
        <Pair>
            <key>highlight</key>
            <styleUrl>#PolyStyle000</styleUrl>
        </Pair>
    </StyleMap>
    <Folder id="FeatureLayer0">
        <name>laaSECS</name>
        <Snippet maxLines="0"></Snippet>
        <Placemark id="ID_00000">
            <name>AL</name>
            <Snippet maxLines="0"></Snippet>
            <description><![CDATA[<html xmlns:fo="http://www.w3.org/1999/XSL/Format" xmlns:msxsl="urn:schemas-microsoft-com:xslt">

<head>

<META http-equiv="Content-Type" content="text/html">

<meta http-equiv="content-type" content="text/html; charset=UTF-8">

</head>

<body style="margin:0px 0px 0px 0px;overflow:auto;background:#FFFFFF;">

<table style="font-family:Arial,Verdana,Times;font-size:12px;text-align:left;width:100%;border-collapse:collapse;padding:3px 3px 3px 3px">

<tr style="text-align:center;font-weight:bold;background:#9CBCE2">

<td>AL</td>

</tr>

<tr>

<td>

<table style="font-family:Arial,Verdana,Times;font-size:12px;text-align:left;width:100%;border-spacing:0px; padding:3px 3px 3px 3px">

<tr>

<td>FID</td>

<td>0</td>

</tr>

<tr bgcolor="#D4E4F3">

<td>STATE</td>

<td>AL</td>

</tr>

<tr>

<td>MER</td>

<td>25</td>

</tr>

<tr bgcolor="#D4E4F3">

<td>TWP</td>

<td>22</td>

</tr>

<tr>

<td>TDIR</td>

<td>N</td>

</tr>

<tr bgcolor="#D4E4F3">

<td>RNG</td>

<td>4</td>

</tr>

<tr>

<td>RDIR</td>

<td>W</td>

</tr>

<tr bgcolor="#D4E4F3">

<td>SEC</td>

<td>24</td>

</tr>

<tr>

<td>MODDATE</td>

<td>20050311</td>

</tr>

<tr bgcolor="#D4E4F3">

<td>DATUM</td>

<td>NAD27</td>

</tr>

<tr>

<td>SOURCE</td>

<td>WhiteStar</td>

</tr>

<tr bgcolor="#D4E4F3">

<td>MTR</td>

<td>25 22.0N  4.0W</td>

</tr>

</table>

</td>

</tr>

</table>

</body>

</html>]]></description>
            <styleUrl>#PolyStyle001</styleUrl>
            <MultiGeometry>
                <Polygon>
                    <outerBoundaryIs>
                        <LinearRing>
                            <coordinates>
                                -88.35570867858526,32.86011073571817,0 -88.35570870147141,32.86253443065814,0 -88.35597594524225,32.86011537400984,0 -88.35570867858526,32.86011073571817,0 
                            </coordinates>
                        </LinearRing>
                    </outerBoundaryIs>
                </Polygon>
            </MultiGeometry>
        </Placemark>
        <Placemark id="ID_00001">
            <name>AL</name>
            <Snippet maxLines="0"></Snippet>
            <description><![CDATA[<html xmlns:fo="http://www.w3.org/1999/XSL/Format" xmlns:msxsl="urn:schemas-microsoft-com:xslt">

<head>

<META http-equiv="Content-Type" content="text/html">

<meta http-equiv="content-type" content="text/html; charset=UTF-8">

</head>

<body style="margin:0px 0px 0px 0px;overflow:auto;background:#FFFFFF;">

<table style="font-family:Arial,Verdana,Times;font-size:12px;text-align:left;width:100%;border-collapse:collapse;padding:3px 3px 3px 3px">

<tr style="text-align:center;font-weight:bold;background:#9CBCE2">

<td>AL</td>

</tr>

<tr>

<td>

<table style="font-family:Arial,Verdana,Times;font-size:12px;text-align:left;width:100%;border-spacing:0px; padding:3px 3px 3px 3px">

<tr>

<td>FID</td>

<td>1</td>

</tr>

<tr bgcolor="#D4E4F3">

<td>STATE</td>

<td>AL</td>

</tr>

<tr>

<td>MER</td>

<td>25</td>

</tr>

<tr bgcolor="#D4E4F3">

<td>TWP</td>

<td>22</td>

</tr>

<tr>

<td>TDIR</td>

<td>N</td>

</tr>

<tr bgcolor="#D4E4F3">

<td>RNG</td>

<td>4</td>

</tr>

<tr>

<td>RDIR</td>

<td>W</td>

</tr>

<tr bgcolor="#D4E4F3">

<td>SEC</td>

<td>25</td>

</tr>

<tr>

<td>MODDATE</td>

<td>20050311</td>

</tr>

<tr bgcolor="#D4E4F3">

<td>DATUM</td>

<td>NAD27</td>

</tr>

<tr>

<td>SOURCE</td>

<td>WhiteStar</td>

</tr>

<tr bgcolor="#D4E4F3">

<td>MTR</td>

<td>25 22.0N  4.0W</td>

</tr>

</table>

</td>

</tr>

</table>

</body>

</html>]]></description>
            <styleUrl>#PolyStyle001</styleUrl>
            <MultiGeometry>
                <Polygon>
                    <outerBoundaryIs>
                        <LinearRing>
                            <coordinates>
                                -88.35597594524225,32.86011537400984,0 -88.3567389068841,32.85292852502473,0 -88.35768486975799,32.84508568993779,0 -88.35570853700197,32.84511675513796,0 -88.35570867858526,32.86011073571817,0 -88.35597594524225,32.86011537400984,0 
                            </coordinates>
                        </LinearRing>
                    </outerBoundaryIs>
                </Polygon>
            </MultiGeometry>
        </Placemark>
        <Placemark id="ID_00002">
            <name>AL</name>
            <Snippet maxLines="0"></Snippet>
            <description><![CDATA[<html xmlns:fo="http://www.w3.org/1999/XSL/Format" xmlns:msxsl="urn:schemas-microsoft-com:xslt">

<head>

<META http-equiv="Content-Type" content="text/html">

<meta http-equiv="content-type" content="text/html; charset=UTF-8">

</head>

<body style="margin:0px 0px 0px 0px;overflow:auto;background:#FFFFFF;">

<table style="font-family:Arial,Verdana,Times;font-size:12px;text-align:left;width:100%;border-collapse:collapse;padding:3px 3px 3px 3px">

<tr style="text-align:center;font-weight:bold;background:#9CBCE2">

<td>AL</td>

</tr>

<tr>

<td>

<table style="font-family:Arial,Verdana,Times;font-size:12px;text-align:left;width:100%;border-spacing:0px; padding:3px 3px 3px 3px">

<tr>

<td>FID</td>

<td>2</td>

</tr>

<tr bgcolor="#D4E4F3">

<td>STATE</td>

<td>AL</td>

</tr>

<tr>

<td>MER</td>

<td>25</td>

</tr>

<tr bgcolor="#D4E4F3">

<td>TWP</td>

<td>22</td>

</tr>

<tr>

<td>TDIR</td>

<td>N</td>

</tr>

<tr bgcolor="#D4E4F3">

<td>RNG</td>

<td>4</td>

</tr>

<tr>

<td>RDIR</td>

<td>W</td>

</tr>

<tr bgcolor="#D4E4F3">

<td>SEC</td>

<td>36</td>

</tr>

<tr>

<td>MODDATE</td>

<td>20050311</td>

</tr>

<tr bgcolor="#D4E4F3">

<td>DATUM</td>

<td>NAD27</td>

</tr>

<tr>

<td>SOURCE</td>

<td>WhiteStar</td>

</tr>

<tr bgcolor="#D4E4F3">

<td>MTR</td>

<td>25 22.0N  4.0W</td>

</tr>

</table>

</td>

</tr>

</table>

</body>

</html>]]></description>
            <styleUrl>#PolyStyle001</styleUrl>
            <MultiGeometry>
                <Polygon>
                    <outerBoundaryIs>
                        <LinearRing>
                            <coordinates>
                                -88.35768486975799,32.84508568993779,0 -88.35843183642189,32.83843382961495,0 -88.35914980106479,32.83165897171819,0 -88.35908878782671,32.83049899571662,0 -88.35570839957039,32.83056244880483,0 -88.35570853700197,32.84511675513796,0 -88.35768486975799,32.84508568993779,0 
                            </coordinates>
                        </LinearRing>
                    </outerBoundaryIs>
                </Polygon>
            </MultiGeometry>
        </Placemark>
        <Placemark id="ID_00003">
            <name>AL</name>
            <Snippet maxLines="0"></Snippet>
            <description><![CDATA[<html xmlns:fo="http://www.w3.org/1999/XSL/Format" xmlns:msxsl="urn:schemas-microsoft-com:xslt">

Link to the original file: KML FILE

The root of your problem is that you have an XML document with HTML documents nested inside it. Attempting to parse the whole thing as HTML doesn't work because each HTML document is stored as text inside a CDATA section of a description tag. As a result, while this is valid XML, it's not even remotely valid HTML. Parsing it only as XML finds no tables either, since the XML parser treats the CDATA content as plain text rather than markup.
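To see the effect in isolation, here is a minimal sketch. The KML string is a made-up, cut-down stand-in for one Placemark from the sample (not the real file), and it assumes bs4 and lxml are installed. It shows that the XML parser finds no table elements, while the markup is still present as text inside description.

```python
from bs4 import BeautifulSoup

# Hypothetical cut-down stand-in for one Placemark from the sample KML.
kml = """<kml xmlns="http://www.opengis.net/kml/2.2">
  <Placemark id="ID_00000">
    <description><![CDATA[<html><body><table><tr><td>FID</td><td>0</td></tr></table></body></html>]]></description>
  </Placemark>
</kml>"""

xml_soup = BeautifulSoup(kml, "lxml-xml")

# The CDATA payload is a text node, so no <table> element exists in the XML tree.
tables_in_xml = xml_soup.find_all("table")

# The HTML is still there, but only as the text of <description>.
description = xml_soup.find("description")
html_as_text = description.text

print(len(tables_in_xml))
print("<table>" in html_as_text)
```

This is why searching the raw file with re.search or a text editor finds the tables while find_all on the XML tree does not.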

To fix this, I parsed the whole document as XML, extracted each HTML portion (as a string), and then parsed that HTML portion as HTML. Note that, somewhat confusingly, lxml is an HTML parser but lxml-xml is an XML parser.

from bs4 import BeautifulSoup as Soup

with open('sample.kml') as data:
    kml_soup = Soup(data, 'lxml-xml') # Parse as XML

descriptions = kml_soup.find_all('description')
for description in descriptions:
    html_soup = Soup(description.text, 'lxml') # Parse as HTML
    tables = html_soup.find_all('table')
    print(len(tables))
    for table in tables:
        rows = table.find_all('tr')

        for row in rows:
            cols = row.find_all('td')
            ...

For the sample you provided, there are six tables. The code above prints "2" three times, so it found all six of them.
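If the goal is the field/value pairs in the inner table rather than the table count, the same two-step parse can be extended. The following is only a sketch under assumptions: the KML string is a shortened, made-up sample that mirrors the nested-table layout above, and placemark_attributes is a hypothetical helper name, not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup

# Hypothetical shortened sample mirroring the nested-table layout of the posted KML.
kml = """<kml xmlns="http://www.opengis.net/kml/2.2">
<Placemark id="ID_00000">
<description><![CDATA[<html><body>
<table><tr><td>AL</td></tr>
<tr><td>
<table>
<tr><td>FID</td><td>0</td></tr>
<tr bgcolor="#D4E4F3"><td>STATE</td><td>AL</td></tr>
<tr><td>SEC</td><td>24</td></tr>
</table>
</td></tr></table>
</body></html>]]></description>
</Placemark>
</kml>"""


def placemark_attributes(kml_text):
    """Return {placemark id: {field: value}} read from the two-cell rows."""
    kml_soup = BeautifulSoup(kml_text, "lxml-xml")  # step 1: parse as XML
    result = {}
    for placemark in kml_soup.find_all("Placemark"):
        description = placemark.find("description")
        if description is None:
            continue
        html_soup = BeautifulSoup(description.text, "lxml")  # step 2: parse as HTML
        fields = {}
        for row in html_soup.find_all("tr"):
            # recursive=False counts only this row's own cells, so the outer
            # row that wraps the inner table is seen as a single cell.
            cells = row.find_all("td", recursive=False)
            if len(cells) == 2:
                key = cells[0].get_text(strip=True)
                fields[key] = cells[1].get_text(strip=True)
        result[placemark.get("id")] = fields
    return result


records = placemark_attributes(kml)
print(records)
```

Keying the result by the Placemark id keeps each attribute set tied to its geometry, which may help when there are tens of thousands of placemarks.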
