Parsing KML with Beautiful Soup

I am having trouble parsing a KML file (XML) using Beautiful Soup. The snippet below should iterate a non-zero number of times at every level. On my sample, the lxml parser returns 2 tables and the xml parser returns 0, but the number should be 3.

from bs4 import BeautifulSoup

url = "sample.kml"

with open(url,'r') as page:

    soup = BeautifulSoup(page, "lxml")

    tables = soup.find_all('table')
    print(len(tables))

    for table in tables:    
        rows = table.find_all('tr')

        for row in rows:    
            cols = row.find_all('td')

This first sample script returns 2 tables instead of 3 with the lxml parser, and 0 with the xml parser. The second script starts from the Placemark tags instead:

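from bs4 import BeautifulSoup

url = "sample.kml"

with open(url, 'r') as page:

    soup = BeautifulSoup(page, "xml")

    # Start from the Placemark elements instead of the tables
    placemark = soup.find_all('Placemark')
    print(len(placemark))

    for place in placemark:

        tables = place.find_all('table')
        print(len(tables))

        for table in tables:
            rows = table.find_all('tr')

            for row in rows:
                cols = row.find_all('td')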

Traversing the tree, I originally searched for tables directly, but len(tables) returned 2, which I know to be wrong: there should be about 92,000. So I found another tag to start stepping through the tree from, which returned the correct count, and then tried to find the rows and columns within each of those tags. They all returned zero, which surprised me. I experimented with different parsers and eventually determined that xml was the appropriate one, but I still could not find the correct number of tables, even though I can find them with re.search or by searching in Sublime Text. That led me to check whether the tables might be encapsulated in some way, but to no avail. I am quite stuck and can't find a way to access the 92,000 tables using the find_all("TAG") method. Any suggestions?

Sample KML

<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2" xmlns:gx="http://www.google.com/kml/ext/2.2" xmlns:kml="http://www.opengis.net/kml/2.2" xmlns:atom="http://www.w3.org/2005/Atom">
<Document id="laaSECS" xsi:schemaLocation="http://www.opengis.net/kml/2.2 http://schemas.opengis.net/kml/2.2.0/ogckml22.xsd http://www.google.com/kml/ext/2.2 http://code.google.com/apis/kml/schema/kml22gx.xsd">
    <name>laaSECS</name>
    <Snippet maxLines="0"></Snippet>
    <Style id="PolyStyle00">
        <LabelStyle>
            <color>00000000</color>
            <scale>0</scale>
        </LabelStyle>
        <LineStyle>
            <color>ff7f5555</color>
            <width>0.2</width>
        </LineStyle>
        <PolyStyle>
            <color>ffc5d9fa</color>
            <fill>0</fill>
        </PolyStyle>
    </Style>
    <Style id="PolyStyle000">
        <LabelStyle>
            <color>00000000</color>
            <scale>0</scale>
        </LabelStyle>
        <LineStyle>
            <color>ff7f5555</color>
            <width>0.2</width>
        </LineStyle>
        <PolyStyle>
            <color>ffc5d9fa</color>
            <fill>0</fill>
        </PolyStyle>
    </Style>
    <StyleMap id="PolyStyle001">
        <Pair>
            <key>normal</key>
            <styleUrl>#PolyStyle00</styleUrl>
        </Pair>
        <Pair>
            <key>highlight</key>
            <styleUrl>#PolyStyle000</styleUrl>
        </Pair>
    </StyleMap>
    <Folder id="FeatureLayer0">
        <name>laaSECS</name>
        <Snippet maxLines="0"></Snippet>
        <Placemark id="ID_00000">
            <name>AL</name>
            <Snippet maxLines="0"></Snippet>
            <description><![CDATA[<html xmlns:fo="http://www.w3.org/1999/XSL/Format" xmlns:msxsl="urn:schemas-microsoft-com:xslt">

<head>

<META http-equiv="Content-Type" content="text/html">

<meta http-equiv="content-type" content="text/html; charset=UTF-8">

</head>

<body style="margin:0px 0px 0px 0px;overflow:auto;background:#FFFFFF;">

<table style="font-family:Arial,Verdana,Times;font-size:12px;text-align:left;width:100%;border-collapse:collapse;padding:3px 3px 3px 3px">

<tr style="text-align:center;font-weight:bold;background:#9CBCE2">

<td>AL</td>

</tr>

<tr>

<td>

<table style="font-family:Arial,Verdana,Times;font-size:12px;text-align:left;width:100%;border-spacing:0px; padding:3px 3px 3px 3px">

<tr>

<td>FID</td>

<td>0</td>

</tr>

<tr bgcolor="#D4E4F3">

<td>STATE</td>

<td>AL</td>

</tr>

<tr>

<td>MER</td>

<td>25</td>

</tr>

<tr bgcolor="#D4E4F3">

<td>TWP</td>

<td>22</td>

</tr>

<tr>

<td>TDIR</td>

<td>N</td>

</tr>

<tr bgcolor="#D4E4F3">

<td>RNG</td>

<td>4</td>

</tr>

<tr>

<td>RDIR</td>

<td>W</td>

</tr>

<tr bgcolor="#D4E4F3">

<td>SEC</td>

<td>24</td>

</tr>

<tr>

<td>MODDATE</td>

<td>20050311</td>

</tr>

<tr bgcolor="#D4E4F3">

<td>DATUM</td>

<td>NAD27</td>

</tr>

<tr>

<td>SOURCE</td>

<td>WhiteStar</td>

</tr>

<tr bgcolor="#D4E4F3">

<td>MTR</td>

<td>25 22.0N  4.0W</td>

</tr>

</table>

</td>

</tr>

</table>

</body>

</html>]]></description>
            <styleUrl>#PolyStyle001</styleUrl>
            <MultiGeometry>
                <Polygon>
                    <outerBoundaryIs>
                        <LinearRing>
                            <coordinates>
                                -88.35570867858526,32.86011073571817,0 -88.35570870147141,32.86253443065814,0 -88.35597594524225,32.86011537400984,0 -88.35570867858526,32.86011073571817,0 
                            </coordinates>
                        </LinearRing>
                    </outerBoundaryIs>
                </Polygon>
            </MultiGeometry>
        </Placemark>
        <Placemark id="ID_00001">
            <name>AL</name>
            <Snippet maxLines="0"></Snippet>
            <description><![CDATA[<html xmlns:fo="http://www.w3.org/1999/XSL/Format" xmlns:msxsl="urn:schemas-microsoft-com:xslt">

<head>

<META http-equiv="Content-Type" content="text/html">

<meta http-equiv="content-type" content="text/html; charset=UTF-8">

</head>

<body style="margin:0px 0px 0px 0px;overflow:auto;background:#FFFFFF;">

<table style="font-family:Arial,Verdana,Times;font-size:12px;text-align:left;width:100%;border-collapse:collapse;padding:3px 3px 3px 3px">

<tr style="text-align:center;font-weight:bold;background:#9CBCE2">

<td>AL</td>

</tr>

<tr>

<td>

<table style="font-family:Arial,Verdana,Times;font-size:12px;text-align:left;width:100%;border-spacing:0px; padding:3px 3px 3px 3px">

<tr>

<td>FID</td>

<td>1</td>

</tr>

<tr bgcolor="#D4E4F3">

<td>STATE</td>

<td>AL</td>

</tr>

<tr>

<td>MER</td>

<td>25</td>

</tr>

<tr bgcolor="#D4E4F3">

<td>TWP</td>

<td>22</td>

</tr>

<tr>

<td>TDIR</td>

<td>N</td>

</tr>

<tr bgcolor="#D4E4F3">

<td>RNG</td>

<td>4</td>

</tr>

<tr>

<td>RDIR</td>

<td>W</td>

</tr>

<tr bgcolor="#D4E4F3">

<td>SEC</td>

<td>25</td>

</tr>

<tr>

<td>MODDATE</td>

<td>20050311</td>

</tr>

<tr bgcolor="#D4E4F3">

<td>DATUM</td>

<td>NAD27</td>

</tr>

<tr>

<td>SOURCE</td>

<td>WhiteStar</td>

</tr>

<tr bgcolor="#D4E4F3">

<td>MTR</td>

<td>25 22.0N  4.0W</td>

</tr>

</table>

</td>

</tr>

</table>

</body>

</html>]]></description>
            <styleUrl>#PolyStyle001</styleUrl>
            <MultiGeometry>
                <Polygon>
                    <outerBoundaryIs>
                        <LinearRing>
                            <coordinates>
                                -88.35597594524225,32.86011537400984,0 -88.3567389068841,32.85292852502473,0 -88.35768486975799,32.84508568993779,0 -88.35570853700197,32.84511675513796,0 -88.35570867858526,32.86011073571817,0 -88.35597594524225,32.86011537400984,0 
                            </coordinates>
                        </LinearRing>
                    </outerBoundaryIs>
                </Polygon>
            </MultiGeometry>
        </Placemark>
        <Placemark id="ID_00002">
            <name>AL</name>
            <Snippet maxLines="0"></Snippet>
            <description><![CDATA[<html xmlns:fo="http://www.w3.org/1999/XSL/Format" xmlns:msxsl="urn:schemas-microsoft-com:xslt">

<head>

<META http-equiv="Content-Type" content="text/html">

<meta http-equiv="content-type" content="text/html; charset=UTF-8">

</head>

<body style="margin:0px 0px 0px 0px;overflow:auto;background:#FFFFFF;">

<table style="font-family:Arial,Verdana,Times;font-size:12px;text-align:left;width:100%;border-collapse:collapse;padding:3px 3px 3px 3px">

<tr style="text-align:center;font-weight:bold;background:#9CBCE2">

<td>AL</td>

</tr>

<tr>

<td>

<table style="font-family:Arial,Verdana,Times;font-size:12px;text-align:left;width:100%;border-spacing:0px; padding:3px 3px 3px 3px">

<tr>

<td>FID</td>

<td>2</td>

</tr>

<tr bgcolor="#D4E4F3">

<td>STATE</td>

<td>AL</td>

</tr>

<tr>

<td>MER</td>

<td>25</td>

</tr>

<tr bgcolor="#D4E4F3">

<td>TWP</td>

<td>22</td>

</tr>

<tr>

<td>TDIR</td>

<td>N</td>

</tr>

<tr bgcolor="#D4E4F3">

<td>RNG</td>

<td>4</td>

</tr>

<tr>

<td>RDIR</td>

<td>W</td>

</tr>

<tr bgcolor="#D4E4F3">

<td>SEC</td>

<td>36</td>

</tr>

<tr>

<td>MODDATE</td>

<td>20050311</td>

</tr>

<tr bgcolor="#D4E4F3">

<td>DATUM</td>

<td>NAD27</td>

</tr>

<tr>

<td>SOURCE</td>

<td>WhiteStar</td>

</tr>

<tr bgcolor="#D4E4F3">

<td>MTR</td>

<td>25 22.0N  4.0W</td>

</tr>

</table>

</td>

</tr>

</table>

</body>

</html>]]></description>
            <styleUrl>#PolyStyle001</styleUrl>
            <MultiGeometry>
                <Polygon>
                    <outerBoundaryIs>
                        <LinearRing>
                            <coordinates>
                                -88.35768486975799,32.84508568993779,0 -88.35843183642189,32.83843382961495,0 -88.35914980106479,32.83165897171819,0 -88.35908878782671,32.83049899571662,0 -88.35570839957039,32.83056244880483,0 -88.35570853700197,32.84511675513796,0 -88.35768486975799,32.84508568993779,0 
                            </coordinates>
                        </LinearRing>
                    </outerBoundaryIs>
                </Polygon>
            </MultiGeometry>
        </Placemark>
        <Placemark id="ID_00003">
            <name>AL</name>
            <Snippet maxLines="0"></Snippet>
            <description><![CDATA[<html xmlns:fo="http://www.w3.org/1999/XSL/Format" xmlns:msxsl="urn:schemas-microsoft-com:xslt">

Link to the original file: KML FILE

The root of your problem is that you have an XML document with HTML documents nested inside it. Attempting to parse the whole thing as HTML isn't working because the HTML documents appear to be stored inside CDATA sections of the description elements. As a result, while this is valid XML, it's not even remotely valid HTML.
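To see why an XML parser finds no table elements at all, here is a minimal sketch using only the standard library's xml.etree.ElementTree rather than Beautiful Soup. The KML string is a hypothetical, cut-down version of the sample, and the variable names are my own. It shows that the HTML page is just the text content of the description element, not part of the XML tree:

```python
import xml.etree.ElementTree as ET

# Hypothetical, cut-down KML in the same shape as the sample:
# the HTML lives inside a CDATA section of <description>.
KML = """<kml xmlns="http://www.opengis.net/kml/2.2">
  <Placemark id="ID_00000">
    <description><![CDATA[<html><body><table><tr><td>FID</td><td>0</td></tr></table></body></html>]]></description>
  </Placemark>
</kml>"""

root = ET.fromstring(KML)
ns = {"kml": "http://www.opengis.net/kml/2.2"}

# Look for any XML element whose tag name ends in "table".
table_elements = [el for el in root.iter() if el.tag.endswith("table")]

# The HTML is plain character data on the <description> element.
html_text = root.find(".//kml:description", ns).text

print(len(table_elements))     # number of <table> elements in the XML tree
print("<table>" in html_text)  # whether the markup is present as text
```

Because the table markup only exists as a string, it has to be handed to an HTML parser in a second step before find_all('table') can see it.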

To fix this, I parsed the whole document as XML, extracted each HTML portion (as a string), and then parsed that HTML portion as HTML. Note that, somewhat confusingly, in Beautiful Soup the parser name "lxml" selects an HTML parser, while "lxml-xml" selects an XML parser.

from bs4 import BeautifulSoup as Soup

with open('sample.kml') as data:
    kml_soup = Soup(data, 'lxml-xml') # Parse as XML

descriptions = kml_soup.find_all('description')
for description in descriptions:
    html_soup = Soup(description.text, 'lxml') # Parse as HTML
    tables = html_soup.find_all('table')
    print(len(tables))
    for table in tables:
        rows = table.find_all('tr')

        for row in rows:
            cols = row.find_all('td')
            ...

For the sample you provided, there were six tables. The above code prints "2" three times so it found all six of them.
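If the goal is the field/value pairs rather than the table counts, the same two-stage parse can be extended along these lines. This is only a sketch: the function name placemark_attributes and the cut-down KML string are hypothetical, and it assumes every data row of the inner table has exactly two td cells, as in the sample.

```python
from bs4 import BeautifulSoup

# Hypothetical, cut-down KML in the same shape as the sample.
KML = """<kml xmlns="http://www.opengis.net/kml/2.2">
<Document>
  <Placemark id="ID_00000">
    <description><![CDATA[<html><body>
<table><tr><td>AL</td></tr><tr><td>
<table>
<tr><td>FID</td><td>0</td></tr>
<tr><td>STATE</td><td>AL</td></tr>
</table>
</td></tr></table>
</body></html>]]></description>
  </Placemark>
</Document>
</kml>"""


def placemark_attributes(kml_text):
    """Return one dict of field -> value per Placemark (illustrative helper)."""
    kml_soup = BeautifulSoup(kml_text, "lxml-xml")  # stage 1: parse as XML
    records = []
    for placemark in kml_soup.find_all("Placemark"):
        description = placemark.find("description")
        if description is None:
            continue
        # Stage 2: the CDATA payload is a string, so parse it as HTML.
        html_soup = BeautifulSoup(description.text, "lxml")
        record = {}
        for row in html_soup.find_all("tr"):
            # Only direct <td> children, so the outer wrapper rows are skipped.
            cells = row.find_all("td", recursive=False)
            if len(cells) == 2:
                key = cells[0].get_text(strip=True)
                record[key] = cells[1].get_text(strip=True)
        records.append(record)
    return records


records = placemark_attributes(KML)
print(records)
```

Using recursive=False on the td lookup keeps the outer layout table (one cell per row) from being mistaken for data rows.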
