Beautiful Soup - `findAll` not capturing all tags in SVG (`ElementTree` does)

I was attempting to generate a choropleth map by modifying an SVG map depicting all counties in the US. The basic approach is described by Flowing Data. Since SVG is basically just XML, the approach leverages the BeautifulSoup parser.

The thing is, the parser does not capture all path elements in the SVG file. The following captured only 149 paths (out of over 3,000):

#Open SVG file
svg=open(shp_dir+'USA_Counties_with_FIPS_and_names.svg','r').read()

#Parse SVG
soup = BeautifulSoup(svg, selfClosingTags=['defs','sodipodi:namedview'])

#Identify counties
paths = soup.findAll('path')

len(paths)

I know, however, that many more exist, both from inspecting the file and from the fact that ElementTree methods capture 3,143 paths with the following routine:

#Parse SVG (ET is assumed here to be the standard-library ElementTree)
import xml.etree.ElementTree as ET
tree = ET.parse(shp_dir+'USA_Counties_with_FIPS_and_names.svg')

#Capture element
root = tree.getroot()

#Compile list of IDs from file
ids=[]
for child in root:
    if 'path' in child.tag:
        ids.append(child.attrib['id'])

len(ids)
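One caveat about the loop above: iterating over `root` visits only its direct children, and ElementTree stores each tag together with its namespace, as `{http://www.w3.org/2000/svg}path`, which is why the substring test `'path' in child.tag` is needed. A minimal sketch, assuming the standard-library `xml.etree.ElementTree` and a small made-up SVG string (not the real county map), shows how `iter()` with the qualified tag name also finds paths nested inside groups:

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature SVG for illustration; the ids are made up.
SVG_NS = "http://www.w3.org/2000/svg"
sample_svg = (
    '<svg xmlns="http://www.w3.org/2000/svg">'
    '<path id="01001" d="M0 0 L1 1"/>'
    '<g><path id="01003" d="M1 1 L2 2"/></g>'
    '<path id="State_Lines" d="M0 0 L2 2"/>'
    '</svg>'
)

root = ET.fromstring(sample_svg)

# Direct children of the root only, as in the loop above.
direct_ids = [child.attrib["id"] for child in root if "path" in child.tag]

# Every <path> in the document, at any depth, matched by its qualified name.
all_ids = [el.attrib["id"] for el in root.iter("{%s}path" % SVG_NS)]

print(direct_ids)  # ['01001', 'State_Lines']
print(all_ids)     # ['01001', '01003', 'State_Lines']
```

If the two counts differ on a given file, some paths sit inside nested elements and the direct-children loop is missing them.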

I have not yet figured out how to write from the ElementTree object in a way that is not all messed up.

#Define style template string
style='font-size:12px;fill-rule:nonzero;stroke:#FFFFFF;stroke-opacity:1;'+\
        'stroke-width:0.1;stroke-miterlimit:4;stroke-dasharray:none;'+\
        'stroke-linecap:butt;marker-start:none;stroke-linejoin:bevel;fill:'

#For each path...
for child in root:
    #...if it is a path....
    if 'path' in child.tag:
        try:
            #...update the style to the new string with a county-specific color...
            child.attrib['style']=style+col_map[child.attrib['id']]
        except:
            #...if it's not a county we have in the ACS, leave it alone
            child.attrib['style']=style+'#d0d0d0'+'\n'

#Write modified SVG to disk
tree.write(shp_dir+'mhv_by_cty.svg')

The modification/write routine above yields this monstrosity:

[image: average median home value by county]

My primary question is this: why did BeautifulSoup fail to capture all of the path tags? Second, why would the image modified with the ElementTree objects have all of that extracurricular activity going on? Any advice would be greatly appreciated.

You need to do the following:

  • upgrade to beautifulsoup4:

     pip install beautifulsoup4 -U
  • import it as:

     from bs4 import BeautifulSoup
  • install the latest lxml module:

     pip install lxml -U
  • explicitly specify lxml as the parser:

     soup = BeautifulSoup(svg, 'lxml')

Demo:

>>> from bs4 import BeautifulSoup
>>> 
>>> svg = open('USA_Counties_with_FIPS_and_names.svg','r').read()
>>> soup = BeautifulSoup(svg, 'lxml')
>>> paths = soup.findAll('path')
>>> len(paths)
3143

alexce's answer is correct for your first question. As far as your second question is concerned:

"why would the image modified with the ElementTree objects have all of that extracurricular activity going on?"

the answer is pretty simple: not every <path> element draws a county. Specifically, there are two elements, one with id="State_Lines" and one with id="separator", that should be skipped. You didn't supply your dataset of colors, so I just used a random hex color generator (adapted from here) for each county, then used lxml to parse the .svg's XML and iterate through each <path> element, skipping the ones I mentioned above:

from lxml import etree as ET
import random

def random_color():
    r = lambda: random.randint(0,255)
    return '#%02X%02X%02X' % (r(),r(),r())

new_style = 'font-size:12px;fill-rule:nonzero;stroke:#FFFFFF;stroke-opacity:1;stroke-width:0.1;stroke-miterlimit:4;stroke-dasharray:none;stroke-linecap:butt;marker-start:none;stroke-linejoin:bevel;fill:'

tree = ET.parse('USA_Counties_with_FIPS_and_names.svg')
root = tree.getroot()
for child in root:
    if 'path' in child.tag and child.attrib['id'] not in ["separator", "State_Lines"]:
        child.attrib['style'] = new_style + random_color()

tree.write('counties_new.svg')

resulting in this nice image: 产生了这个漂亮的图像:
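With real data in place of random colors, the same filter can be combined with the `col_map` lookup from the question. The following is only a sketch under assumptions: `col_map` here is a made-up dictionary from path id to hex color, the SVG is a tiny stand-in string rather than the county file, the style string is abbreviated, and `dict.get` with a gray default stands in for the bare `try`/`except`:

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the county map; ids and colors are made up.
sample_svg = (
    '<svg xmlns="http://www.w3.org/2000/svg">'
    '<path id="01001" d="M0 0 L1 1"/>'
    '<path id="01003" d="M1 1 L2 2"/>'
    '<path id="State_Lines" d="M0 0 L2 2" style="fill:none"/>'
    '<path id="separator" d="M0 2 L2 2" style="fill:none"/>'
    '</svg>'
)
col_map = {"01001": "#FF0000"}  # assumed mapping: path id -> fill color
NON_COUNTY_IDS = {"State_Lines", "separator"}
style = "stroke:#FFFFFF;stroke-width:0.1;fill:"  # abbreviated style template

root = ET.fromstring(sample_svg)

for child in root:
    path_id = child.attrib.get("id")
    if "path" in child.tag and path_id not in NON_COUNTY_IDS:
        # Counties without data fall back to light gray.
        child.attrib["style"] = style + col_map.get(path_id, "#d0d0d0")

# Collect the resulting style of every element, keyed by id.
styles = {child.attrib.get("id"): child.attrib.get("style") for child in root}
print(styles)
```

Because the two non-county paths are skipped before the lookup, they keep their original style instead of receiving the gray fallback fill.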
