Beautiful Soup - `findAll` not capturing all tags in SVG (`ElementTree` does)
I am trying to generate a choropleth map by modifying an SVG map of all counties in the US. The basic approach is described by Flowing Data. Since SVG is basically just XML, the approach uses the BeautifulSoup parser.
The problem is that the parser does not capture all of the path elements in the SVG file. The following captured only 149 paths (out of over 3000):
#Open SVG file
svg=open(shp_dir+'USA_Counties_with_FIPS_and_names.svg','r').read()
#Parse SVG
soup = BeautifulSoup(svg, selfClosingTags=['defs','sodipodi:namedview'])
#Identify counties
paths = soup.findAll('path')
len(paths)
I know, however, that many more exist, both from inspecting the file directly and from the fact that ElementTree methods capture 3,143 paths with the following routine:
#Parse SVG
tree = ET.parse(shp_dir+'USA_Counties_with_FIPS_and_names.svg')
#Capture element
root = tree.getroot()
#Compile list of IDs from file
ids=[]
for child in root:
    if 'path' in child.tag:
        ids.append(child.attrib['id'])
len(ids)
I have not yet figured out how to write from the ElementTree object in a way that is not all messed up.
#Define style template string
style='font-size:12px;fill-rule:nonzero;stroke:#FFFFFF;stroke-opacity:1;'+\
      'stroke-width:0.1;stroke-miterlimit:4;stroke-dasharray:none;'+\
      'stroke-linecap:butt;marker-start:none;stroke-linejoin:bevel;fill:'
#For each path...
for child in root:
    #...if it is a path....
    if 'path' in child.tag:
        try:
            #...update the style to the new string with a county-specific color...
            child.attrib['style']=style+col_map[child.attrib['id']]
        except:
            #...if it's not a county we have in the ACS, leave it alone
            child.attrib['style']=style+'#d0d0d0'+'\n'
#Write modified SVG to disk
tree.write(shp_dir+'mhv_by_cty.svg')
The modification/write routine above yields this monstrosity:
My primary question is this: why did BeautifulSoup fail to capture all of the path tags? Second, why would the image modified with the ElementTree objects have all of that extracurricular activity going on? Any advice would be greatly appreciated.
You need to do the following:
Upgrade to beautifulsoup4:
pip install beautifulsoup4 -U
Import it as:
from bs4 import BeautifulSoup
Install the latest lxml module:
pip install lxml -U
Explicitly specify lxml as the parser:
soup = BeautifulSoup(svg, 'lxml')
Demo:
>>> from bs4 import BeautifulSoup
>>>
>>> svg = open('USA_Counties_with_FIPS_and_names.svg','r').read()
>>> soup = BeautifulSoup(svg, 'lxml')
>>> paths = soup.findAll('path')
>>> len(paths)
3143
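The count can also be cross-checked with the standard library alone, much as the question does with ElementTree. The following is a minimal sketch, not part of the original answer: the helper name `count_paths` and the tiny inline SVG are hypothetical stand-ins for the real county file. It assumes the paths live in the standard SVG namespace, and it uses `iter()` so that paths nested inside groups are counted too.

```python
import xml.etree.ElementTree as ET

# Standard SVG namespace; ElementTree stores tags as "{namespace}localname".
SVG_NS = "http://www.w3.org/2000/svg"


def count_paths(svg_text):
    """Count every <path> element in an SVG string, at any nesting depth."""
    root = ET.fromstring(svg_text)
    return sum(1 for _ in root.iter(f"{{{SVG_NS}}}path"))


# Hypothetical sample: two paths inside a group and one at the top level.
sample = (
    f'<svg xmlns="{SVG_NS}">'
    '<g><path id="a"/><path id="b"/></g>'
    '<path id="c"/>'
    "</svg>"
)
path_count = count_paths(sample)
print(path_count)
```

For the real file, `ET.parse(filename).getroot()` would replace `ET.fromstring(...)`.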
alexce's answer is correct for your first question. As far as your second question is concerned:
"why would the image modified with the ElementTree objects have all of that extracurricular activity going on?"
the answer is pretty simple: not every <path> element draws a county. Specifically, there are two elements, one with id="State_Lines" and one with id="separator", that should be eliminated. You didn't supply your dataset of colors, so I just used a random hex color generator (adapted from here) for each county, then used lxml to parse the .svg's XML and iterate through each <path> element, skipping the ones I mentioned above:
from lxml import etree as ET
import random

def random_color():
    r = lambda: random.randint(0,255)
    return '#%02X%02X%02X' % (r(),r(),r())

new_style = 'font-size:12px;fill-rule:nonzero;stroke:#FFFFFF;stroke-opacity:1;stroke-width:0.1;stroke-miterlimit:4;stroke-dasharray:none;stroke-linecap:butt;marker-start:none;stroke-linejoin:bevel;fill:'
tree = ET.parse('USA_Counties_with_FIPS_and_names.svg')
root = tree.getroot()
for child in root:
    if 'path' in child.tag and child.attrib['id'] not in ["separator", "State_Lines"]:
        child.attrib['style'] = new_style + random_color()
tree.write('counties_new.svg')
resulting in this nice image:
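The same skip-the-non-county-paths idea can be combined with a real color lookup like the `col_map` from the question. This is only a sketch using the standard library instead of lxml. The sample SVG, the shortened style string, and the names `color_counties`, `SKIP_IDS`, and `DEFAULT_FILL` are illustrative assumptions, not part of the original code. Counties missing from the map fall back to the grey used in the question.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"
# Serialize with a default namespace rather than generated "ns0:" prefixes.
ET.register_namespace("", SVG_NS)

# Hypothetical stand-in for the county map: two counties plus the two
# non-county paths named in the answer.
SAMPLE_SVG = f"""<svg xmlns="{SVG_NS}">
  <path id="01001" d="M0 0 L1 0 L1 1 Z"/>
  <path id="01003" d="M1 0 L2 0 L2 1 Z"/>
  <path id="State_Lines" d="M0 0 L2 0"/>
  <path id="separator" d="M0 1 L2 1"/>
</svg>"""

SKIP_IDS = {"State_Lines", "separator"}          # paths that do not draw a county
BASE_STYLE = "stroke:#FFFFFF;stroke-width:0.1;fill:"  # shortened for the sketch
DEFAULT_FILL = "#d0d0d0"                          # grey for counties without data


def color_counties(svg_text, col_map):
    """Set a fill on every county path, leaving the skipped paths untouched."""
    root = ET.fromstring(svg_text)
    for path in root.iter(f"{{{SVG_NS}}}path"):
        path_id = path.get("id")
        if path_id in SKIP_IDS:
            continue
        path.set("style", BASE_STYLE + col_map.get(path_id, DEFAULT_FILL))
    return root


colored = color_counties(SAMPLE_SVG, {"01001": "#FF0000"})
styles = {p.get("id"): p.get("style") for p in colored.iter(f"{{{SVG_NS}}}path")}
serialized = ET.tostring(colored, encoding="unicode")
```

Using `dict.get` with a default avoids the bare `except` around the lookup, and a path with no `id` attribute simply receives the default fill instead of raising an error.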