Parsing XML from string reports “junk” in my file, I don't know how to locate it
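I am trying to parse an XML string with ElementTree. The string is built by joining together the values of a dict. There is no root node, but it worked fine the first time.
The first time I did this, and it worked: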
import xml.etree.ElementTree as ET
for value in data.values():
    myxml = ' '.join(value)
    tree = ET.fromstring(myxml)
But in the same situation, just with another dictionary, it does not work. My code is simple:
values = [x for x in dict_fasi.values()]
myxml_fasi = ' '.join(values)
tree2 = ET.fromstring(myxml_fasi)
I also tried the loop as before, but that did not work either. The error says: xml.etree.ElementTree.ParseError: junk after document element: line 8, column 20.
Line 8 should be:
</new_line> <new_line>
The XML string is:
<new_line>
<text font="NUMPTY+ImprintMTnum" bbox="297.284,540.828,300.188,553.310" colourspace="DeviceGray" ncolour="0" size="12.482">della quale non conosce che una parte;] </text>
<text font="PYNIYO+ImprintMTnum-Italic" bbox="322.455,540.839,328.251,553.566" colourspace="DeviceGray" ncolour="0" size="12.727">prima</text>
<text font="NUMPTY+ImprintMTnum" bbox="331.206,545.345,334.683,552.834" colourspace="DeviceGray" ncolour="0" size="7.489">1</text>
<text font="NUMPTY+ImprintMTnum" bbox="177.602,528.028,180.850,540.510" colourspace="DeviceGray" ncolour="0" size="12.482">che nonconosce ancora appieno;</text>
<text font="NUMPTY+ImprintMTnum" bbox="189.430,532.545,192.908,540.034" colourspace="DeviceGray" ncolour="0" size="7.489">2</text>
<text font="NUMPTY+ImprintMTnum" bbox="203.879,528.028,208.975,540.510" colourspace="DeviceGray" ncolour="0" size="12.482">che</text>
</new_line> <new_line>
<text font="QKWQNQ+ImprintMTnum-Bold" bbox="315.109,462.272,319.863,472.957" colourspace="DeviceGray" ncolour="0" size="10.685">5</text>
<text font="NUMPTY+ImprintMTnum" bbox="368.916,461.828,372.743,474.310" colourspace="DeviceGray" ncolour="0" size="12.482">avederci]</text>
<text font="PYNIYO+ImprintMTnum-Italic" bbox="86.577,449.039,92.373,461.766" colourspace="DeviceGray" ncolour="0" size="12.727">sps.a</text>
<text font="NUMPTY+ImprintMTnum" bbox="167.611,449.028,172.707,461.510" colourspace="DeviceGray" ncolour="0" size="12.482">dove io andava a</text>
<text font="QKWQNQ+ImprintMTnum-Bold" bbox="68.031,421.672,72.786,432.357" colourspace="DeviceGray" ncolour="0" size="10.685">5</text>
<text font="NUMPTY+ImprintMTnum" bbox="137.296,421.228,140.200,433.710" colourspace="DeviceGray" ncolour="0" size="12.482">tante libertà] </text>
<text font="PYNIYO+ImprintMTnum-Italic" bbox="161.868,421.239,167.664,433.966" colourspace="DeviceGray" ncolour="0" size="12.727">prima</text>
<text font="NUMPTY+ImprintMTnum" bbox="170.784,425.745,174.262,433.234" colourspace="DeviceGray" ncolour="0" size="7.489">1</text>
<text font="NUMPTY+ImprintMTnum" bbox="174.297,421.228,183.920,433.710" colourspace="DeviceGray" ncolour="0" size="12.482">m</text>
<text font="MUVAOR+Symbol" bbox="194.367,421.612,199.376,431.672" colourspace="DeviceGray" ncolour="0" size="10.060"><></text>
<text font="NUMPTY+ImprintMTnum" bbox="208.349,425.745,211.827,433.234" colourspace="DeviceGray" ncolour="0" size="7.489">2</text>
<text font="NUMPTY+ImprintMTnum" bbox="244.601,421.228,250.976,433.710" colourspace="DeviceGray" ncolour="0" size="12.482">certe lib</text>
<text font="MUVAOR+Symbol" bbox="250.901,421.612,255.910,431.672" colourspace="DeviceGray" ncolour="0" size="10.060"><</text>
<text font="NUMPTY+ImprintMTnum" bbox="269.331,421.228,274.426,433.710" colourspace="DeviceGray" ncolour="0" size="12.482">ertà</text>
<text font="MUVAOR+Symbol" bbox="274.363,421.612,279.373,431.672" colourspace="DeviceGray" ncolour="0" size="10.060">></text>
</new_line> <new_line>
The first XML string looks like this:
<new_line>
<text font="QKWQNQ+ImprintMTnum-Bold" bbox="234.782,118.872,239.536,129.558" colourspace="DeviceGray" ncolour="0" size="10.685">80</text>
<text font="NUMPTY+ImprintMTnum" bbox="360.280,118.428,363.184,130.911" colourspace="DeviceGray" ncolour="0" size="12.482">pazienza, e la prudenza.] </text>
<text font="PYNIYO+ImprintMTnum-Italic" bbox="369.339,118.440,375.135,131.167" colourspace="DeviceGray" ncolour="0" size="12.727">da</text>
<text font="NUMPTY+ImprintMTnum" bbox="113.588,105.629,118.684,118.111" colourspace="DeviceGray" ncolour="0" size="12.482">pa-zienza</text>
<text font="MUVAOR+Symbol" bbox="120.415,105.707,124.422,117.543" colourspace="DeviceGray" ncolour="0" size="11.835">=</text>
</new_line>
<new_line>
<text font="NUMPTY+ImprintMTnum" bbox="194.095,105.629,196.999,118.111" colourspace="DeviceGray" ncolour="0" size="12.482">Cristoforo] </text>
<text font="PYNIYO+ImprintMTnum-Italic" bbox="214.031,105.640,219.827,118.367" colourspace="DeviceGray" ncolour="0" size="12.727">sts.a</text>
<text font="NUMPTY+ImprintMTnum" bbox="241.600,81.508,247.396,93.991" colourspace="DeviceGray" ncolour="0" size="12.482">Galdino 72</text>
<text font="SZWUPJ+ImprintExpertMT" bbox="272.785,614.422,276.490,625.380" colourspace="DeviceGray" ncolour="0" size="10.958"> </text>
<text font="NUMPTY+ImprintMTnum" bbox="53.923,592.408,58.102,602.646" colourspace="DeviceGray" ncolour="0" size="10.238">34c</text>
<text font="QKWQNQ+ImprintMTnum-Bold" bbox="72.640,592.472,77.394,603.157" colourspace="DeviceGray" ncolour="0" size="10.685">80</text>
<text font="NUMPTY+ImprintMTnum" bbox="187.701,592.028,190.605,604.510" colourspace="DeviceGray" ncolour="0" size="12.482">troverà … immaginare] </text>
<text font="PYNIYO+ImprintMTnum-Italic" bbox="201.265,592.039,204.169,604.766" colourspace="DeviceGray" ncolour="0" size="12.727">da </text>
<text font="NUMPTY+ImprintMTnum" bbox="305.701,592.028,310.796,604.510" colourspace="DeviceGray" ncolour="0" size="12.482">qualche rimedio inaspe</text>
<text font="MUVAOR+Symbol" bbox="310.691,592.412,315.701,602.472" colourspace="DeviceGray" ncolour="0" size="10.060"><</text>
<text font="NUMPTY+ImprintMTnum" bbox="331.518,592.028,337.314,604.510" colourspace="DeviceGray" ncolour="0" size="12.482">ttato</text>
<text font="MUVAOR+Symbol" bbox="337.154,592.412,342.163,602.472" colourspace="DeviceGray" ncolour="0" size="10.060">></text>
</new_line>
It may be a problem with how the new_line tags are opened and closed, but I don't know how to solve it.
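The word "junk" in the error message seems a rather unfair value judgement; what it means is that the parser expects to see a single top-level element, and when it reaches the end of that element (and any trailing comments or PIs), it expects to see the end of the file. If another element start tag follows instead, the input is not a well-formed XML document.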
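To locate the spot the parser is complaining about, the exception itself can help: ParseError carries a position attribute holding the line and column. The following is only a minimal sketch; the fragment string and the helper name locate_parse_error are made up for illustration and are not taken from the question's data.

```python
import xml.etree.ElementTree as ET

# Made-up input: two sibling top-level elements, so not a well-formed document.
fragment = (
    "<new_line>\n"
    "<text>a</text>\n"
    "</new_line> <new_line>\n"
    "<text>b</text>\n"
    "</new_line>"
)

def locate_parse_error(xml_text):
    """Return (line, column, offending line text), or None if the text parses."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        line, column = err.position  # line numbers start at 1
        return line, column, xml_text.splitlines()[line - 1]
    return None

result = locate_parse_error(fragment)
print(result)
```

The reported line is the one where the second top-level element starts, which matches the `</new_line> <new_line>` line quoted in the question.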
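You say you are aware that there is no root node, but you don't seem to be aware that this makes the document ill-formed. You say it worked the first time: well, it shouldn't have.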
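One common way around this is to wrap the joined fragments in a single synthetic root element before parsing. This is a sketch under assumptions: the helper parse_fragments and the tag name "root" are hypothetical, and the dict_fasi below is a tiny stand-in, not the real data from the question.

```python
import xml.etree.ElementTree as ET

def parse_fragments(fragments, root_tag="root"):
    """Join XML fragments and wrap them in one synthetic root element."""
    wrapped = "<{0}>{1}</{0}>".format(root_tag, " ".join(fragments))
    return ET.fromstring(wrapped)

# Hypothetical stand-in for the real dictionary: one <new_line> element per value.
dict_fasi = {
    "a": '<new_line><text size="12.482">che</text></new_line>',
    "b": '<new_line><text size="10.685">5</text></new_line>',
}

tree2 = parse_fragments(dict_fasi.values())
lines = tree2.findall("new_line")
print(len(lines), [t.text for t in tree2.iter("text")])
```

With the wrapper in place, each `<new_line>` becomes a child of the synthetic root, so the parser sees exactly one top-level element.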