Python Beautiful Soup stops parsing

I am trying to parse the attached text.txt file (which contains HTML) with the following script.

#!/usr/bin/python3

import re
from bs4 import BeautifulSoup

pattern = re.compile("www.geocaching.com")
with open("text.txt") as f:
    text = f.read()
s = BeautifulSoup(text)
a = s.find_all(href=pattern)
print(len(a))
print(a[-1])

I expect to get every tag whose href contains "www.geocaching.com", but I do not get all of them from the attached file. The last one found is:

<a class="lnk " href="http://www.geocaching.com/geocache/GC3HWHJ_corse-known-unknown-2-view-on-ile-de-giraglia"><span>Corse known &amp; unknown 2 - View on Ile de Giraglia</span></a>

If I delete lines 626-674, which contain only some simple HTML, I get the next two, i.e. the last one is then:

<a class="lnk " href="http://www.geocaching.com/geocache/GC3MEDG_tour-genoise-dagnello"><span>TOUR GENOISE D'AGNELLO</span></a>

But again, I do not get all the results that I can find manually in the HTML file.

The file I use is from here (I downloaded it to use it locally): https://www.geocaching.com/seek/nearest.aspx?lat=43.410333&lon=09.0476&dist=100

Try using a CSS selector instead, as follows:

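from bs4 import BeautifulSoup

# Read the saved page; the context manager closes the file automatically.
with open("text.txt") as f:
    text = f.read()

# Name the parser explicitly; "html.parser" ships with Python.
soup = BeautifulSoup(text, "html.parser")

# Find every tag whose href contains "www.geocaching.com".
# [attr*="value"] is the CSS substring match; ~= only matches whole
# whitespace-separated words and must sit inside the brackets.
links = soup.select('[href*="www.geocaching.com"]')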
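
If the selector still returns fewer links than you can see in the file, the parser may be worth checking, since Beautiful Soup's parsers can build different trees from invalid markup. Below is a minimal sketch of that check, not a confirmed diagnosis of this file. SAMPLE_HTML and the helper names matching_links and count_per_parser are made up for illustration, and lxml and html5lib are only tried if they happen to be installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical sample markup standing in for text.txt.
SAMPLE_HTML = """
<html><body>
<a class="lnk" href="http://www.geocaching.com/geocache/GC3HWHJ"><span>One</span></a>
<a class="lnk" href="http://www.geocaching.com/geocache/GC3MEDG"><span>Two</span></a>
<a href="http://example.com/other">Other</a>
</body></html>
"""


def matching_links(html, parser="html.parser"):
    """Return every tag whose href contains the geocaching host name."""
    soup = BeautifulSoup(html, parser)
    return soup.select('[href*="www.geocaching.com"]')


def count_per_parser(html, parsers=("html.parser", "lxml", "html5lib")):
    """Count matches under each installed parser; skip parsers that are missing."""
    counts = {}
    for name in parsers:
        try:
            counts[name] = len(matching_links(html, name))
        except FeatureNotFound:
            # This parser is not installed, so leave it out of the comparison.
            continue
    return counts


if __name__ == "__main__":
    for parser, count in count_per_parser(SAMPLE_HTML).items():
        print(parser, count)
```

To try it on the real page, pass the contents of text.txt instead of SAMPLE_HTML. If the counts differ between parsers, the markup around the point where results stop is probably malformed, and the parser reporting the most links is the one to pass to BeautifulSoup.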
