
Parsing error with Beautiful Soup 4 and Python

I need to get the list of rooms from this website: http://www.studentroom.ch/en/dynasite.cfm?dsmid=106547

I'm using Beautiful Soup 4 to parse the page. This is the code I have written so far:

from bs4 import BeautifulSoup
import urllib

pageFile = urllib.urlopen("http://studentroom.ch/dynasite.cfm?dsmid=106547")
pageHtml = pageFile.read()
pageFile.close()

soup = BeautifulSoup("".join(pageHtml))

roomsNoFilter = soup.find('div', {"id": "ImmoListe"})

rooms = roomsNoFilter.table.find_all('tr', recursive=False)

for room in rooms:
    print room
    print "----------------"

print len(rooms)

For now I'm only trying to get the rows of the table, but I get only 7 rows instead of 78 (or 77).

At first I thought I was receiving only partial HTML, but I printed the whole HTML and I'm receiving it correctly. There are no AJAX calls that load new rows after the page has loaded.

Could someone please help me find the error?
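Since the downloaded HTML is complete, the rows are probably being lost at the parsing step. Beautiful Soup can build different trees from the same invalid markup depending on the parser it uses. The following is a minimal Python 3 diagnostic sketch, not a confirmed diagnosis of this page: `rows_per_parser` is a hypothetical helper, and the inline `sample` markup is made up for illustration. It compares a rough count of `<tr` tags in the raw text with the number of rows each installed parser reports.

```python
import re

from bs4 import BeautifulSoup, FeatureNotFound


def rows_per_parser(html, parsers=("html.parser", "lxml", "html5lib")):
    """Count <tr> rows in the raw text and in the tree built by each parser."""
    # Rough raw count: opening <tr> tags, with or without attributes.
    counts = {"raw": len(re.findall(r"<tr[\s>]", html, flags=re.IGNORECASE))}
    for name in parsers:
        try:
            soup = BeautifulSoup(html, name)
        except FeatureNotFound:
            # This parser is not installed; skip it.
            continue
        counts[name] = len(soup.find_all("tr"))
    return counts


if __name__ == "__main__":
    # Made-up markup standing in for the real page.
    sample = "<table><tr><td>a</td></tr><tr><td>b</td></tr></table>"
    print(rows_per_parser(sample))
```

If the raw count is close to 78 but one parser reports far fewer rows, the markup is likely malformed in a way that parser handles badly, and passing a different parser name to `BeautifulSoup` may be worth trying.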

This is working for me:

soup = BeautifulSoup(pageHtml)
div = soup.select('#ImmoListe')[0]
table = div.select('table > tbody')[0]
k = 0
for room in table.find_all('tr'):
    if 'onmouseout' in str(room):
        print room
        k = k + 1
print "Total ",k

Let me know how it goes.
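For reference, here is a minimal Python 3 sketch of how the two approaches differ. The `SAMPLE_HTML` markup and the `count_rows` helper are assumptions made up for illustration, not the real page, so this does not confirm what actually went wrong there. It shows that `find_all('tr', recursive=False)` only looks at direct children of the `<table>`, while the default recursive search also reaches rows inside `<tbody>`. It also filters on the `onmouseout` attribute directly instead of searching `str(room)`.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one header row plus three listing rows inside <tbody>.
SAMPLE_HTML = """
<div id="ImmoListe">
  <table>
    <tbody>
      <tr><th>Room</th></tr>
      <tr onmouseout="x()"><td>Room A</td></tr>
      <tr onmouseout="x()"><td>Room B</td></tr>
      <tr onmouseout="x()"><td>Room C</td></tr>
    </tbody>
  </table>
</div>
"""


def count_rows(html):
    """Return (direct-child rows, all rows, rows with an onmouseout attribute)."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("div", id="ImmoListe").table
    # Only <tr> tags that are immediate children of <table>.
    direct = table.find_all("tr", recursive=False)
    # Every <tr> below the table, at any depth.
    nested = table.find_all("tr")
    # Rows that carry an onmouseout attribute, whatever its value.
    listings = table.find_all("tr", attrs={"onmouseout": True})
    return len(direct), len(nested), len(listings)


if __name__ == "__main__":
    print(count_rows(SAMPLE_HTML))
```

With this sample the non-recursive search finds no rows at all, because every `<tr>` sits under `<tbody>` rather than directly under `<table>`.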


 