
Using selenium to loop through links and get page sources

I'm trying to scrape two webpages with the following links:

https://www.boligportal.dk/lejebolig/dp/2-vaerelses-lejlighed-holstebro/id-5792074
https://www.boligportal.dk/lejebolig/dp/2-vaerelses-lejlighed-odense-m/id-5769482

I want to extract information about each house in the links. I use selenium and not beautifulsoup because the page is dynamic and beautifulsoup does not retrieve all the HTML code. I use the code below to try to achieve this.

from bs4 import BeautifulSoup
from selenium import webdriver
import re
import time

page_links=['https://www.boligportal.dk/lejebolig/dp/2-vaerelses-lejlighed-holstebro/id-5792074',
'https://www.boligportal.dk/lejebolig/dp/2-vaerelses-lejlighed-odense-m/id-5769482']

def render_page(url):
    # Open the page in Firefox, wait for the dynamic content to load,
    # then return the rendered HTML
    driver = webdriver.Firefox()
    driver.get(url)
    time.sleep(3)
    r = driver.page_source
    driver.quit()
    return(r)

def remove_html_tags(text):
    # Strip HTML tags, keeping only the text content
    clean = re.compile('<.*?>')
    return(re.sub(clean, '', text))

houses_html_code = []
housing_data = []
address = []

# Loop through main pages, render them and extract code
for i in page_links: 
    html = render_page(str(i))
    soup = BeautifulSoup(html, "html.parser")
    houses_html_code.append(soup)

for i in houses_html_code:
    for span_1 in soup.findAll('span', {"class": "AdFeatures__item-value"}):
        housing_data.append(remove_html_tags(str(span_1)))

So, in summary: I render the pages, get the page source, append the page source to a list, and search for a span class in the page sources of the two rendered pages.

However, my code returns the page source of the first link TWICE, practically ignoring the second link, even though it renders each page (Firefox pops up with each page). See output below.

Why is this not working? Sorry if the answer is obvious. I'm rather new to Python and this is my first time using selenium.

['Lejlighed',
'82 m²',
'2',
'5. sal',
'Nej',
'Ja',
'Nej',
'-',
'Ubegrænset',
'Snarest',
'8.542,-',
'-',
'25.626,-',
'-',
'34.168,-',
'24/08-2018',
'3775136',
'Lejlighed',
'82 m²',
'2',
'5. sal',
'Nej',
'Ja',
'Nej',
'-',
'Ubegrænset',
'Snarest',
'8.542,-',
'-',
'25.626,-',
'-',
'34.168,-',
'24/08-2018',
'3775136']

You have a typo: the inner loop iterates over soup, which still holds the last page you parsed, instead of the loop variable i, so the same page's data is appended on every pass of the outer loop. Change:

for span_1 in soup.findAll('span', {"class": "AdFeatures__item-value"}):

to

for span_1 in i.findAll('span', {"class": "AdFeatures__item-value"}):

But why do you create a new webdriver for each page? Why not do something like this:

page_links=['https://www.boligportal.dk/lejebolig/dp/2-vaerelses-lejlighed-holstebro/id-5792074', 'https://www.boligportal.dk/lejebolig/dp/2-vaerelses-lejlighed-odense-m/id-5769482']
driver = webdriver.Firefox()

def render_page(url):
    driver.get(url)
    ...

...
for i in houses_html_code:
    for span_1 in i.findAll('span', {"class": "AdFeatures__item-value"}):
         housing_data.append(remove_html_tags(str(span_1)))

driver.quit()

Outputs:

['Lejlighed', '78 m²', '2', '2. sal', 'Nej', 'Nej', 'Nej', '-', 'Ubegrænset', 'Snarest', '5.300,-', '800,-', '15.900,-', '0,-', '22.000,-', '27/10-2018', '3864958', 'Lejlighed', '82 m²', '2', '5. sal', 'Nej', 'Ja', 'Nej', '-', 'Ubegrænset', 'Snarest', '8.542,-', '-', '25.626,-', '-', '34.168,-', '24/08-2018', '3775136']
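
For completeness, here is a minimal end-to-end sketch combining both fixes: one shared driver, and looping over each parsed page rather than the stale soup variable. The explicit WebDriverWait on the AdFeatures__item-value spans is my own substitution for the original time.sleep(3) and assumes the page keeps that markup; keep the fixed sleep if you prefer.

# Minimal sketch: one shared driver, iterate over each parsed page.
# The WebDriverWait is an assumed improvement over time.sleep(3).
import re

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

page_links = ['https://www.boligportal.dk/lejebolig/dp/2-vaerelses-lejlighed-holstebro/id-5792074',
              'https://www.boligportal.dk/lejebolig/dp/2-vaerelses-lejlighed-odense-m/id-5769482']

def remove_html_tags(text):
    # Strip HTML tags, keeping only the text content
    return re.sub('<.*?>', '', text)

driver = webdriver.Firefox()  # one driver, reused for every page
houses_html_code = []
housing_data = []

try:
    for link in page_links:
        driver.get(link)
        # Wait until the dynamic feature values have rendered instead of sleeping
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.CLASS_NAME, 'AdFeatures__item-value')))
        houses_html_code.append(BeautifulSoup(driver.page_source, 'html.parser'))
finally:
    driver.quit()  # always close the browser, even if a page fails to load

for page in houses_html_code:
    for span_1 in page.findAll('span', {"class": "AdFeatures__item-value"}):
        housing_data.append(remove_html_tags(str(span_1)))

print(housing_data)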
