Scraperwiki + lxml. How to get the href attribute of a child of an element with a class?
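The page whose URL contains "alpha" lists many links (hrefs). I would like to collect them from 20 different pages and append each one to the general URL (see the second-to-last line of the code). The hrefs are in a table: the td has the class mys-elastic mys-left, and the a inside it is the element that carries the href attribute. Any help would be greatly appreciated, as I have been working on this for about a week.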
for i in range(1, 11):
    # The HTML Scraper for the 20 pages that list all the exhibitors
    url = 'http://ahr13.mapyourshow.com/5_0/exhibitor_results.cfm?alpha=%40&type=alpha&page=' + str(i) + '#GotoResults'
    print url
    list_html = scraperwiki.scrape(url)
    root = lxml.html.fromstring(list_html)
    href_element = root.cssselect('td.mys-elastic mys-left a')
    for element in href_element:
        # Convert HTML to lxml Object
        href = href_element.get('href')
        print href
        page_html = scraperwiki.scrape('http://ahr13.mapyourshow.com' + href)
        print page_html
No need to hack around with JavaScript; it is all there in the HTML:
import scraperwiki
import lxml.html

html = scraperwiki.scrape('http://ahr13.mapyourshow.com/5_0/exhibitor_results.cfm?alpha=%40&type=alpha&page=1')
root = lxml.html.fromstring(html)

# get the links
hrefs = root.xpath('//td[@class="mys-elastic mys-left"]/a')
for href in hrefs:
    print('http://ahr13.mapyourshow.com' + href.attrib['href'])
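The loop in the question fails for two reasons that can be checked offline. The selector 'td.mys-elastic mys-left a' looks for a mys-left tag rather than a second class, and .get('href') is called on the list instead of on each element. Below is a minimal sketch on a made-up HTML fragment, assuming lxml is installed. The markup, the exhibitor_details.cfm hrefs, and the helper name extract_links are illustrative assumptions, not taken from the real page.

```python
import lxml.html

# Hypothetical fragment imitating the exhibitor table; not the real page.
SAMPLE_HTML = """
<table>
  <tr><td class="mys-elastic mys-left"><a href="/5_0/exhibitor_details.cfm?exhid=1">Acme</a></td></tr>
  <tr><td class="mys-elastic mys-left"><a href="/5_0/exhibitor_details.cfm?exhid=2">Bolt</a></td></tr>
  <tr><td class="other"><a href="/ignored">Skip</a></td></tr>
</table>
"""

BASE = 'http://ahr13.mapyourshow.com'


def extract_links(html, base=BASE):
    """Return absolute URLs for every <a> directly inside a matching <td>."""
    root = lxml.html.fromstring(html)
    # Match the full class attribute, as in the XPath above.
    anchors = root.xpath('//td[@class="mys-elastic mys-left"]/a')
    # Call .get('href') on each element, not on the list of elements.
    return [base + a.get('href') for a in anchors]


links = extract_links(SAMPLE_HTML)
for link in links:
    print(link)
```

If the CSS route is preferred, the two classes would be chained without a space, as in 'td.mys-elastic.mys-left a'.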
import lxml.html as lh
from itertools import chain

URL = 'http://ahr13.mapyourshow.com/5_0/exhibitor_results.cfm?alpha=%40&type=alpha&page='
BASE = 'http://ahr13.mapyourshow.com'
path = '//table[2]//td[@class="mys-elastic mys-left"]//@href'

results = []
for i in range(1, 21):
    # lh.parse accepts a URL and fetches the page itself
    doc = lh.parse(URL + str(i))
    # one generator of absolute links per page
    results.append(BASE + href for href in doc.xpath(path))

# flatten the per-page generators into a single list
print(list(chain(*results)))
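The flattening step in the last snippet can be tried without any network access. The sketch below uses only the standard library and swaps in chain.from_iterable and urljoin in place of chain(*results) and string concatenation. The per-page href lists and the helper name absolute_links are invented for illustration and stand in for what doc.xpath(path) would return.

```python
from itertools import chain
from urllib.parse import urljoin  # Python 3; on Python 2 this lives in the urlparse module

BASE = 'http://ahr13.mapyourshow.com'

# Hypothetical hrefs standing in for the XPath results of two pages.
pages = [
    ['/5_0/exhibitor_details.cfm?exhid=1', '/5_0/exhibitor_details.cfm?exhid=2'],
    ['/5_0/exhibitor_details.cfm?exhid=3'],
]


def absolute_links(hrefs_per_page, base=BASE):
    """Flatten the per-page href lists and resolve each one against base."""
    flat = chain.from_iterable(hrefs_per_page)
    return [urljoin(base, href) for href in flat]


all_links = absolute_links(pages)
print(all_links)
```

urljoin also copes with hrefs that are already absolute, which plain concatenation with BASE would not.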