How to browse over a page using PhantomJS and Selenium

I have some DIV elements on a web page. In total there are about 30 DIV blocks with the following similar structure:

<div class="w-dyn-item">
  <a href="/project/soft" class="jobs-wrapper no-line w-inline-block w-clearfix">
    <div class="jobs-client">
      <img data-qazy="true" src="https://global.com/test.jpg" alt="Soft" class="image-9">
      <div style="background-color:#cd7f32" class="job-time">Level 1</div>
    </div>
    <div class="jobs-content w-clearfix">
      <div class="w-clearfix">
        <div class="text-block-19 w-condition-invisible">PROMO</div>
        <h3 class="job-title">Soft</h3>
        <img height="30" data-qazy="true" src="https://global.com/test.jpg" alt="Soft" class="image-15 w-hidden-main w-hidden-medium w-hidden-small">
      </div>
      <div class="div-block w-clearfix">
        <div class="text-block-4">Italy</div>
        <div class="text-block-4 w-hidden-small w-hidden-tiny">AMB</div>
        <div class="text-block-4 w-hidden-small w-hidden-tiny">GTL</div>
        <div class="text-block-13">January 10, 2017</div>
        <div class="text-block-14">End date:</div>
      </div>
      <div class="space small"></div>
      <p class="paragraph-3">Text text text</p>
    </div>
  </a>
</div>

I am trying to access the a href and click on the link. However, I cannot use find_element_by_link_text, because the link text does not exist. Is it possible to access the a href by its class, class="jobs-wrapper no-line w-inline-block w-clearfix"? When I used find_element_by_class_name, I got the error Message: {"errorMessage":"Compound class names not permitted","request

from selenium import webdriver
driver = webdriver.PhantomJS()
driver.set_window_size(1120, 550)
driver.get("https://myurl.com/")
driver.find_element_by_link_text("//a href").click()
print driver.current_url
driver.quit()

If your only requirement is to click the a tag inside an element with the w-dyn-item class, you could do it like this:

driver.find_element_by_class_name("w-dyn-item").find_element_by_tag_name("a").click()


To iterate over all elements with the w-dyn-item class, click the a tag inside each one, do something on the linked page, and then go back, do this:

tags = driver.find_elements_by_class_name("w-dyn-item")
for i in range(len(tags)):
    tag = driver.find_elements_by_class_name("w-dyn-item")[i]
    tag.find_element_by_tag_name("a").click()
    # Do what you want inside the page...
    driver.back()

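The key here is, of course, to go back to the root page once you're done with the inner page. The elements are looked up again on every iteration because references obtained before navigating away go stale once the page reloads.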

The error you're getting occurs because Selenium's find_element_by_class_name does not support multiple (compound) class names.
Use a CSS selector with find_elements_by_css_selector instead:

driver.find_elements_by_css_selector('.jobs-wrapper.no-line.w-inline-block.w-clearfix')

This will select all tags that carry all of the wanted classes; you can then iterate over them and call click() or perform any other action.
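As a small illustration of how the selector above relates to the class attribute, here is a minimal sketch that builds it from the space-separated class string. The helper name class_attr_to_css and its tag parameter are made up for this example, and it assumes the class names contain no characters that would need CSS escaping.

```python
def class_attr_to_css(class_attr, tag=""):
    """Turn a space-separated class attribute into a CSS selector.

    Hypothetical helper: assumes plain class names that need no escaping.
    """
    classes = class_attr.split()
    if not classes:
        raise ValueError("class attribute is empty")
    # Chaining ".name" parts matches elements that carry ALL of the classes.
    return tag + "".join("." + name for name in classes)


selector = class_attr_to_css("jobs-wrapper no-line w-inline-block w-clearfix", tag="a")
print(selector)  # a.jobs-wrapper.no-line.w-inline-block.w-clearfix
```

The resulting string can be passed to find_elements_by_css_selector in the same way as the hand-written selector.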

EDIT

Following your comment, here is a new snippet to help you do what you want:

result = {}
urls = []
# 'elements' is the list you previously obtained using the CSS selector
for element in elements:
    urls.append(element.get_attribute('href'))


# Now you can iterate over all extracted hrefs:
for url in urls:
    url_data = {}
    driver.get(url)
    field1 = driver.find_element_by_id('wanted_id_1')
    url_data['field1'] = field1
    field2 = driver.find_element_by_id('wanted_id_2')
    url_data['field2'] = field2
    result[url] = url_data

Now, result is a dictionary with a structure similar to what you wanted.

Note that field1 and field2 are of type WebElement, so you'll probably need to do something with them first (extract an attribute, the text, etc.).
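One possible way to do that extraction is sketched below: it stores each element's visible text rather than the WebElement itself. The function name extract_text_fields is hypothetical, the ids are the placeholders from the snippet above, and it assumes the driver exposes the generic find_element(by, value) method, where "id" is the string value behind Selenium's By.ID.

```python
def extract_text_fields(driver, field_ids):
    """Return {field name: visible text} for the given element ids.

    Hypothetical helper: 'driver' is any object with Selenium's
    find_element(by, value) method; 'field_ids' maps names to element ids.
    """
    data = {}
    for name, element_id in field_ids.items():
        # "id" is the locator strategy string that By.ID stands for.
        element = driver.find_element("id", element_id)
        # Keep plain text so the result stays usable after navigating away.
        data[name] = element.text.strip()
    return data


# Usage inside the loop above (ids are placeholders):
# result[url] = extract_text_fields(
#     driver, {"field1": "wanted_id_1", "field2": "wanted_id_2"})
```

Storing strings instead of WebElement objects also avoids stale references once driver.get() loads the next URL.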

Also, on a personal note, look into requests together with BeautifulSoup; they might be a much better fit than Selenium for this or similar future cases.

To access and click the a href, you can use the following line of code:

driver.find_element_by_xpath("//div[@class='w-dyn-item']/a[@href='/project/soft']").click()
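To see what this XPath matches without starting a browser, the sketch below runs the same expression against a simplified, well-formed copy of the markup using Python's standard xml.etree.ElementTree (which supports only a subset of XPath, so this is a stand-in, not Selenium's engine). The second block with /project/hard is invented for contrast.

```python
import xml.etree.ElementTree as ET

# Simplified, well-formed stand-in for the page; "/project/hard" is made up.
SAMPLE = """
<body>
  <div class="w-dyn-item">
    <a href="/project/soft" class="jobs-wrapper no-line w-inline-block w-clearfix">
      <h3 class="job-title">Soft</h3>
    </a>
  </div>
  <div class="w-dyn-item">
    <a href="/project/hard" class="jobs-wrapper no-line w-inline-block w-clearfix">
      <h3 class="job-title">Hard</h3>
    </a>
  </div>
</body>
"""

root = ET.fromstring(SAMPLE)
# ElementTree needs a leading "." for a search relative to the root element.
matches = root.findall(".//div[@class='w-dyn-item']/a[@href='/project/soft']")
titles = [a.findtext("h3") for a in matches]
print(titles)  # ['Soft']
```

Note that @class='w-dyn-item' compares the whole attribute value, so the expression stops matching if the div ever carries additional classes, and the fixed href targets only this one link.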
