Scrape values from Website using Selenium

I am trying to extract data from the following website:

https://www.tipranks.com/stocks/sui/stock-analysis

I am targeting the value "6" in the octagon:

(screenshot of the octagon from the stock-analysis page)

I believe I am targeting the correct xpath.

Here is my code:

import sys
import os
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary
from selenium import webdriver

os.environ['MOZ_HEADLESS'] = '1'
binary = FirefoxBinary('C:/Program Files/Mozilla Firefox/firefox.exe', log_file=sys.stdout)

browser = webdriver.PhantomJS(service_args=["--load-images=no", '--disk-cache=true'])

url = 'https://www.tipranks.com/stocks/sui/stock-analysis'
xpath = '/html/body/div[1]/div/div/div/div/main/div/div/article/div[2]/div/main/div[1]/div[2]/section[1]/div[1]/div[1]/div/svg/text/tspan'
browser.get(url)

element = browser.find_element_by_xpath(xpath)

print(element)

Here is the error that I get back:

Traceback (most recent call last):
  File "C:/Users/jaspa/PycharmProjects/ig-markets-api-python-library/trader/market_signal_IV_test.py", line 15, in <module>
    element = browser.find_element_by_xpath(xpath)
  File "C:\Users\jaspa\AppData\Local\Programs\Python\Python36-32\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 394, in find_element_by_xpath
    return self.find_element(by=By.XPATH, value=xpath)
  File "C:\Users\jaspa\AppData\Local\Programs\Python\Python36-32\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 978, in find_element
    'value': value})['value']
  File "C:\Users\jaspa\AppData\Local\Programs\Python\Python36-32\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 321, in execute
    self.error_handler.check_response(response)
  File "C:\Users\jaspa\AppData\Local\Programs\Python\Python36-32\lib\site-packages\selenium\webdriver\remote\errorhandler.py", line 242, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.NoSuchElementException: Message: {"errorMessage":"Unable to find element with xpath '/html/body/div[1]/div/div/div/div/main/div/div/article/div[2]/div/main/div[1]/div[2]/section[1]/div[1]/div[1]/div/svg/text/tspan'","request":{"headers":{"Accept":"application/json","Accept-Encoding":"identity","Content-Length":"96","Content-Type":"application/json;charset=UTF-8","Host":"127.0.0.1:51786","User-Agent":"selenium/3.141.0 (python windows)"},"httpVersion":"1.1","method":"POST","post":"{\"using\": \"xpath\", \"value\": \"/h3/div/span\", \"sessionId\": \"d8e91c70-9139-11e9-a9c9-21561f67b079\"}","url":"/element","urlParsed":{"anchor":"","query":"","file":"element","directory":"/","path":"/element","relative":"/element","port":"","host":"","password":"","user":"","userInfo":"","authority":"","protocol":"","source":"/element","queryKey":{},"chunks":["element"]},"urlOriginal":"/session/d8e91c70-9139-11e9-a9c9-21561f67b079/element"}}
Screenshot: available via screen

I can see that the issue is due to an incorrect xpath, but I can't figure out why.

I should also point out that Selenium struck me as the best method for scraping this site, since I intend to extract other values and repeat these queries for different stocks across a number of pages. If anybody thinks I would be better off with BeautifulSoup, lxml, etc., then I am happy to hear suggestions!

Thanks in advance!

You don't even need to declare the whole path. The octagon value is inside a div with the class client-components-ValueChange-shape__Octagon, so just search for that div.

# Select every div whose class attribute is exactly the octagon class
elements = browser.find_elements_by_css_selector("div[class='client-components-ValueChange-shape__Octagon']")
for element in elements:
    print(element.text)

Output:

6
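
If only the first match is needed, the same selector works with the singular lookup as well. This is just a sketch using the Selenium 3 find_element_by_css_selector API that the question already relies on; it is not part of the original answer:

element = browser.find_element_by_css_selector("div[class='client-components-ValueChange-shape__Octagon']")
print(element.text)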

You seem to have two issues here:

For the xpath, I just did:

xpath = '//div[@class="client-components-ValueChange-shape__Octagon"]'

And then do:

print(element.text)

And it gets the value you want. However, your code doesn't actually wait until the browser has finished loading the page before evaluating the xpath. For me, using Firefox, I only get the value about 40% of the time this way. There are many ways to handle this with Selenium; the simplest is probably to just sleep for a few seconds between the browser.get and the xpath statement.
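
As an illustration of one of those ways, here is a sketch that uses Selenium's standard explicit-wait helpers instead of a fixed sleep; the 10-second timeout is an arbitrary choice, not a value from the answer, and browser and url are the objects already defined in the question's code:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

xpath = '//div[@class="client-components-ValueChange-shape__Octagon"]'
browser.get(url)

# Block for up to 10 seconds until the octagon div appears in the DOM
element = WebDriverWait(browser, 10).until(
    EC.presence_of_element_located((By.XPATH, xpath))
)
print(element.text)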

You also seem to be setting up Firefox but then using PhantomJS. I did not try this with Phantom; the sleep behaviour may be unnecessary there.
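
For completeness, here is a minimal sketch of driving Firefox headlessly and dropping PhantomJS entirely. It assumes geckodriver is installed and on the PATH; the answer itself does not include this code:

from selenium import webdriver
from selenium.webdriver.firefox.options import Options

options = Options()
options.headless = True  # run Firefox without a visible window

browser = webdriver.Firefox(options=options)
browser.get('https://www.tipranks.com/stocks/sui/stock-analysis')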

You can try the CSS selector [class$='shape__Octagon'] to target the content; the $= attribute selector matches any element whose class value ends with shape__Octagon. If I went with pyppeteer, I would do something like the following:

import asyncio
from pyppeteer import launch

async def get_content(url):
    browser = await launch({"headless": True})
    [page] = await browser.pages()
    await page.goto(url)
    # Wait until the page's JavaScript has rendered the octagon element
    await page.waitForSelector("[class$='shape__Octagon']")
    value = await page.querySelectorEval("[class$='shape__Octagon']", "e => e.innerText")
    await browser.close()
    return value

if __name__ == "__main__":
    url = "https://www.tipranks.com/stocks/sui/stock-analysis"
    loop = asyncio.get_event_loop()
    result = loop.run_until_complete(get_content(url))
    print(result.strip())

Output:

6
