
Scraping Yahoo stock news

I am scraping the news articles related to Infosys at the end of the page, but I get the error selenium.common.exceptions.InvalidSelectorException: Message: invalid selector. I want to scrape all articles related to Infosys.

from bs4 import BeautifulSoup
import re
from selenium import webdriver
import chromedriver_binary
import string
import time
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By

driver = webdriver.Chrome("/Users/abhishekgupta/Downloads/chromedriver")
driver.get("https://finance.yahoo.com/quote/INFY/news?p=INFY")

for i in range(20): # adjust integer value for need
       # you can change right side number for scroll convenience or destination 
       driver.execute_script("window.scrollBy(0, 250)")
       # you can change time integer to float or remove
       time.sleep(1)

print(driver.find_element_by_xpath('//*[@id="latestQuoteNewsStream-0-Stream"]/ul/li[9]/div/div/div[2]/h3/a/text()').text())

You could use a less detailed XPath by writing // instead of /div/div/div[2].

And if you want the last item, get all the li elements as a list and then use [-1] to get the last element of that list.

from selenium import webdriver
import time

driver = webdriver.Chrome("/Users/abhishekgupta/Downloads/chromedriver")
#driver = webdriver.Firefox()

driver.get("https://finance.yahoo.com/quote/INFY/news?p=INFY")

for i in range(20):
       driver.execute_script("window.scrollBy(0, 250)")
       time.sleep(1)

all_items = driver.find_elements_by_xpath('//*[@id="latestQuoteNewsStream-0-Stream"]/ul/li')

#for item in all_items:
#    print(item.find_element_by_xpath('.//h3/a').text)
#    print(item.find_element_by_xpath('.//p').text)
#    print('---')
    
print(all_items[-1].find_element_by_xpath('.//h3/a').text)
print(all_items[-1].find_element_by_xpath('.//p').text)
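
The same two ideas (a relative `.//` path inside each item, and `[-1]` for the last item) can be tried without a browser. The following is only a sketch: the markup is a hypothetical, simplified stand-in for the news stream, and it uses the standard library's `xml.etree.ElementTree`, which supports a limited XPath subset, instead of Selenium.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the news stream markup (not the real page).
SAMPLE = """
<div id="latestQuoteNewsStream-0-Stream">
  <ul>
    <li><div><div><div><h3><a href="/news/1">First headline</a></h3><p>First summary</p></div></div></div></li>
    <li><div><div><div><h3><a href="/news/2">Second headline</a></h3><p>Second summary</p></div></div></div></li>
    <li><div><div><div><h3><a href="/news/3">Last headline</a></h3><p>Last summary</p></div></div></div></li>
  </ul>
</div>
"""

root = ET.fromstring(SAMPLE)

# Collect every <li> once, then search inside each one with a relative path.
all_items = root.findall("./ul/li")

# './/h3/a' skips the intermediate <div> levels instead of spelling them out.
headlines = [item.find(".//h3/a").text for item in all_items]

# [-1] picks the last item of the list, however many items were loaded.
last_item = all_items[-1]
last_headline = last_item.find(".//h3/a").text
last_summary = last_item.find(".//p").text

print(headlines)
print(last_headline, "|", last_summary)
```

Because the path is relative to each item, it keeps working if the number of wrapper div elements around the headline changes.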

The XPath you provided does not exist in the page.

Download the xPath Finder Chrome extension to find the correct XPath for the articles.

Here is an example XPath for the articles list; you need to loop through the index marked ID:

/html/body/div[1]/div/div/div[1]/div/div[3]/div[1]/div/div[5]/div/div/div/ul/li[ID]/div/div/div[2]/h3/a/u
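
Looping over the index could look roughly like the sketch below. It only builds the XPath strings; the names `ARTICLE_XPATH` and `article_xpaths` are hypothetical, and an absolute path like this one is likely to break whenever the page layout changes.

```python
# The XPath from above, with the li index left as a placeholder.
ARTICLE_XPATH = (
    "/html/body/div[1]/div/div/div[1]/div/div[3]/div[1]/div/div[5]"
    "/div/div/div/ul/li[{index}]/div/div/div[2]/h3/a/u"
)


def article_xpaths(count):
    """Return one XPath per article; XPath positions start at 1, not 0."""
    return [ARTICLE_XPATH.format(index=i) for i in range(1, count + 1)]


paths = article_xpaths(3)
for path in paths:
    print(path)
```

Each generated string can then be passed to the driver's XPath lookup in turn.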

I think your code is fine except for one thing: retrieving text or links with XPath works a little differently in Selenium than in Scrapy or lxml's fromstring. In Selenium the XPath has to select an element, not a text() node, and .text is an attribute rather than a method. Here is something that should work for you:

# use this code for printing instead
print(driver.find_element_by_xpath('//*[@id="latestQuoteNewsStream-0-Stream"]/ul/li[9]/div/div/div[2]/h3/a').text)

Selecting the whole container also works, since there is only one element with this id, so you can simply use:

# this should also work fine
print(driver.find_element_by_xpath('//*[@id="latestQuoteNewsStream-0-Stream"]').text)
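
To make the difference concrete, here is a small sketch that removes a trailing `/text()` step before handing the XPath to Selenium and then reads the `.text` attribute. The helper names `to_element_xpath` and `read_text` are hypothetical. The sketch assumes Selenium 4, where `find_element` takes the locator strategy as its first argument and the `find_element_by_xpath` shortcut used in this thread is no longer available.

```python
def to_element_xpath(xpath):
    """Drop a trailing /text() step so the XPath selects the element itself."""
    suffix = "/text()"
    return xpath[: -len(suffix)] if xpath.endswith(suffix) else xpath


def read_text(driver, xpath):
    """Locate one element and return its visible text.

    `driver` is assumed to be a Selenium 4 WebDriver. "xpath" is the string
    value behind selenium.webdriver.common.by.By.XPATH.
    """
    element = driver.find_element("xpath", to_element_xpath(xpath))
    return element.text  # .text is a property, so no parentheses
```

With a running driver, `read_text(driver, '//*[@id="latestQuoteNewsStream-0-Stream"]/ul/li[9]//h3/a/text()')` would look up the `a` element and return its text, rather than raising InvalidSelectorException for the text() node.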

