Web scraping with RSelenium findElement

I feel this is supposed to be simple, but I have been struggling to get it right. I'm trying to extract the number of employees ("2,300,000") from this webpage: https://fortune.com/company/walmart/

I used the Chrome extension SelectorGadget to locate the number. It gave me the selector "info__row--7f9lE:nth-child(13) .info__value--2AHH7".

```
library(RSelenium)
library(rvest)
library(netstat)

rs_driver_object<-rsDriver(browser='chrome',chromever='103.0.5060.53',verbose=FALSE, port=free_port())
remDr<-rs_driver_object$client
remDr$navigate('https://fortune.com/company/walmart/')
Employees<-remDr$findElement(using = 'xpath','//h3[@class="info__row--7f9lE:nth-child(13) .info__value--2AHH7"]')
Employees
```

An error says 

> "Selenium message:no such element: Unable to locate element".

I have also tried:
```
Employees<-remDr$findElement(using = 'class name','info__value--2AHH7')
```
But it does not return the data I want.


Can someone point out the problem? Really appreciate it! 

Update: I modified the code as suggested by Frodo in the comment below so that it runs over multiple webpages and saves the statistics in a data frame, but I still get an error.

```
library(RSelenium)
library(rvest)
library(netstat)

rs_driver_object <- rsDriver(browser = 'chrome', chromever = '103.0.5060.53', verbose = FALSE, port = netstat::free_port())
remDr <- rs_driver_object$client

Data <- data.frame("url" = c("https://fortune.com/company/walmart/",
                             "https://fortune.com/company/amazon-com/",
                             "https://fortune.com/company/apple/",
                             "https://fortune.com/company/cvs-health/",
                             "https://fortune.com/company/jpmorgan-chase/",
                             "https://fortune.com/company/verizon/",
                             "https://fortune.com/company/ford-motor/",
                             "https://fortune.com/company/general-motors/",
                             "https://fortune.com/company/anthem/",
                             "https://fortune.com/company/centene/",
                             "https://fortune.com/company/fannie-mae/",
                             "https://fortune.com/company/comcast/",
                             "https://fortune.com/company/chevron/",
                             "https://fortune.com/company/dell-technologies/",
                             "https://fortune.com/company/bank-of-america-corp/",
                             "https://fortune.com/company/target/"))

# Placeholder column for the employee counts (filled in as text below)
Data$numEmp <- NA_character_

for (i in seq_along(Data$url)) {
  remDr$navigate(url = Data$url[i])
  pgSrc <- remDr$getPageSource()
  pgCnt <- read_html(pgSrc[[1]])
  Data$numEmp[i] <- pgCnt %>%
    html_nodes(xpath = "//div[text()='Employees']/following-sibling::div") %>%
    html_text(trim = TRUE)
}
Data$numEmp
```

Selenium message:unknown error: unexpected command response (Session info: chrome=103.0.5060.114) Build info: version: '4.0.0-alpha-2', revision: 'f148142cf8', time: '2019-07-01T21:30:10' System info: host: 'DESKTOP-VCCIL8P', ip: '192.168.1.249', os.name: 'Windows 10', os.arch: 'amd64', os.version: '10.0', java.version: '1.8.0_311' Driver info: driver.version: unknown

Error: Summary: UnknownError Detail: An unknown server-side error occurred while processing the command. class: org.openqa.selenium.WebDriverException Further Details: run errorDetails method

Can someone please take another look?

Use RSelenium to load the webpage and get the page source:

```
remDr$navigate(url = 'https://fortune.com/company/walmart/')
pgSrc <- remDr$getPageSource()
```

Use rvest to read the contents of the webpage:

```
pgCnt <- read_html(pgSrc[[1]])
```

Then use the rvest::html_nodes and rvest::html_text functions to extract the text with a suitable XPath selector (a Chrome extension can help with finding one):

```
reqTxt <- pgCnt %>%
  html_nodes(xpath = "//div[text()='Employees']/following-sibling::div") %>%
  html_text(trim = TRUE)
```

Output of reqTxt:

```
> reqTxt
[1] "2,300,000"
```
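To see what that XPath does without starting a browser, here is a minimal sketch that runs the same selector on a small inline HTML fragment. The fragment is an assumption about the page layout (a label div followed by a value div), not a copy of the real markup.

```r
library(rvest)

# Hypothetical fragment: a label <div> followed by a sibling value <div>.
# The real Fortune page markup may differ.
html <- '
<div>
  <div>Employees</div>
  <div> 2,300,000 </div>
</div>'

pg <- read_html(html)

# Find the div whose text is exactly "Employees", then take the div
# elements that follow it under the same parent.
req_txt <- pg %>%
  html_nodes(xpath = "//div[text()='Employees']/following-sibling::div") %>%
  html_text(trim = TRUE)

print(req_txt)
```

Because the selector keys on the visible label rather than on generated class names, it should keep working if the class suffixes change, as long as the label text stays "Employees".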

UPDATE

The error "Selenium message:unknown error: unexpected command response" seems to occur specifically with version 103 of Chromedriver. One of the answers in the discussion of that issue suggests simply waiting 5 seconds before and after the driver navigates to the URL. I have also wrapped the navigation in tryCatch inside a while loop, so the code keeps retrying until the page loads. This seems to work.

```
# Function to fetch employee count
getEmployees <- function(myURL) {
  loaded <- FALSE
  # Keep retrying until navigate() completes without an error
  while (!loaded) {
    loaded <- tryCatch(
      {
        remDr$navigate(url = myURL)
        TRUE
      },
      error = function(e) FALSE
    )
  }
  pgSrc <- remDr$getPageSource()
  pgCnt <- read_html(pgSrc[[1]])
  pgCnt %>%
    html_nodes(xpath = "//div[text()='Employees']/following-sibling::div") %>%
    html_text(trim = TRUE)
}
```
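Note that the loop in getEmployees retries forever, so it would hang on a URL that never loads. A bounded variant might look like the sketch below. `retry` and `flaky` are hypothetical names introduced here for illustration; `flaky` stands in for `remDr$navigate()` so that the example runs without a browser.

```r
# Hypothetical helper: call `f` until it succeeds or `max_tries` is reached.
retry <- function(f, max_tries = 5, wait = 0) {
  for (attempt in seq_len(max_tries)) {
    result <- tryCatch(
      list(ok = TRUE, value = f()),
      error = function(e) list(ok = FALSE, value = conditionMessage(e))
    )
    if (result$ok) return(result$value)
    Sys.sleep(wait)  # pause before the next attempt
  }
  stop("all ", max_tries, " attempts failed; last error: ", result$value)
}

# Stand-in for remDr$navigate(): fails twice, then succeeds.
calls <- 0
flaky <- function() {
  calls <<- calls + 1
  if (calls < 3) stop("unexpected command response")
  "loaded"
}

status <- retry(flaky, max_tries = 5)
print(status)
```

With a real driver, the call would be something like `retry(function() remDr$navigate(url = myURL), wait = 5)`.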

Apply getEmployees to every URL in your data frame:

```
for (i in seq_len(nrow(Data))) {
  Sys.sleep(5)
  Data[i, 2] <- getEmployees(Data[i, 1])
  Sys.sleep(5)
}
```

The second column now contains:

```
> Data[, 2]
 [1] "2,300,000" "1,608,000" "154,000"   "258,000"   "271,025"   "118,400"  
 [7] "183,000"   "157,000"   "98,200"    "72,500"    "7,400"     "189,000"  
[13] "42,595"    "133,000"   "208,248"   "450,000"  
```
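The scraped values are character strings with thousands separators. If numbers are needed, one possible conversion in base R is sketched below. The vector holds a few of the values from the output above, and the variable names are made up for illustration.

```r
# A few of the scraped values, as character strings
num_emp_chr <- c("2,300,000", "1,608,000", "154,000", "7,400")

# Drop the thousands separators, then convert to numeric
num_emp <- as.numeric(gsub(",", "", num_emp_chr, fixed = TRUE))

print(num_emp)
```

On the full data frame this would be something like `Data$numEmp <- as.numeric(gsub(",", "", Data$numEmp, fixed = TRUE))`.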

Does it have to be RSelenium only? In my experience, the most flexible approach is to use RSelenium to navigate to the required pages (where findElement helps you find boxes to type into or buttons to click) and then use rvest to extract what you need from the page.

Start with:

```
rs_driver_object <- rsDriver(browser = 'chrome', chromever = '103.0.5060.53', verbose = FALSE, port = netstat::free_port())
remDr <- rs_driver_object$client
remDr$navigate('https://fortune.com/company/walmart/')
page_source <- remDr$getPageSource()
pg <- xml2::read_html(page_source[[1]])
```

How you go about it from there depends on how specific to this exact page you want the solution to be. Here is one way:

```
rvest::html_elements(pg, "div.info__row--7f9lE") |> 
  rvest::html_text2()
```

or

```
rvest::html_elements(pg, "div:nth-child(13) > div.info__value--2AHH7") |> 
  rvest::html_text2()
```

or

```
rvest::html_elements(pg, "div.info__row--7f9lE")[11] |> 
  rvest::html_children()
```

or

```
rvest::html_elements(pg, '.info__row--7f9lE:nth-child(13) .info__value--2AHH7') |> 
  rvest::html_text2()
```

et cetera. What you do in the rvest part depends on how general you want the selection and extraction process to be.
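As a sketch of a more general variant, the rows can be matched on the class-name prefix and looked up by label instead of by position. This assumes the suffixes such as `--7f9lE` are generated and may change, and that each row holds one label div and one value div. The HTML fragment is hypothetical: only the row and value class names come from this thread, while the label class and the "Sector" row are made up.

```r
library(rvest)

# Hypothetical fragment imitating the page's info rows.
html <- '
<div>
  <div class="info__row--7f9lE">
    <div class="info__label--XXXXX">Sector</div>
    <div class="info__value--2AHH7">Retailing</div>
  </div>
  <div class="info__row--7f9lE">
    <div class="info__label--XXXXX">Employees</div>
    <div class="info__value--2AHH7">2,300,000</div>
  </div>
</div>'

pg <- read_html(html)

# Match on the class prefix so a changed suffix does not break the selector.
rows   <- html_elements(pg, "div[class^='info__row']")
labels <- html_text2(html_element(rows, "div[class^='info__label']"))
values <- html_text2(html_element(rows, "div[class^='info__value']"))

# Named character vector: look values up by label rather than by position.
info <- setNames(values, labels)
employees <- info[["Employees"]]
print(employees)
```

The prefix match only works when the class attribute starts with that prefix, so it would need adjusting if the elements carry several classes.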
