Web page not found in web scraping, how can I find it in R?

I've been working with R for about a year and love it. I recently got into text mining and have run into some difficulty. I'm trying to create a data frame with information from a website. I've been scraping the data and have created two variables successfully, but the third one isn't working. When I view the table I've made, the content for that variable says "Sorry webpage cannot be found." But I know the page is there. Any thoughts? Thanks, everyone!

library(rvest)

link = "https://www.fmprc.gov.cn/mfa_eng/wjdt_665385/zyjh_665391/"
page = read_html(link)
# Title of each speech in the listing
title = page %>% html_nodes(".newsLst_mod a") %>% html_text()
# Link to each speech page
slinks = page %>% html_nodes(".newsLst_mod a") %>%
  html_attr("href") %>% paste("https://www.fmprc.gov.cn", ., sep = "")
# Publication date of each speech
date = page %>% html_nodes(".newsLst_mod span") %>% html_text()

This is where I run into trouble. SelectorGadget gives me 'p' as the selector, so I pass that to html_nodes(). However, this doesn't seem to work and I come up empty. If I adjust the selector a little, the column may end up with nothing in it when I view the table.

get_s = function(slinks) {
  speeches_link = read_html(slinks)
  speech_words = speeches_link %>% html_nodes("p") %>%
    html_text() %>% paste(collapse = ",")
  return(speech_words)
}

[Image: what the table looks like]

words = sapply(slinks, FUN = get_s)
speeches = data.frame(title, date, words, stringsAsFactors = FALSE)

The base URL that you need to paste in front of each link is https://www.fmprc.gov.cn/mfa_eng/wjdt_665385/zyjh_665391, not just https://www.fmprc.gov.cn.

Try the following:

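library(rvest)

# `page` is the result of read_html(link) from the question.
# Strip the leading "." from each href, then prepend the listing's base URL.
# Note: the `whitespace` argument of trimws() requires R >= 3.6.0.
slinks = page %>% html_nodes(".newsLst_mod a") %>%
  html_attr("href") %>% trimws(whitespace = '\\.')  %>% 
  paste0("https://www.fmprc.gov.cn/mfa_eng/wjdt_665385/zyjh_665391", .)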

get_s = function(slinks) {
  speeches_link = read_html(slinks)
  speech_words = speeches_link %>% html_nodes("p") %>%
    html_text() %>% paste(collapse = ",")
  return(speech_words)
}

words = sapply(slinks, FUN = get_s)
words
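
As an alternative to trimming the dot by hand, the relative links could also be resolved against the listing URL with xml2::url_absolute(). This is only a sketch: the href values below are made-up placeholders standing in for whatever html_attr("href") returns, not links taken from the site.

```r
library(xml2)

# Base URL of the listing page (from the question); the trailing slash matters
# because relative links are resolved against the last path segment.
base_url <- "https://www.fmprc.gov.cn/mfa_eng/wjdt_665385/zyjh_665391/"

# Hypothetical hrefs for illustration only
hrefs <- c("./202201/t20220101_0001.html", "/mfa_eng/example.html")

# Resolve each href against the base URL, the same way a browser would
resolved <- url_absolute(hrefs, base_url)
print(resolved)
```

With real data, `hrefs` would be replaced by `page %>% html_nodes(".newsLst_mod a") %>% html_attr("href")`. It is worth printing the first resolved link and opening it in a browser before scraping all of them.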
