Web scraping in R using a for loop

Question

I would like to scrape the data from this link, and I have written the following code in R to do so. However, it does not work: it only returns the first page of the results. Apparently the loop is not doing what I expect. Does anybody know what's wrong with the loop?

library('rvest')

for (i in 1:40) {

     webpage <- read_html(paste0(("http://search.beaconforfreedom.org/search/censored_publications/result.html?author=&cauthor=&title=&country=7327&language=&censored_year=&censortype=&published_year=&censorreason=&sort=t&page=, i"))

     rank_data_html <- html_nodes(webpage,'tr+ tr td:nth-child(1)')

     rank_data <- html_text(rank_data_html)

     rank_data<-as.numeric(rank_data)

     title_data_html <- html_nodes(webpage,'.censo_list font')

     title_data <- html_text(title_data_html)

     author_data_html <- html_nodes(webpage,'.censo_list+ td font')
     author_data <- html_text(author_data_html)

     country_data_html <- html_nodes(webpage,'.censo_list~ td:nth-child(4) font')

     rcountry_data <- html_text(country_data_html)

     year_data_html <- html_nodes(webpage,'tr+ tr td:nth-child(5) font')

     year_data <- html_text(year_data_html)

     type_data_html <- html_nodes(webpage,'tr+ tr td:nth-child(6) font')

     type_data <- html_text(type_data_html)

}

censorship_df<-data.frame(Rank = rank_data, Title = title_data, Author = author_data, Country = rcountry_data, Type = type_data, Year = year_data)

write.table(censorship_df, file="sample.csv",sep=",",row.names=F)

Answer 1

Are you sure there's anything wrong with the loop? I would expect it to get the first page of results 40 times. Look at this line:

webpage <- read_html(paste0(("http://search.beaconforfreedom.org/search/censored_publications/result.html?author=&cauthor=&title=&country=7327&language=&censored_year=&censortype=&published_year=&censorreason=&sort=t&page=, i"))

Shouldn't that be the following? The difference is in the last ten characters of the string: the closing quotation mark moves to just after page=, so that i becomes a separate argument. The extra opening parenthesis after paste0 is also dropped here, because it is never closed.

webpage <- read_html(paste0("http://search.beaconforfreedom.org/search/censored_publications/result.html?author=&cauthor=&title=&country=7327&language=&censored_year=&censortype=&published_year=&censorreason=&sort=t&page=", i))

paste0 in R concatenates its arguments into a single string with no separator. In your code it receives only one argument, a string that already contains the text ", i", so every request goes to the literal page=, i. What you want is page=1 through page=40. Close the quotation mark right after page= and pass i as a second argument, like page=", i so that paste0 joins the URL and the value of i.
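To make the difference concrete, here is a minimal sketch that builds the strings without fetching anything. The URL is the one from the question; the variable names (base_url, wrong, right, urls) and the shortened query tail are only illustrative assumptions, not part of the original code.

```r
# URL from the question, up to and including "page=".
base_url <- "http://search.beaconforfreedom.org/search/censored_publications/result.html?author=&cauthor=&title=&country=7327&language=&censored_year=&censortype=&published_year=&censorreason=&sort=t&page="

i <- 3

# Quotation mark closes too late: ", i" is part of the string literal,
# so paste0 gets a single argument and returns it unchanged.
# (Query tail shortened here for readability.)
wrong <- paste0("&sort=t&page=, i")

# Quotation mark closes after "page=": i is a second argument,
# and paste0 appends its value.
right <- paste0("&sort=t&page=", i)

print(wrong)  # "&sort=t&page=, i"
print(right)  # "&sort=t&page=3"

# paste0 is vectorized, so all 40 page URLs can also be built in one call.
urls <- paste0(base_url, 1:40)
print(length(urls))  # 40
```

Inside the loop, the same idea applies: the first argument is the URL ending in page= and the second is the loop variable i.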

I'm not an R programmer, but that simply leaps out at me.

Source for paste0 behavior.
