Web scraping in R using a for loop
I would like to scrape the data from this link, and I have written the following code in R to do so. This, however, does not work and only returns the first page of the results. Apparently, the loop does not work. Does anybody know what's wrong with the loop?
library('rvest')
for (i in 1:40) {
webpage <- read_html(paste0(("http://search.beaconforfreedom.org/search/censored_publications/result.html?author=&cauthor=&title=&country=7327&language=&censored_year=&censortype=&published_year=&censorreason=&sort=t&page=, i"))
rank_data_html <- html_nodes(webpage,'tr+ tr td:nth-child(1)')
rank_data <- html_text(rank_data_html)
rank_data<-as.numeric(rank_data)
title_data_html <- html_nodes(webpage,'.censo_list font')
title_data <- html_text(title_data_html)
author_data_html <- html_nodes(webpage,'.censo_list+ td font')
author_data <- html_text(author_data_html)
country_data_html <- html_nodes(webpage,'.censo_list~ td:nth-child(4) font')
rcountry_data <- html_text(country_data_html)
year_data_html <- html_nodes(webpage,'tr+ tr td:nth-child(5) font')
year_data <- html_text(year_data_html)
type_data_html <- html_nodes(webpage,'tr+ tr td:nth-child(6) font')
type_data <- html_text(type_data_html)
}
censorship_df<-data.frame(Rank = rank_data, Title = title_data, Author = author_data, Country = rcountry_data, Type = type_data, Year = year_data)
write.table(censorship_df, file="sample.csv",sep=",",row.names=F)
Are you sure there's anything wrong with the loop? I would expect it to get the first page of results 40 times. Look at
webpage <- read_html(paste0(("http://search.beaconforfreedom.org/search/censored_publications/result.html?author=&cauthor=&title=&country=7327&language=&censored_year=&censortype=&published_year=&censorreason=&sort=t&page=, i"))
Shouldn't that be the following? The difference is in the last ten characters of the string, where the quotation mark moves so that i sits outside the string, and the stray extra opening parenthesis after paste0 is dropped so that the parentheses balance:
webpage <- read_html(paste0("http://search.beaconforfreedom.org/search/censored_publications/result.html?author=&cauthor=&title=&country=7327&language=&censored_year=&censortype=&published_year=&censorreason=&sort=t&page=", i))
What paste0 does in R is stitch its arguments together into one string without any separator. But you only pass it one string, so it tries to fetch results for page=, i. You want it to fetch page=1 through page=40. So place the closing quotation mark like page=", i so that it pastes the URL and i together.
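To make the difference concrete, here is a minimal sketch in base R. The shortened address in base_url is a hypothetical placeholder standing in for the full search URL; only the position of the quotation mark matters.

```r
# Hypothetical shortened URL, used only for illustration.
base_url <- "http://example.org/result.html?sort=t&page="

i <- 3

# Quotation mark in the wrong place: the comma and the letter i are part of
# the string, so the loop variable is never used.
wrong <- paste0("http://example.org/result.html?sort=t&page=, i")

# Quotation mark closed before the comma: the value of i is appended.
right <- paste0(base_url, i)

print(wrong)
print(right)

# paste0 is vectorised, so all 40 page URLs can also be built in one call.
urls <- paste0(base_url, 1:40)
```

Here wrong ends in the literal text ", i" on every iteration, while right ends in the page number.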
I'm not an R programmer, but that simply leaps out at me.
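One more thing that may be worth checking once the URL is fixed: each pass through the loop overwrites rank_data, title_data and the other vectors, and the data frame is built only after the loop ends, so it would likely hold just the last page. Below is a rough sketch of one way to accumulate every page instead. The names collect_pages and fake_fetch are hypothetical, and fake_fetch is an offline stand-in for the real scraping step, not the actual rvest code.

```r
# Collect one data frame per page, then bind them into a single data frame.
# fetch_page is any function that takes a page number and returns a data frame.
collect_pages <- function(page_numbers, fetch_page) {
  results <- lapply(page_numbers, fetch_page)
  do.call(rbind, results)
}

# Stand-in for the real scraper (an assumption, no network access):
# it returns two dummy rows per page so the accumulation can be checked offline.
fake_fetch <- function(page) {
  data.frame(Page = page, Row = 1:2)
}

all_rows <- collect_pages(1:40, fake_fetch)
print(nrow(all_rows))
```

In the question's code, the function passed as fetch_page would wrap the read_html and html_nodes calls for a single page and return the data.frame for that page; write.table would then be called once on the combined result.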