
Counting words in html documents

I want to count words in html articles using R. Scraping data like titles works nicely and I was able to download the articles (code below). Now I want to count words in all of those articles, for example the word "Merkel".

It seems to be a bit complicated. I was able to make it work with the headlines (throw all headlines into one vector and count the words), but that needed too much manual code, because I had to combine the headlines for each month by hand whenever the search returned more than one page of results. That's why I won't post all of that code here (I'm sure it can be done more easily, but that's another problem).

I think I messed something up, and that's why I couldn't do the same with the html articles. The difference is that I scraped the titles directly, but the html files I had to download first.

So how can I go through my 10,000 (here only 45) html pages and look for some keywords? Example for January; I download the articles with this code:

library(xml2)
library(rvest)
# parse the search-result page for January 2015
url_parsed1 <- read_html("http://www.sueddeutsche.de/news?search=Fl%C3%BCchtlinge&sort=date&dep%5B%5D=politik&typ%5B%5D=article&sys%5B%5D=sz&catsz%5B%5D=alles&time=2015-01-01T00%3A00%2F2015-12-31T23%3A59&startDate=01.01.2015&endDate=31.01.2015")
# collect the links to the individual articles
link_nodes <- html_nodes(url_parsed1, css = ".entrylist__link")
html_links <- html_attr(link_nodes, "href")
# create a folder for the downloaded articles and switch into it
getwd()
dir.create("html_articles")
setwd("html_articles")
# download every article, naming the file after the last part of its URL
for (url in html_links) {
  newName <- paste0(basename(url), ".html")
  download.file(url, destfile = newName)
}

Thanks a lot for your help!

I hope I understood your question correctly:

library(xml2)
library(rvest)
library(XML)
library(stringr)   # for str_count()
url_parsed1 <- read_html("http://www.sueddeutsche.de/news?search=Fl%C3%BCchtlinge&sort=date&dep%5B%5D=politik&typ%5B%5D=article&sys%5B%5D=sz&catsz%5B%5D=alles&time=2015-01-01T00%3A00%2F2015-12-31T23%3A59&startDate=01.01.2015&endDate=31.01.2015")
link_nodes <- html_nodes(url_parsed1, css = ".entrylist__link")
html_links <- html_attr(link_nodes, "href")
getwd()
dir.create("html_articles")
setwd("html_articles")
for (url_org in html_links) {
  # url_org <- html_links[1]
  newName <- paste0(basename(url_org), ".html")

  download.file(url_org, destfile = newName)
  # Read and parse the HTML file
  doc.html <- htmlTreeParse(url_org,
                useInternal = TRUE)
  # Extract all the paragraphs (HTML tag is p, starting at
  # the root of the document). unlist() flattens the list to
  # create a character vector.
  doc.text = unlist(xpathApply(doc.html, '//p', xmlValue))
  # Replace all \n by spaces
  doc.text = gsub('\\n', ' ', doc.text)

  # Join all the elements of the character vector into a single
  # character string, separated by spaces
  doc.text = paste(doc.text, collapse = ' ')
  # Count and print the occurrences of the word "Merkel" in that html
  print(str_count(doc.text, "Merkel"))
}
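
Once the articles have been saved, you don't need to download them again just to count a keyword. Here is a minimal sketch (my own variation, not the code above) that assumes the working directory is still html_articles, that the article text sits in p tags as in the loop above, and that rvest and stringr are installed; it collects the per-file counts of "Merkel" into a data frame:

library(rvest)
library(stringr)

# list all previously downloaded article files
files <- list.files(pattern = "\\.html$")

# count occurrences of "Merkel" in the paragraph text of each file
merkel_counts <- sapply(files, function(f) {
  doc   <- read_html(f)
  paras <- html_text(html_nodes(doc, "p"))
  text  <- paste(paras, collapse = " ")
  str_count(text, "Merkel")
})

# one row per article; sum() gives the total over all articles
result <- data.frame(file = files, merkel = merkel_counts, row.names = NULL)
sum(result$merkel)

The same pattern works for any other keyword: just replace "Merkel" in str_count(), or wrap the sapply() in a function that takes the search term as an argument.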

I would like to give credit to here and here.
