
Counting words in html documents

I want to count words in html articles using R. Scraping data like titles works nicely and I was able to download the articles (code below). Now I want to count words in all of those articles, for example the word "Merkel".

It seems to be a bit complicated. I was able to make it work with the headlines (throw all headlines into one vector and count the words), but that needed too much manual code, because I had to combine the headlines for each month by hand whenever the search returned more than one page of results. That's why I won't post all of that code here (I'm sure it can be done more easily, but that's another problem).

I think I messed something up, and that's why I couldn't do the same with the html articles. The difference is that I scraped the titles directly, but the html files I had to download first.

So how can I go through my 10,000 (here only 45) html pages and look for some keywords? Example for January; I download the articles with this code:

library(xml2)
library(rvest)
# parse the search-result page for January 2015
url_parsed1 <- read_html("http://www.sueddeutsche.de/news?search=Fl%C3%BCchtlinge&sort=date&dep%5B%5D=politik&typ%5B%5D=article&sys%5B%5D=sz&catsz%5B%5D=alles&time=2015-01-01T00%3A00%2F2015-12-31T23%3A59&startDate=01.01.2015&endDate=31.01.2015")
# collect the links to the individual articles
link_nodes <- html_nodes(url_parsed1, css = ".entrylist__link")
html_links <- html_attr(link_nodes, "href")
# create a folder for the downloaded articles and switch into it
getwd()
dir.create("html_articles")
setwd("html_articles")
# download every article, naming the file after the last part of its URL
for (url in html_links) {
  newName <- paste0(basename(url), ".html")
  download.file(url, destfile = newName)
}

Thanks a lot for your help!

I hope I understood your question correctly:

library(xml2)
library(rvest)
library(XML)
library(stringr)   # for str_count()
url_parsed1 <- read_html("http://www.sueddeutsche.de/news?search=Fl%C3%BCchtlinge&sort=date&dep%5B%5D=politik&typ%5B%5D=article&sys%5B%5D=sz&catsz%5B%5D=alles&time=2015-01-01T00%3A00%2F2015-12-31T23%3A59&startDate=01.01.2015&endDate=31.01.2015")
link_nodes <- html_nodes(url_parsed1, css = ".entrylist__link")
html_links <- html_attr(link_nodes, "href")
getwd()
dir.create("html_articles")
setwd("html_articles")
for (url_org in html_links) {
  # url_org <- html_links[1]
  newName <- paste0(basename(url_org), ".html")

  download.file(url_org, destfile = newName)
  # Read and parse the HTML file
  doc.html <- htmlTreeParse(url_org,
                useInternal = TRUE)
  # Extract all the paragraphs (HTML tag is p, starting at
  # the root of the document). unlist() flattens the list to
  # create a character vector.
  doc.text = unlist(xpathApply(doc.html, '//p', xmlValue))
  # Replace all \n by spaces
  doc.text = gsub('\\n', ' ', doc.text)

  # Join all the elements of the character vector into a single
  # character string, separated by spaces
  doc.text = paste(doc.text, collapse = ' ')
  # Count and print the occurrences of the word "Merkel" in that html
  print(str_count(doc.text, "Merkel"))
}
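
Once the articles have been saved, you don't need to download them again just to count a keyword. Here is a minimal sketch (my own variation, not the code above) that assumes the working directory is still html_articles, that the article text sits in p tags as in the loop above, and that rvest and stringr are installed; it collects the per-file counts of "Merkel" into a data frame:

library(rvest)
library(stringr)

# list all previously downloaded article files
files <- list.files(pattern = "\\.html$")

# count occurrences of "Merkel" in the paragraph text of each file
merkel_counts <- sapply(files, function(f) {
  doc   <- read_html(f)
  paras <- html_text(html_nodes(doc, "p"))
  text  <- paste(paras, collapse = " ")
  str_count(text, "Merkel")
})

# one row per article; sum() gives the total over all articles
result <- data.frame(file = files, merkel = merkel_counts, row.names = NULL)
sum(result$merkel)

The same pattern works for any other keyword: just replace "Merkel" in str_count(), or wrap the sapply() in a function that takes the search term as an argument.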

I would like to give credit to here and here.
