
How can I scrape content from a website that's in Simplified Chinese?

I have tested this code on various English-language websites with no problem. However, when I tried to scrape content from a website in Simplified Chinese, the data appeared as gibberish in the CSV file. In addition, the body of the article was spread over multiple rows in Excel instead of being contained in one cell. Can someone help?

install.packages("rvest")  # only needed once
library(rvest)

# Specify the URL of the page to scrape
url <- 'https://new.qq.com/omn/20190823/20190823A02W4Q00.html'

# Read the HTML from the website
webpage <- read_html(url)

# Use a CSS selector to select the title
title_html <- html_nodes(webpage, 'h1')

# Convert the title nodes to text
title_data <- html_text(title_html)

# Use a CSS selector to select the body paragraphs
text_html <- html_nodes(webpage, '.one-p')

# Convert the body nodes to text
text_data <- html_text(text_html)

d <- data.frame(text_data)
write.csv(d, "chinesetext.csv")

Most problems like this are caused by encoding. I tried the guess_encoding function, which guessed UTF-8, but reading the page as UTF-8 fails with this error:

Error in doc_parse_raw(x, encoding = encoding, base_url = base_url, as_html = as_html,  : 
input conversion failed due to input error, bytes 0xC8 0xDD 0x2D 0x2D [6003]

So I switched to the EUC-CN (Extended Unix Code) encoding, and that works.

url <-'https://new.qq.com/omn/20190823/20190823A02W4Q00.html'
webpage <- read_html(url, encoding="euc-cn")
title_html <- html_nodes(webpage,'h1')
title_data <- html_text(title_html)
title_data
[1] "“六稳”政策显效 抗压能力增强"
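
To see why declaring the encoding matters, here is a minimal offline sketch using only base R. It assumes your platform's iconv supports the name "GBK" (a superset of the GB2312 character set that EUC-CN encodes); the variable names are made up for illustration and the two-character sample string stands in for the page content.

```r
# "六稳" written with \u escapes so this script stays ASCII-only.
title <- "\u516d\u7a33"

# Encode the text the way a GBK/GB2312 page would serve it.
gbk_bytes <- iconv(title, from = "UTF-8", to = "GBK", toRaw = TRUE)[[1]]

# Treated as UTF-8, these bytes do not form a valid sequence,
# which is the kind of input a UTF-8 parser rejects.
as_if_utf8 <- rawToChar(gbk_bytes)
is_valid_utf8 <- validUTF8(as_if_utf8)

# Declaring the real source encoding recovers the original text.
decoded <- iconv(as_if_utf8, from = "GBK", to = "UTF-8")

print(is_valid_utf8)
print(decoded == title)
```

Passing encoding = "euc-cn" to read_html plays the same role as the from argument here: it tells the parser how to interpret the raw bytes.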

You may also want the Chinese text to display correctly in your data frame. Add this line before your code, and you can then see the Chinese text in the global environment.

Sys.setlocale("LC_ALL", "Chinese")
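
For the CSV part of the question, one possible approach (a sketch, not tested against the page itself) is to collapse the paragraph vector into a single string before building the data frame, and to write the file as UTF-8. The text_data vector below is a stand-in for the result of html_text(); the column name body and the temporary output path are assumptions for illustration.

```r
# Stand-in for the text_data vector returned by html_text():
# one element per paragraph, written with \u escapes.
text_data <- c("\u7b2c\u4e00\u6bb5", "\u7b2c\u4e8c\u6bb5")

# Join the paragraphs so the article body occupies a single cell.
body <- paste(trimws(text_data), collapse = " ")

d <- data.frame(body = body, stringsAsFactors = FALSE)

# Write the file as UTF-8 and drop the row-name column.
out <- file.path(tempdir(), "chinesetext.csv")
write.csv(d, out, row.names = FALSE, fileEncoding = "UTF-8")

# Read it back to confirm there is exactly one data row.
check <- read.csv(out, encoding = "UTF-8", stringsAsFactors = FALSE)
print(nrow(check))
```

Excel may still need a byte-order mark or an explicit import step to recognise the file as UTF-8; readr::write_excel_csv() is one option that writes such a mark.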


 