R: special characters from HTML tables
I'm working on a simple script to scrape data from HTML tables. The problem is that the table comes back with garbled special characters, even though the page says it is served as UTF-8.
library(XML)
webpage.Name <- "http://www.registeruz.sk/cruz-public/domain/financialreport/show/4817607"
webpage.Name.table <- readHTMLTable(webpage.Name, header = TRUE, which = 1, stringsAsFactors = FALSE)
Example of data scraped:
V1 V2
1 Mimoriadna <NA>
2 <NA>
3 Ă<U+009A>ÄŤtovná jednotka: malá
4 DaĹ<U+0088>ovĂ© identifikaÄŤnĂ© ÄŤĂslo: 2023790373
I tried using gsub to replace certain patterns, but it doesn't seem to work. The same goes for iconv from utf-8 to latin1. It doesn't matter to me whether the scraped data keeps the special characters or not.
Use encoding = "UTF-8" in readHTMLTable():
df <- readHTMLTable(webpage.Name,
header = TRUE, which = 1, stringsAsFactors = FALSE, encoding = "UTF-8")
head(df, 4)
# V1 V2
# 1 Mimoriadna <NA>
# 2 <NA>
# 3 Účtovná jednotka: malá
# 4 Daňové identifikačné číslo: 2023790373
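As a rough illustration of why the garbling appears and why the iconv attempt from the question did not help, here is a minimal sketch in base R. It assumes the garbling comes from UTF-8 bytes being decoded as a single-byte encoding. The pattern in the question looks more like a Central European code page than latin1, but that is an inference from the output, not something confirmed. ISO-8859-1 is used below only because it maps every byte and is widely supported by iconv builds. The variable names are made up for the example.

```r
# Intended text, written with \u escapes so the script does not depend on
# the encoding of the source file.
original <- enc2utf8("\u00da\u010dtovn\u00e1 jednotka")

# Simulate the bug: read the UTF-8 bytes as if each byte were one
# ISO-8859-1 character (assumption: the real page was misread in a
# similar single-byte encoding).
garbled <- iconv(original, from = "ISO-8859-1", to = "UTF-8")

# Undo it: turn each character back into one byte, then declare the
# resulting bytes to be UTF-8.
repaired <- iconv(garbled, from = "UTF-8", to = "ISO-8859-1", mark = FALSE)
Encoding(repaired) <- "UTF-8"

# Each accented letter became two characters in the garbled string.
nchar(garbled) > nchar(original)
identical(repaired, original)
```

A repair like this only works when the "wrong" encoding is guessed correctly, which is probably why converting to latin1 failed on the scraped data. Passing encoding = "UTF-8" at parse time avoids the guesswork.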