Special characters in an HTML table in R

Question

I'm working on a simple script to scrape data from HTML tables. The problem is that the scraped table contains garbled special characters, even though the page is reported as downloaded in UTF-8.

 library(XML)
 webpage.Name <- "http://www.registeruz.sk/cruz-public/domain/financialreport/show/4817607"
 webpage.Name.table <- readHTMLTable(webpage.Name, header = TRUE, which = 1, stringsAsFactors = FALSE)

Example of the scraped data:

     V1                                             V2
1  Mimoriadna                                      <NA>
2                                                  <NA>
3  Ă<U+009A>ÄŤtovnĂˇ jednotka:                     malĂˇ
4  DaĹ<U+0088>ovĂ© identifikaÄŤnĂ© ÄŤĂslo:      2023790373

I tried using gsub to replace certain patterns, but it doesn't seem to work. Converting with iconv from UTF-8 to latin1 didn't help either. It doesn't matter whether the data still contains special characters after scraping.

Answer 1

Pass encoding = "UTF-8" to readHTMLTable() so that the page is parsed as UTF-8:

df <- readHTMLTable(webpage.Name, 
    header = TRUE, which = 1, stringsAsFactors = FALSE, encoding = "UTF-8")
head(df, 4)
#                            V1                          V2
# 1                  Mimoriadna                        <NA>
# 2                                                    <NA>
# 3           Účtovná jednotka:                        malá
# 4 Daňové identifikačné číslo:                  2023790373
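
As a rough illustration of why declaring the encoding helps: the garbled output in the question looks like UTF-8 bytes being interpreted as a single-byte encoding. The base R sketch below reproduces that effect offline on one string and then undoes it by re-declaring the encoding. The variable names are made up for the example, latin1 stands in for whatever single-byte encoding was actually assumed, and this is only an assumption about what happens inside the parser, not a description of its implementation.

```r
# "Účtovná jednotka", written with \u escapes so this script stays ASCII
x <- "\u00da\u010dtovn\u00e1 jednotka"

# The raw UTF-8 bytes, as they would arrive from the web server
bytes <- charToRaw(enc2utf8(x))

# Wrong declaration: the same bytes labelled as latin1 (assumed stand-in encoding)
garbled <- rawToChar(bytes)
Encoding(garbled) <- "latin1"

# Right declaration: the same bytes labelled as UTF-8
fixed <- rawToChar(bytes)
Encoding(fixed) <- "UTF-8"

identical(garbled, x)  # FALSE: every multi-byte character is split into several
identical(fixed, x)    # TRUE: the original text is recovered
```

The bytes never change in this sketch, only the label attached to them, which is consistent with gsub and iconv on the already garbled strings being the harder route compared with telling the parser the encoding up front.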

Answer 1
4, accepted, 2015-10-10 21:25:29
