Encoding with readtext

Question

I want to do some text analysis based on data stored as a .csv file, but I run into encoding problems with the readtext package.

To illustrate my problem, I created the following file in Excel, saving it as .csv (UTF-8):

|---------------------|------------------|
|      c_text         |       c_id       |
|---------------------|------------------|
|      München        |        aa        |
|---------------------|------------------|
|       Laïrie        |        bb        |
|---------------------|------------------|
|        Mános        |        cc        |
|---------------------|------------------|

Then, I load the data in R as follows:

text_raw <- readtext::readtext("path/test_encoding.csv",
                               encoding = "UTF-8",
                               text_field = "c_text")
text_raw

The output is:

readtext object consisting of 3 documents and 1 docvar.
# Description: data.frame [3 x 3]
  doc_id              text              c_id 
  <chr>               <chr>             <chr>
1 test_encoding.csv.1 "\"MÃ¼nchen\"..." aa   
2 test_encoding.csv.2 "\"LaÃ¯rie\"..."  bb   
3 test_encoding.csv.3 "\"MÃ¡nos\"..."   cc

If I then write the object to a .csv file, the output is once again different. The command write.csv(text_raw, file = "path", fileEncoding = "UTF-8") yields the following:

MÃƒÂ¼nchen
LaÃƒÂ¯rie
MÃƒÂ¡nos

Some additional information:

I am using a Windows machine, and my Sys.getlocale() is English_United Kingdom.1252 (apparently, this cannot be changed to UTF-8).
Even if I specify other encodings in the readtext() function (e.g., "utf8", "Windows-1252", "ISO8859-1"), the output doesn't change. However, given that I explicitly saved the test file as UTF-8, I don't understand what's going on.

Any help would be greatly appreciated. Thanks.

Answer 1

I created a pull request, since this was an issue in the readtext package:

https://github.com/quanteda/readtext/pull/151

Until this PR is either accepted or the problem is otherwise fixed, you can use my fork to solve this problem:
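For context, the garbled strings in the question match what you get when UTF-8 bytes are decoded as Windows-1252, once after reading and a second time after writing. The following is only a minimal sketch of that byte-level effect in base R, not the code path inside readtext; the helper misread_as_cp1252() is a hypothetical name introduced for illustration.

```r
# "München" with the u-umlaut written as a Unicode escape, so this script
# stays pure ASCII and does not depend on the editor's encoding.
original <- "M\u00fcnchen"

# Hypothetical helper (not part of readtext): treat the bytes of x as if
# they were Windows-1252 and convert them to UTF-8.
misread_as_cp1252 <- function(x) {
  iconv(x, from = "CP1252", to = "UTF-8")
}

once  <- misread_as_cp1252(original)  # pattern seen after reading the file
twice <- misread_as_cp1252(once)      # pattern seen after write.csv()

# Compare with the garbled strings from the question, written as escapes.
stopifnot(
  identical(once,  "M\u00c3\u00bcnchen"),
  identical(twice, "M\u00c3\u0192\u00c2\u00bcnchen")
)
```

If both checks pass on your machine, the symptom is a decoding mismatch rather than a damaged file.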

remotes::install_github("JBGruber/readtext")

Update

The PR was approved, so install the new package version via:

remotes::install_github("quanteda/readtext")

And then it should work:

df <- structure(list(c_text = structure(c(3L, 1L, 2L),
                                        .Label = c("Laïrie", "Mános", "München"),
                                        class = "factor"),
                     c_id = structure(1:3,
                                      .Label = c("aa", "bb", "cc"),
                                      class = "factor")),
                class = "data.frame",
                row.names = c(NA, -3L))
write.csv(df,
          "~/test.csv",
          row.names = FALSE,
          fileEncoding = "UTF-8")
text_raw <- readtext::readtext("~/test.csv",
                               encoding = "UTF-8",
                               text_field = "c_text")

text_raw
#> readtext object consisting of 3 documents and 1 docvar.
#> # Description: data.frame [3 x 3]
#>   doc_id     text             c_id 
#>   <chr>      <chr>            <chr>
#> 1 test.csv.1 "\"München\"..." aa   
#> 2 test.csv.2 "\"Laïrie\"..."  bb   
#> 3 test.csv.3 "\"Mános\"..."   cc

Created on 2019-05-02 by the reprex package (v0.2.1)
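If the output still looks wrong after updating, one way to rule out the file itself is to inspect its bytes with base R. This is a rough sketch under assumptions: it writes its own temporary test file (the path and the variable names are placeholders), and it only checks the file on disk, not anything readtext does with it.

```r
# Write a small UTF-8 file byte-for-byte, so the check does not depend on
# the session locale. Replace `path` with your own file to test that instead.
path <- tempfile(fileext = ".csv")
lines <- c("c_text,c_id", "M\u00fcnchen,aa", "La\u00efrie,bb", "M\u00e1nos,cc")
writeLines(enc2utf8(lines), path, useBytes = TRUE)

# Look at the raw bytes: in UTF-8 the u-umlaut is the pair 0xC3 0xBC.
bytes <- readBin(path, what = "raw", n = file.size(path))
pair_at <- which(bytes[-length(bytes)] == as.raw(0xc3) &
                 bytes[-1] == as.raw(0xbc))

# Read the lines back, declare them as UTF-8, and validate them.
back <- readLines(path, encoding = "UTF-8")
stopifnot(all(validUTF8(back)), length(pair_at) == 1L)
```

For a file that contains "München" in UTF-8, pair_at should be non-empty; a single 0xFC byte in that position would instead suggest a Latin-1 or Windows-1252 file.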

Answer 1: accepted, 2019-05-01 21:34:21