Encoding with readtext

I want to do some text analysis on data stored as a .csv file, but I run into encoding problems with the readtext package.

To illustrate my problem, I created the following file in Excel and saved it as .csv (UTF-8):

|---------------------|------------------|
|      c_text         |       c_id       |
|---------------------|------------------|
|      München        |        aa        |
|---------------------|------------------|
|       Laïrie        |        bb        |
|---------------------|------------------|
|        Mános        |        cc        |
|---------------------|------------------|

Then I load the data in R as follows:

text_raw <- readtext::readtext("path/test_encoding.csv",
                               encoding = "UTF-8",
                               text_field = "c_text")
text_raw

The output is:

readtext object consisting of 3 documents and 1 docvar.
# Description: data.frame [3 x 3]
  doc_id              text              c_id 
  <chr>               <chr>             <chr>
1 test_encoding.csv.1 "\"München\"..." aa   
2 test_encoding.csv.2 "\"Laïrie\"..."  bb   
3 test_encoding.csv.3 "\"Mános\"..."   cc 

If I then write the object to a .csv file, the output is different once again. The command write.csv(text_raw, file = "path", fileEncoding = "UTF-8") yields the following:

München
Laïrie
Mános

Some additional information:

  • I am using a Windows machine, and Sys.getlocale() reports English_United Kingdom.1252 (apparently, this cannot be changed to UTF-8).

  • Even if I specify other encodings in the readtext() function (e.g., "utf8", "Windows-1252", "ISO8859-1"), the output doesn't change. However, given that I explicitly saved the test file as UTF-8, I don't understand what's going on.

Any help would be greatly appreciated. Thanks.

I created a pull request, since this was an issue in the readtext package:

https://github.com/quanteda/readtext/pull/151
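
To convince yourself that the file itself is fine and that the problem lies in how it is read, you can inspect its raw bytes with base R. The following is only a rough sketch: the temporary file is a hypothetical stand-in for test_encoding.csv, and it assumes that Excel's "CSV UTF-8" export starts the file with a byte order mark (EF BB BF), which is what is commonly reported.

```r
# Stand-in for test_encoding.csv (hypothetical content): a UTF-8 file with a
# byte order mark (BOM), written byte for byte so that the example does not
# depend on the locale of the R session.
bom <- as.raw(c(0xEF, 0xBB, 0xBF))
u_umlaut_utf8 <- as.raw(c(0xC3, 0xBC))  # "ü" encoded in UTF-8
tmp <- tempfile(fileext = ".csv")
writeBin(c(bom,
           charToRaw("c_text,c_id\nM"), u_umlaut_utf8, charToRaw("nchen,aa\n")),
         tmp)

# Read the file back as raw bytes, bypassing any encoding conversion.
bytes <- readBin(tmp, what = "raw", n = file.size(tmp))

# TRUE if the file starts with the UTF-8 BOM.
has_bom <- length(bytes) >= 3 && identical(bytes[1:3], bom)

# In UTF-8, "ü" is the two-byte sequence C3 BC;
# in Windows-1252 it would be the single byte FC.
has_utf8_u_umlaut <- any(head(bytes, -1) == as.raw(0xC3) &
                         tail(bytes, -1) == as.raw(0xBC))
has_cp1252_u_umlaut <- any(bytes == as.raw(0xFC))

print(has_bom)
print(has_utf8_u_umlaut)
print(has_cp1252_u_umlaut)
```

To use this on a real file, replace tmp with the path to the .csv. If the C3 BC pair is present and FC is not, the file is UTF-8 on disk and any garbling is introduced when it is read.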

Until this PR is either accepted or the problem is otherwise fixed, you can use my fork to solve this problem:

remotes::install_github("JBGruber/readtext")

Update

The PR was approved, so install the new package version via:

remotes::install_github("quanteda/readtext")

And then it should work:

df <- structure(list(
  c_text = structure(c(3L, 1L, 2L), .Label = c("Laïrie", "Mános", "München"), class = "factor"),
  c_id = structure(1:3, .Label = c("aa", "bb", "cc"), class = "factor")
), class = "data.frame", row.names = c(NA, -3L))
write.csv(df,
          "~/test.csv",
          row.names = FALSE,
          fileEncoding = "UTF-8")
text_raw <- readtext::readtext("~/test.csv",
                               encoding = "UTF-8",
                               text_field = "c_text")

text_raw
#> readtext object consisting of 3 documents and 1 docvar.
#> # Description: data.frame [3 x 3]
#>   doc_id     text             c_id 
#>   <chr>      <chr>            <chr>
#> 1 test.csv.1 "\"München\"..." aa   
#> 2 test.csv.2 "\"Laïrie\"..."  bb   
#> 3 test.csv.3 "\"Mános\"..."   cc

Created on 2019-05-02 by the reprex package (v0.2.1)
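
If you would rather check the result programmatically than by eye, a small base R report on the text column can help. This is a minimal sketch, not part of readtext: the texts vector is a hypothetical stand-in for text_raw$text, written with \u escapes so that the script itself stays ASCII-only.

```r
# Stand-in for text_raw$text (hypothetical values); replace with the real column.
texts <- c("M\u00fcnchen", "La\u00efrie", "M\u00e1nos")

report <- data.frame(
  declared   = Encoding(texts),                # encoding mark R attaches to each string
  valid_utf8 = validUTF8(texts),               # FALSE would indicate broken byte sequences
  n_chars    = nchar(texts, type = "chars"),   # number of characters
  n_bytes    = nchar(texts, type = "bytes")    # number of bytes used to store them
)
print(report)
```

For correctly decoded UTF-8 text, each accented letter here counts as one character but two bytes. If a string has several more bytes than characters for a single accented letter, it has probably been decoded with the wrong encoding somewhere along the way.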
