Encoding with readtext
I want to do some text analysis on data stored as a .csv file, but I am running into encoding problems with the readtext package.
To illustrate my problem, I created the following file in Excel, saving it as .csv (UTF-8):
|---------------------|------------------|
| c_text | c_id |
|---------------------|------------------|
| München | aa |
|---------------------|------------------|
| Laïrie | bb |
|---------------------|------------------|
| Mános | cc |
|---------------------|------------------|
Then, I load the data in R as follows:
text_raw <- readtext::readtext("path/test_encoding.csv",
                               encoding = "UTF-8",
                               text_field = "c_text")
text_raw
The output is:
readtext object consisting of 3 documents and 1 docvar.
# Description: data.frame [3 x 3]
doc_id text c_id
<chr> <chr> <chr>
1 test_encoding.csv.1 "\"München\"..." aa
2 test_encoding.csv.2 "\"Laïrie\"..." bb
3 test_encoding.csv.3 "\"Mános\"..." cc
If I then write the object to a .csv file, the output is different again. The command
write.csv(text_raw, file = "path", fileEncoding = "UTF-8")
yields the following:
München
Laïrie
Mános
Some additional information:
I am using a Windows machine, and Sys.getlocale() reports English_United Kingdom.1252 (apparently, this cannot be changed to UTF-8).
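For anyone trying to reproduce this, the locale details can be inspected with base R alone. This is only a minimal sketch: the variable names ctype and info are chosen here for illustration, and the values returned will differ from machine to machine.

```r
# Query the character-type locale that R is running under.
ctype <- Sys.getlocale("LC_CTYPE")
ctype

# l10n_info() returns a list describing the native encoding;
# its "UTF-8" element is TRUE only when the session uses a UTF-8 locale.
info <- l10n_info()
info[["UTF-8"]]
```

If info[["UTF-8"]] is FALSE, the session works in a non-UTF-8 native encoding, so any conversion between the file's encoding and the native one becomes relevant.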
Even if I specify other encodings in the readtext() function (e.g., "utf8", "Windows-1252", "ISO8859-1"), the output doesn't change. However, given that I explicitly saved the test file as UTF-8, I don't understand what's going on.
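One way to rule out the file itself is to look at its raw bytes, independently of readtext and of the locale. The sketch below uses base R only; is_utf8_file and has_utf8_bom are hypothetical helper names introduced here, not functions from readtext, and the temporary file merely stands in for the real .csv.

```r
# Hypothetical helpers (not part of readtext): inspect a file's raw bytes.
is_utf8_file <- function(path) {
  bytes <- readBin(path, what = "raw", n = file.size(path))
  # validUTF8() checks the byte sequence itself, regardless of the locale.
  validUTF8(rawToChar(bytes))
}

has_utf8_bom <- function(path) {
  # A UTF-8 byte order mark is the three bytes EF BB BF at the start of a file.
  first <- readBin(path, what = "raw", n = 3L)
  identical(first, as.raw(c(0xEF, 0xBB, 0xBF)))
}

# Stand-in example: a temporary file holding "München" as UTF-8 bytes.
tmp <- tempfile(fileext = ".csv")
writeBin(as.raw(c(0x4D, 0xC3, 0xBC, 0x6E, 0x63, 0x68, 0x65, 0x6E)), tmp)
is_utf8_file(tmp)
has_utf8_bom(tmp)
```

If the real file passes the first check, the bytes on disk are valid UTF-8 and the problem more likely lies in how they are read or printed.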
Any help would be greatly appreciated. Thanks.
I created a pull request, since this was an issue in the readtext package:
https://github.com/quanteda/readtext/pull/151
Until this PR is either accepted or the problem is otherwise fixed, you can use my fork to solve this problem:
remotes::install_github("JBGruber/readtext")
The PR was approved, so install the new package version via:
remotes::install_github("quanteda/readtext")
And then it should work:
df <- structure(list(c_text = structure(c(3L, 1L, 2L), .Label = c("Laïrie",
"Mános", "München"), class = "factor"), c_id = structure(1:3, .Label = c("aa",
"bb", "cc"), class = "factor")), class = "data.frame", row.names = c(NA,
-3L))
write.csv(df,
"~/test.csv",
row.names = FALSE,
fileEncoding = "UTF-8")
text_raw <- readtext::readtext("~/test.csv",
encoding = "UTF-8",
text_field = "c_text")
text_raw
#> readtext object consisting of 3 documents and 1 docvar.
#> # Description: data.frame [3 x 3]
#> doc_id text c_id
#> <chr> <chr> <chr>
#> 1 test.csv.1 "\"München\"..." aa
#> 2 test.csv.2 "\"Laïrie\"..." bb
#> 3 test.csv.3 "\"Mános\"..." cc
Created on 2019-05-02 by the reprex package (v0.2.1)