Encoding issue with write.xlsx (openxlsx)

I use the write.xlsx() function (from the openxlsx package) to turn a list object into an Excel spreadsheet, where each element of the list becomes a sheet of the Excel file. In the past, this function has been incredibly useful, and I have never encountered any issues. It is my understanding that this package, and this function in particular, does not need any particular Java update on the computer in order to work.

However, I recently discovered that the function is producing an error. This is what the console shows when I run write.xlsx() on the list:

Error in gsub("&", "&amp;", v, fixed = TRUE) :
  input string 5107 is invalid UTF-8

I've identified the data frames that cause the issue, but I am not sure how to identify which part of a data frame is causing the error.

I even went ahead and applied the enc2utf8() function to every column of this data frame, but I still encounter the error. I've also used the substr() function on the data frame itself, which shows me the first n characters of each column, but I do not see any obvious issues in the output.

I also used install.packages() to re-download the openxlsx package, in case there had been any updates.

Does anyone know how I would go about identifying the cause of the error? Is it the function as it is written in the package? If the problem is in the encoding of the data itself, does enc2utf8() not suffice to resolve the issue?

Thanks!

I just had this same problem. Building on this question, you could strip every character outside the printable ASCII range from the data frame with either:

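library(dplyr)
# '[^ -~]' matches anything outside space..tilde (printable ASCII),
# so legitimate accented letters are removed too
df <- df %>%
  mutate_if(is.character, ~gsub('[^ -~]', '', .))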

for only character columns, or:

# gsub() returns character vectors, so every column becomes character
df <- df %>%
  mutate_all(~gsub('[^ -~]', '', .))

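for all columns, and then export to XLSX with write.xlsx(). Note that the second form converts numeric and other non-character columns to character, because gsub() always returns character vectors.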
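
If stripping everything outside ASCII is too aggressive, a gentler option is to check and re-encode with base R. This is only a sketch under assumptions: the values in x are made up, and Option A assumes the data was originally Latin-1, which the question does not confirm. It may also hint at why enc2utf8() did not help: as far as I understand, it converts according to each string's declared encoding and does not repair bytes that are already invalid.

```r
# Hypothetical values: "\xe9" is "é" in Latin-1, but on its own it is
# not a valid UTF-8 byte sequence.
x <- c("plain", "caf\xe9")

# Flag which elements are valid UTF-8 (base R, no packages needed)
valid <- validUTF8(x)

# Option A (assumption: the source data was Latin-1): re-encode to UTF-8,
# which keeps the accented letter
recoded <- iconv(x, from = "latin1", to = "UTF-8")

# Option B: keep valid UTF-8 as it is and drop only the offending bytes
dropped <- iconv(x, from = "UTF-8", to = "UTF-8", sub = "")

# Both results should now pass the same check
all(validUTF8(recoded))
all(validUTF8(dropped))
```

Either option can be applied to the character columns of the data frame before calling write.xlsx(). The exact handling of invalid bytes by iconv() can vary between platforms, so it is worth re-running validUTF8() on the result.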

As for finding the error, the number in the message points you to the problem (5107 in your case). It appears to count the strings that are written to the file. To find the particular data point that's the issue, this approach worked for me:

Let's assume our data frame has 20 variables and 10 of them are character type.

  • If you are writing the column headers, subtract the number of variables (because all of the headers are strings): 5107 - 20 = 5087.
  • Divide the remainder by the number of character variables per observation (5087 / 10 = 508.7). That means the problem is in row 509, because the headers and the first 508 rows account for 20 + 5080 = 5100 strings.
  • The remaining 5107 - 5100 = 7 strings mean that the 7th character variable in the 509th row will be your problem child.
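
The arithmetic above can be sketched as a small helper, followed by a direct scan that does not rely on the counting assumption. Both locate_string() and the toy data frame are hypothetical names made up for illustration. The helper assumes, as described above, that strings are counted header first and then row by row across the character columns.

```r
# Hypothetical helper: map the string number from the error message to a
# row number and a position among the character columns.
locate_string <- function(index, n_cols, n_chr_cols, has_header = TRUE) {
  # Header cells are strings too, so remove them from the count first
  offset <- if (has_header) index - n_cols else index
  row <- ceiling(offset / n_chr_cols)
  chr_col <- offset - (row - 1) * n_chr_cols
  c(row = row, chr_col = chr_col)
}

# The numbers from this answer: string 5107, 20 variables, 10 of them character
loc <- locate_string(5107, n_cols = 20, n_chr_cols = 10)

# Cross-check without any counting: scan each character column directly.
# "toy" is made-up data with one deliberately invalid byte.
toy <- data.frame(
  id = 1:3,
  note = c("ok", "bad\xff", "fine"),
  stringsAsFactors = FALSE
)
chr_cols <- names(toy)[vapply(toy, is.character, logical(1))]

# For each character column, the row numbers holding invalid UTF-8
bad_rows <- lapply(toy[chr_cols], function(x) which(!validUTF8(x)))
bad_rows
```

With the numbers from the answer, loc holds row 509 and character column 7. The direct scan is probably the safer of the two, since it reports the offending rows per column regardless of how the writer counts strings.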
