Remove accents from a dataframe column in R
I have a data.table called base, with a terme column in it:
class(base$terme)
[1] character
length(base$terme)
[1] 27486
I'm able to remove accents from a single string, and from a vector of strings:
iconv("Millésime",to="ASCII//TRANSLIT")
[1] "Millesime"
iconv(c("Millésime","boulangère"),to="ASCII//TRANSLIT")
[1] "Millesime" "boulangere"
But for some reason, it does not work when I apply the very same function to my terme column:
base$terme[2]
[1] "Millésime"
iconv(base$terme[2],to="ASCII//TRANSLIT")
[1] "MillACsime"
Does anybody know what is going on here?
OK, here is the way to solve the problem: check the declared encoding of the column and pass it to iconv explicitly through the from argument.
Encoding(base$terme[2])
[1] "UTF-8"
iconv(base$terme[2],from="UTF-8",to="ASCII//TRANSLIT")
[1] "Millesime"
Thanks to @nicola.
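For context on why the fix works: when from is left at its default (""), iconv assumes the input is in the current locale's encoding, which can differ from the encoding the strings are actually marked with. Below is a minimal sketch of a wrapper that normalises to UTF-8 before transliterating. The name strip_accents_iconv is hypothetical (not from the original post), and it assumes the strings carry a correct encoding mark.

```r
# Hypothetical helper (not from the original post): convert to UTF-8 first,
# then transliterate with an explicit `from`, instead of relying on the locale.
strip_accents_iconv <- function(x) {
  x <- enc2utf8(as.character(x))
  iconv(x, from = "UTF-8", to = "ASCII//TRANSLIT")
}

# \u escapes are used so the example does not depend on the file's encoding;
# such strings are marked as UTF-8 by R.
x <- c("Mill\u00e9sime", "boulang\u00e8re")
Encoding(x)             # inspect the declared encoding before converting
strip_accents_iconv(x)  # the exact transliteration depends on the platform's iconv
```

The output is deliberately not shown, since ASCII//TRANSLIT is implemented by the system's iconv library and may differ between platforms.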
It might be easier to use the stringi package. This way, you don't need to check the encoding beforehand. Furthermore, stringi is consistent across operating systems and iconv is not.
library(data.table)
library(stringi)
base <- data.table(terme = c("Millésime",
                             "boulangère",
                             "üéâäàåçêëèïîì"))
base[, terme := stri_trans_general(str = terme,
                                   id = "Latin-ASCII")]
> base
terme
1: Millesime
2: boulangere
3: ueaaaaceeeiii
You can apply this function:
rm_accent <- function(str, pattern = "all") {
  # Coerce non-character input (e.g. factors) to character
  if (!is.character(str))
    str <- as.character(str)
  pattern <- unique(pattern)
  # Treat an upper-case cedilla request like the lower-case one
  if (any(pattern == "Ç"))
    pattern[pattern == "Ç"] <- "ç"
  # Accented characters, grouped by accent type
  symbols <- c(
    acute = "áéíóúÁÉÍÓÚýÝ",
    grave = "àèìòùÀÈÌÒÙ",
    circunflex = "âêîôûÂÊÎÔÛ",
    tilde = "ãõÃÕñÑ",
    umlaut = "äëïöüÄËÏÖÜÿ",
    cedil = "çÇ"
  )
  # Unaccented replacements, matched position by position
  nudeSymbols <- c(
    acute = "aeiouAEIOUyY",
    grave = "aeiouAEIOU",
    circunflex = "aeiouAEIOU",
    tilde = "aoAOnN",
    umlaut = "aeiouAEIOUy",
    cedil = "cC"
  )
  accentTypes <- c("´", "`", "^", "~", "¨", "ç")
  # Option: remove all accents at once
  if (any(c("all", "al", "a", "todos", "t", "to", "tod", "todo") %in% pattern))
    return(chartr(paste(symbols, collapse = ""), paste(nudeSymbols, collapse = ""), str))
  # Otherwise remove only the requested accent types
  for (i in which(accentTypes %in% pattern))
    str <- chartr(symbols[i], nudeSymbols[i], str)
  return(str)
}
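The core of the function above is a single chartr() call, which maps each character of one string to the character at the same position in another. Here is a stripped-down sketch of that idea. The name strip_accents_chartr and the reduced lower-case character table are illustrative assumptions, not part of the original answer, and \u escapes are used so the snippet does not depend on the source file's encoding.

```r
# Reduced, lower-case-only table for illustration (an assumption of this sketch):
# acute a e i o u, grave a e i o u, circumflex a e i o u, then c cedilla.
accented <- paste0(
  "\u00e1\u00e9\u00ed\u00f3\u00fa",
  "\u00e0\u00e8\u00ec\u00f2\u00f9",
  "\u00e2\u00ea\u00ee\u00f4\u00fb",
  "\u00e7"
)
plain <- "aeiouaeiouaeiouc"  # same length as `accented`, position by position

# Hypothetical helper: replace each accented character with its plain counterpart.
strip_accents_chartr <- function(x) {
  chartr(accented, plain, as.character(x))
}

strip_accents_chartr(c("Mill\u00e9sime", "boulang\u00e8re"))
```

Characters that are missing from the table are left untouched, which is the main limitation of this approach compared with iconv or stringi.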
Three ways to remove accents, shown and compared to each other below.
The data to play with:
library(data.table)
dtCases <- fread("https://github.com/ishaberry/Covid19Canada/raw/master/cases.csv", stringsAsFactors = FALSE)
dim(dtCases) # 751526 16
Benchmarking:
> system.time(dtCases [, city0 := health_region])
user system elapsed
0.009 0.001 0.012
> system.time(dtCases [, city1 := base::iconv (health_region, to="ASCII//TRANSLIT")]) # or ... iconv (health_region, from="UTF-8", to="ASCII//TRANSLIT")
user system elapsed
0.165 0.001 0.200
> system.time(dtCases [, city2 := textclean::replace_non_ascii (health_region)])
user system elapsed
9.108 0.063 9.351
> system.time(dtCases [, city3 := stringi::stri_trans_general (health_region,id = "Latin-ASCII")])
user system elapsed
4.34 0.00 4.46
Result:
> dtCases[city0!=city1, city0:city3] %>% unique
                           city0                         city1                         city2                         city3
                          <char>                        <char>                        <char>                        <char>
1:                      Montréal                      Montreal                      Montreal                      Montreal
2:                    Montérégie                    Monteregie                    Monteregie                    Monteregie
3:          Chaudière-Appalaches          Chaudiere-Appalaches          Chaudiere-Appalaches          Chaudiere-Appalaches
4:                    Lanaudière                    Lanaudiere                    Lanaudiere                    Lanaudiere
5:                Nord-du-Québec                Nord-du-Quebec                Nord-du-Quebec                Nord-du-Quebec
6:         Abitibi-Témiscamingue         Abitibi-Temiscamingue         Abitibi-Temiscamingue         Abitibi-Temiscamingue
7: Gaspésie-Îles-de-la-Madeleine Gaspesie-Iles-de-la-Madeleine Gaspesie-Iles-de-la-Madeleine Gaspesie-Iles-de-la-Madeleine
8:                     Côte-Nord                     Cote-Nord                     Cote-Nord                     Cote-Nord
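Conclusion:
In this benchmark, base::iconv() is the fastest (about 0.2 s elapsed, against roughly 4.5 s for stringi and 9.4 s for textclean), and it is the method I prefer. All three methods gave the same result here. Tested on French words only; not tested on other languages.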
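Since the comparison above was only checked on French words, it may be worth verifying the result on your own data before trusting any of the methods. A minimal base R sketch follows. The helper name has_non_ascii is hypothetical, and it assumes the strings are valid and correctly marked.

```r
# Hypothetical helper: TRUE for elements that are NA or still contain
# a character outside the ASCII range after conversion.
has_non_ascii <- function(x) {
  vapply(
    as.character(x),
    function(s) is.na(s) || any(utf8ToInt(enc2utf8(s)) > 127L),
    logical(1),
    USE.NAMES = FALSE
  )
}

converted <- c("Montreal", "Montr\u00e9al", NA)
has_non_ascii(converted)         # flags the second and third elements
converted[has_non_ascii(converted)]  # the values that need a closer look
```

With the benchmark data, this could be applied to each of the city1, city2 and city3 columns to list any values a given method failed to convert.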
Here is a version of Jeldrik's solution revised for data frames. Note that the := operator comes from data.table and is not available for plain data frames in base R, so a regular assignment is used instead.
library(stringi)
base <- data.frame(terme = c("Millésime",
                             "boulangère",
                             "üéâäàåçêëèïîì"))
base$terme <- stri_trans_general(str = base$terme, id = "Latin-ASCII")
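If the data frame has several text columns, the same assignment can be applied to all of them in one go. The sketch below is one possible way to do that in base R. The helper strip_char_columns and the demo data are hypothetical, and the conversion function is passed in as an argument so that any of the methods from this thread can be plugged in.

```r
# Hypothetical helper: apply a conversion function `f` to every
# character column of a data frame and leave the other columns alone.
strip_char_columns <- function(df, f) {
  is_chr <- vapply(df, is.character, logical(1))
  df[is_chr] <- lapply(df[is_chr], f)
  df
}

# toupper stands in for the accent-removal function to keep the demo base-only.
demo <- data.frame(terme = c("abc", "def"), n = 1:2, stringsAsFactors = FALSE)
out <- strip_char_columns(demo, toupper)
out
```

With stringi installed, f could be something like function(x) stringi::stri_trans_general(x, id = "Latin-ASCII").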