
Removing Unicode characters from a data frame in R

I have a large data frame that contains Farsi characters. I import it into R with this code:

Sys.setlocale(locale = "persian")
dt <- read.csv("data.csv",encoding="UTF-8")

and my dt looks like this:

id         title
3376971    چوب شور آلبينا شيرين عسل <U+06F3><U+06F0> گرمي
3376989    ويفر رنگارنگ مينو <U+06F1><U+06F4>.<U+06F5> گرمي
3376990    کوکي مينو <U+06F3><U+06F0> گرمي
3376991    بيسکويت هاي باي شيرين عسل <U+06F3><U+06F8> گرمي
3376992    شکلات توريستي آناتا <U+06F2><U+06F8> گرمي
3376993    اسنک مغزدار شکلاتي شونيز <U+06F1><U+06F7> گرمي
3376994    شکلات فندقي آناتا <U+06F1><U+06F8> گرمي
3376995    نان روغني شيرين عسل <U+06F5><U+06F0> گرمي
3376996    بيسکويت هاي باي شيرين عسل <U+06F5><U+06F7> گرمي

There are some Unicode code points (shown as <U+....>) that I'd like to remove. I have tried:

dt <- cbind.data.frame(dt$id, gsub("<.+>", "", dt$title))
dt <- cbind.data.frame(dt$id, gsub("\\S+\\s+|-", "", dt$title))
dt <- cbind.data.frame(dt$id, gsub("^\\s*<U\\+\\w+>\\s*", "", dt$title))
dt <- cbind.data.frame(dt$id, gsub("\\<U[^\\>]*\\>", "", dt$title))
dt <- cbind.data.frame(dt$id, gsub("▼|▲", "", dt$title))

but none of them work.

I also tried this:

dt$title <- gsub("^\\s*<U\\+\\w+>\\s*", "", dt$title)

but I got this error:

Error in `$<-.data.frame`(`*tmp*`, title, value = character(0)) : 
replacement has 0 rows, data has 66366
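
One possible explanation for this error, assuming the cbind.data.frame lines above had already been run: cbind.data.frame names the new columns after the expressions passed to it, so the result no longer has a column called title. dt$title is then NULL, gsub returns character(0), and the assignment fails. A minimal sketch with a small hypothetical data frame (the sample values are made up for illustration):

```r
# Hypothetical two-row stand-in for the real data.
dt <- data.frame(id = c(1, 2),
                 title = c("a <x>", "b"),
                 stringsAsFactors = FALSE)

# Rebuilding the data frame this way loses the original column names.
dt2 <- cbind.data.frame(dt$id, gsub("<.+>", "", dt$title))
print(names(dt2))              # names come from the expressions, not "id"/"title"

# The title column no longer exists, so gsub receives NULL.
missing_title <- is.null(dt2$title)
empty_result <- gsub("^\\s*<U\\+\\w+>\\s*", "", dt2$title)
print(length(empty_result))    # zero-length result, hence "replacement has 0 rows"

# Assigning back into the existing column keeps the names intact.
dt$title <- gsub("<.+>", "", dt$title)
```

Assigning to dt$title directly on a freshly imported data frame avoids the renaming altogether.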

I noticed that in the R console my data is displayed like this:

چوب شور آلبینا شیرین عسل ۳۰ گرمی

so the Unicode codes are actually displayed as Persian digits. I then tried this, and it worked:

dt$title <- gsub("[۰-۹]+", "", dt$title)
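
This suggests that the <U+06F3> sequences are only how the characters are printed in some views, not literal text in the strings, which would explain why the patterns containing angle brackets never matched. Below is a minimal sketch of the same idea written with Unicode escapes (U+06F0 to U+06F9 are the Extended Arabic-Indic digits), so the pattern does not depend on the encoding of the script file. The helper name remove_persian_digits and the two sample rows are illustrative assumptions, and the optional decimal part is an addition to handle values such as ۱۴.۵:

```r
# Hypothetical sample rows modelled on the data above; digits written as escapes.
dt <- data.frame(
  id = c(3376990, 3376989),
  title = c("کوکي مينو \u06F3\u06F0 گرمي",
            "ويفر رنگارنگ مينو \u06F1\u06F4.\u06F5 گرمي"),
  stringsAsFactors = FALSE
)

# Illustrative helper: drop runs of Persian digits (with an optional decimal
# part), then collapse the whitespace left behind.
remove_persian_digits <- function(x) {
  x <- gsub("[\u06F0-\u06F9]+(\\.[\u06F0-\u06F9]+)?", "", x)
  trimws(gsub("\\s+", " ", x))
}

dt$title <- remove_persian_digits(dt$title)
print(dt$title)
```

If the digits should be kept rather than removed, base R's chartr could map them to ASCII digits instead.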

