
R - gsub to remove punctuation & numbers from string

I'm trying to remove punctuation and digits from <U+200B>Chandler so that it becomes Chandler. This is what I'm currently trying:

df$city <- gsub("[[:punct:]]|[[:digit:]]", "", df$city)

However, this doesn't change the cell in the 'city' column of 'df'. When I run typeof(df), I get 'list'. Could that have something to do with it?

Any help would be greatly appreciated.

Second question first: typeof() will always return list for a data frame, because data frames are really just lists of equal-length vectors.
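To see this for yourself, here is a minimal sketch using a made-up data frame (the contents of df are an assumption for illustration, not your actual data). It shows that typeof() reports the underlying storage, while class() and the column's own type tell you what you are really working with:

```r
# Hypothetical data frame, for illustration only
df <- data.frame(city = c("Chandler", "Mesa"), stringsAsFactors = FALSE)

typeof(df)       # "list": the underlying storage type
class(df)        # "data.frame": the class R dispatches on
typeof(df$city)  # "character": the column itself is an ordinary vector
```

So the 'list' result is expected and has nothing to do with gsub() failing; df$city is a plain character vector that gsub() handles fine.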

As for the first question, it appears your data contains non-ASCII Unicode characters: <U+200B> is a zero-width space. One good way to take care of these is to convert the text to ASCII, perhaps like:

df$city <- iconv(df$city, 'utf-8', 'ascii', sub = '')

It is also possible to gsub out characters by their Unicode code point, like this:

df$city <- gsub('\u200B', '', df$city)

or even a range:

df$city <- gsub('[\u2000-\u20ff]', '', df$city)

But really I think the iconv approach is the way to go. In this usage it just removes the character rather than rendering it, but that seems to be what you want.
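If you want to check what is actually in a value before stripping anything, something like the following may help. It is only a sketch: the sample string is constructed by hand to mimic your example, and the results shown assume a typical iconv implementation.

```r
# Hypothetical sample: a zero-width space (U+200B) in front of the name
city <- "\u200BChandler"

nchar(city)         # 9, one more than the 8 visible letters
utf8ToInt(city)[1]  # 8203, i.e. 0x200B

# Remove only that one code point
by_code <- gsub("\u200B", "", city, fixed = TRUE)

# Drop everything that has no ASCII equivalent
by_iconv <- iconv(city, "UTF-8", "ASCII", sub = "")

by_code   # "Chandler"
by_iconv  # "Chandler"

# Caveat: iconv() also strips legitimate non-ASCII letters
iconv("Z\u00fcrich", "UTF-8", "ASCII", sub = "")  # "Zrich"
```

The last line is the trade-off to keep in mind: iconv to ASCII is the bluntest option, so if your city names can contain accented letters, removing specific code points with gsub is probably the safer choice.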
