Removing non-breaking space characters in R

I have a data frame with several columns and more than 50K observations. Let's call it df1. One of the variables is PLATES (denoted here as "y"), which contains the plate numbers of buses in a city. I want to match this data frame with another one (df2), which also has plate data, and keep only the matching records. While looking at the data in df1, which comes from a CSV file, I realized that for y, several observations have symbols before the plate number that correspond to a non-breaking space. How do I get rid of these so that they are not an issue when I do the matching? Here's some code to help illustrate. Let's say you have 5 plate numbers:

y <- c("0740170", "0740111", "0740119", "0740115", "0740048") # quoted so the leading zeros are kept

But upon further inspection:

view(y)

You see the following:

<c2><a0>0740170
<c2><a0>0740111
<c2><a0>0740119
<c2><a0>0740115
<c2><a0>0740048

I tried this, from this post https://blog.tonytsai.name/blog/2017-12-04-detecting-non-breaking-space-in-r/ , but it didn't work:

y <- gsub("\u00A0", " ", y, fixed = TRUE)

I would greatly appreciate your help on how to deal with this issue. Thanks!
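Before removing anything, it can help to confirm which entries actually contain a non-breaking space. The following is a minimal sketch, assuming the strings are UTF-8 encoded and the stray character is U+00A0; the sample vector is hypothetical and stands in for the PLATES column.

```r
# Hypothetical sample: two plates carry a leading non-breaking space (U+00A0)
y <- c(paste0("\u00a0", "0740170"), "0740111", paste0("\u00a0", "0740119"))

# Flag the entries that contain a non-breaking space
has_nbsp <- grepl("\u00a0", y, fixed = TRUE)
print(has_nbsp)

# Code point of the first character of the first entry (160 is U+00A0)
first_cp <- utf8ToInt(substr(y[1], 1, 1))
print(first_cp)
```

If `has_nbsp` is FALSE everywhere even though the printed values still show `<c2><a0>`, the data may have been read with a different encoding than assumed here.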

I'm not quite sure this will help, as I can't test my answer (I can't recreate your problem). But if the non-breaking space characters are also non-ASCII characters, then the solution would be this:

y <- gsub("[^ -~]+", "", y)

The pattern matches any run of characters outside the printable ASCII range (space through tilde), and the replacement substitutes an empty string, which removes them. Hope this helps.
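As a rough illustration of the pattern above, here is a sketch applied to plate-like strings. The sample values are hypothetical, and it assumes the leading character is U+00A0 in UTF-8 encoded strings.

```r
# Hypothetical plates; the first two start with a non-breaking space (U+00A0)
y <- c(paste0("\u00a0", "0740170"), paste0("\u00a0", "0740111"), "0740119")

# Drop every character outside the printable ASCII range (space to tilde)
clean <- gsub("[^ -~]+", "", y)
print(clean)

# Number of characters removed from each entry
removed <- nchar(y) - nchar(clean)
print(removed)
```

Note that this also strips any other non-ASCII character, which is acceptable for purely numeric plates but not for text with accented letters.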

The other answer matches any non-ASCII character, but what if you need to keep non-ASCII characters, e.g. letters with accents? In this situation I wanted to match specifically a non-breaking space of type <c2><a0>, as in the question. What worked for me was matching \xa0:

test <- "Diab\u00e8te de type\u00a0II" # nbsp between "type" and "II"
test
# [1] "Diabète de type II"
tools::showNonASCII(test) 
# 1: Diab<c3><a8>te de type<c2><a0>II

# other answer
gsub("[^ -~]+", " ", test) # has missing è
# [1] "Diab te de type II"
tools::showNonASCII(gsub("[^ -~]+", " ", test))# no output as no non-ascii chars left

gsub("\xa0+", " ", test)
# [1] "Diabète de type II"
tools::showNonASCII(gsub("\xa0+", " ", test)) # the <c2><a0> nbsp is replaced
# 1: Diab<c3><a8>te de type II

Hat tip to http://www.pmean.com/posts/non-breaking-space/
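Tying this back to the original goal of matching df1 against df2, here is a minimal sketch. It swaps in the escape "\u00a0" with fixed = TRUE instead of "\xa0", on the assumption that the strings are UTF-8 encoded. The data frames and the `route` and `depot` columns are hypothetical. It also replaces the non-breaking space with an empty string rather than a regular space, since a leading regular space would still prevent an exact match.

```r
# Hypothetical data frames; some df1 plates are prefixed with U+00A0
df1 <- data.frame(
  PLATES = c(paste0("\u00a0", "0740170"), paste0("\u00a0", "0740111"), "0740119"),
  route = c("A", "B", "C"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  PLATES = c("0740170", "0740119", "0740999"),
  depot = c("North", "South", "East"),
  stringsAsFactors = FALSE
)

# Remove only U+00A0, then trim any ordinary leading/trailing whitespace
df1$PLATES <- trimws(gsub("\u00a0", "", df1$PLATES, fixed = TRUE))

# Inner join: keeps only the plates present in both data frames
matched <- merge(df1, df2, by = "PLATES")
print(matched)
```

Accented letters and other non-ASCII characters elsewhere in the data are left untouched by this approach.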
