
How to remove whitespace but not UTF-8 characters in Ruby

I want to prevent users from writing an empty comment (whitespace, &nbsp;, etc.), so I apply the following:

var.gsub(/^\s+|\s+\z|\s*&nbsp;\s*/, '')

However, a smart user then found a hole by using the \302 or \240 Unicode characters, so I filtered out these characters too.

Then I ran into a problem when I introduced support for several languages: a word like Déjà vu becomes an error, because part of the à character contains \240. Is there any way to remove the whitespace but leave the Latin characters untouched?

A way around this is to use iconv to discard the invalid Unicode byte sequences (such as \230 on its own) before using your regexp to remove the whitespace:

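require 'iconv' # standard library up to Ruby 1.9; a separate gem since Ruby 2.0

var1 = "Déjà vu" # valid UTF-8
var2 = "\240"    # a lone continuation byte, invalid on its own

# Converting UTF-8 to UTF-8 with //IGNORE drops byte sequences that are not valid UTF-8.
ic = Iconv.new('UTF-8//IGNORE', 'UTF-8')
valid1 = ic.iconv(var1) # => "D\303\251j\303\240 vu"
valid2 = ic.iconv(var2) # => ""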
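
On newer Rubies, where Iconv is no longer bundled, the same idea can probably be sketched without it. The following is only an illustration, assuming Ruby 2.1 or later (for String#scrub) and UTF-8 input; the helper name blank_comment? is made up for this example and does not come from the original post. It drops invalid bytes first, then relies on the POSIX bracket class [[:space:]], which, unlike \s, also matches the non-breaking space U+00A0 (bytes \302\240) in UTF-8 strings.

```ruby
# Hypothetical helper (not from the original post): true when a comment has
# no visible content once invalid bytes and whitespace are removed.
def blank_comment?(text)
  # Treat the bytes as UTF-8, then drop sequences that are not valid UTF-8,
  # for example a lone "\xA0" (written "\240" in octal).
  cleaned = text.dup.force_encoding(Encoding::UTF_8).scrub('')
  # [[:space:]] is Unicode-aware for UTF-8 strings, so it also matches
  # U+00A0; the literal "&nbsp;" entity is removed as well.
  cleaned.gsub(/[[:space:]]+|&nbsp;/, '').empty?
end

blank_comment?("\u00A0 &nbsp;\t") # => true
blank_comment?("\xA0")            # => true
blank_comment?("Déjà vu")         # => false
```

Because the regexp works on whole characters rather than single bytes, the \240 byte inside à (\303\240) is never touched on its own.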

