How to remove all non-ASCII characters from a string in Ruby

Question

This seems to be a very simple and much-needed method. I need to remove all non-ASCII characters from a string, e.g. © etc. See the following example.

#coding: utf-8
s = " Hello this a mixed string © that I made."
puts s.encoding
puts s.encode

Output:

UTF-8
Hello this a mixed string ┬⌐ that I made.

When I feed this to Watir, it produces the following error: incompatible character encodings: UTF-8 and ASCII-8BIT

So my problem is that I want to get rid of all non-ASCII characters before using the string. I will not know which encoding the source string "s" uses.

I have been searching and experimenting for quite some time now.

If I try to use

  puts s.encode('ASCII-8BIT')

it gives the error:

 : "\xC2\xA9" from UTF-8 to ASCII-8BIT (Encoding::UndefinedConversionError)

Answer 1

You can just literally translate what you asked into a Regexp. You wrote:

I want to get rid of all non ASCII characters

We can rephrase that a little bit:

I want to substitute all characters which don't have the ASCII property with nothing

And that's a statement that can be directly expressed in a Regexp:

s.gsub!(/\P{ASCII}/, '')

As an alternative, you could also use String#delete!:

s.delete!("^\u{0000}-\u{007F}")
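To see both calls side by side, here is a minimal sketch. The helper name strip_non_ascii and the sample string are assumptions for illustration, not part of the original answer, and the input is assumed to be validly encoded UTF-8. It uses the non-bang variants because gsub! and delete! return nil when nothing was changed, which is easy to trip over when chaining calls.

```ruby
# Hypothetical helper (not from the original answer): returns a copy of
# str with every character lacking the ASCII property removed.
def strip_non_ascii(str)
  str.gsub(/\P{ASCII}/, '')
end

# "\u00A9" is the copyright sign from the question's example.
sample = "Hello this a mixed string \u00A9 that I made."

puts strip_non_ascii(sample)

# Same idea with String#delete and a negated character range.
puts sample.delete("^\u{0000}-\u{007F}")
```

Both lines print the sentence without the copyright sign; the original string is left untouched.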

Answer 2

Strip out the characters using a regex. This example is in C#, but the regex should be the same: How can you strip non-ASCII characters from a string? (in C#)

Translating it into Ruby using gsub should not be difficult.
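A possible translation, as a sketch only: it assumes the C# answer relies on a negated character class covering the 7-bit range, and the constant name NON_ASCII and the sample text are made up for illustration.

```ruby
# Negated character class: matches any character outside 0x00-0x7F.
NON_ASCII = /[^\x00-\x7F]/

# "\u00E9" is e-acute and "\u00A9" is the copyright sign.
text = "caf\u00E9 \u00A9 2010"

# gsub returns a new string with every match replaced by nothing.
puts text.gsub(NON_ASCII, '')
```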

Answer 3

UTF-8 is a variable-length encoding. When a character occupies one byte, its value coincides with 7-bit ASCII. So why don't you just look for bytes with a '1' in the MSB, and then remove both them and their trailing bytes? A byte beginning with '110' will be followed by one additional byte. A byte beginning with '1110' will be followed by two. And a byte beginning with '11110' will be followed by three, the maximum supported by UTF-8.

This is all just off the top of my head. I could be wrong.
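The byte-level idea can be sketched as follows. In UTF-8 the continuation bytes also have their most significant bit set (they begin with '10'), so there is no need to count trailers: dropping every byte of 0x80 or above removes lead bytes and trailers together. The helper name drop_high_bytes is hypothetical, and the approach only makes sense for ASCII-compatible encodings such as UTF-8.

```ruby
# Hypothetical helper: keeps only bytes whose most significant bit is 0,
# i.e. plain 7-bit ASCII, and discards everything else.
def drop_high_bytes(str)
  ascii_bytes = str.bytes.select { |b| b < 0x80 }
  # pack('C*') rebuilds a binary string from the byte values; the result
  # is pure ASCII, so tagging it as US-ASCII is safe.
  ascii_bytes.pack('C*').force_encoding('US-ASCII')
end

puts drop_high_bytes("na\u00EFve \u00A9 text")
```

Because this works on raw bytes, it does not raise on invalid byte sequences, which may help when the source encoding is uncertain.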

3 answers

Answer 1
36 2010-07-08 09:07:12

Answer 2
2 2010-07-08 04:13:05

Answer 3
1 2010-07-08 04:10:57