
Randomly generate a valid Unicode character in Ruby

How can I generate a random Unicode string consisting of a given number of Unicode characters in Ruby?

The following works, but the result includes control characters (0x00-0x1F, etc.), for instance:

20.times.map{ Random.rand(0xFFFF).chr('UTF-8')}.join

A lot of the characters in that range are not printable (as you've noted), or they are surrogates, private-use ("custom") code points, or otherwise invalid characters. The best approach I can think of is to generate a sequence of characters, test each one to make sure it's valid and printable, and then take the first 20 that pass.
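To make the two failure modes concrete, here is a minimal sketch. The `encodable?` helper is a hypothetical name introduced only for this illustration. It relies on `Integer#chr` raising `RangeError` for surrogate code points, while control characters encode fine but fail the `[[:print:]]` test.

```ruby
# Hypothetical helper: true if the code point can be encoded as UTF-8.
def encodable?(codepoint)
  codepoint.chr('UTF-8')
  true
rescue RangeError
  # Raised for surrogates (U+D800..U+DFFF) and values above U+10FFFF
  false
end

encodable?(0x41)   # => true  ("A")
encodable?(0xD800) # => false (a surrogate)

# A control character is valid UTF-8 but not printable.
printable = /[[:print:]]/
0x41.chr('UTF-8').match?(printable) # => true
0x07.chr('UTF-8').match?(printable) # => false (BEL)
```

So a complete filter needs both checks: rescue the encoding error, then test the resulting character for printability.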

A few notes. We want rand(0x10000) in this case, not rand(0xFFFF), because Random#rand and Kernel#rand return a number strictly less than their argument, and you want to include U+FFFF in your sampling. We should also give ourselves some flexibility to generate one-byte, two-byte, three-byte, or four-byte UTF-8.
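As a quick check of both points, the sketch below samples `rand(0x10000)` and prints the encoded size of the largest code point for each UTF-8 length. The variable names are only illustrative.

```ruby
# rand(n) with an Integer argument returns a value in 0...n (n itself excluded),
# so the largest value rand(0x10000) can return is 0xFFFF.
samples = Array.new(1_000) { rand(0x10000) }
in_range = samples.all? { |n| (0...0x10000).cover?(n) }
puts "all samples below 0x10000: #{in_range}"

# Largest code point that fits in one, two, three, and four UTF-8 bytes.
[0x7F, 0x7FF, 0xFFFF, 0x10FFFF].each do |codepoint|
  bytes = codepoint.chr('UTF-8').bytesize
  puts format('U+%04X -> %d byte(s)', codepoint, bytes)
end
```

These boundaries are where exclusive upper limits such as 0x80, 0x800, 0x10000, and 0x110000 come from.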

Let's start with a basic sequence generator, called an Enumerator in Ruby. This object yields values one at a time and can represent a finite or infinite sequence. In this case, we want to enumerate an infinite sequence of random UTF-8 characters of up to three bytes each, skipping invalid characters as we go.

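random_utf8 = Enumerator.new do |yielder|
  loop do
    # Integer#chr raises RangeError for code points UTF-8 cannot encode (surrogates)
    yielder << rand(0x10000).chr('UTF-8')
  rescue RangeError
    # invalid code point: skip it and try again
  end
end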

You can pull values off the Enumerator with #next to see it in action:

irb(main):007:0> random_utf8.next
=> "\u9FEB"
irb(main):008:0> random_utf8.next
=> "槇"
irb(main):009:0> random_utf8.next
=> "엛"

(You'll notice that one of them didn't "render" because it's not a printable character. This illustrates why we need to filter the values before selecting 20 of them.)

Now we can take characters off this sequence and check each one to see if it's printable. The only catch is that we want to do this lazily, so that we don't try to check every character in the infinite sequence (which is impossible) before moving on to the next step in the chain. Finally, we'll take the first 20 printable characters and join them together into a string.

random_utf8
  .lazy
  .grep(/[[:print:]]/) # or [[:alpha:]] or \p{L} or whatever test you want here
  .first(20)
  .join # => "醸긍ᅋꝇ꼏捁㨃농鳹䝛ㆅ⇂擒璝缀챼砶"
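To see why `.lazy` matters, here is a small deterministic sketch on infinite sequences that have nothing to do with Unicode. The sequences are just examples chosen so the output is predictable.

```ruby
naturals = (1..Float::INFINITY)

# Eager: naturals.select(&:even?) would try to walk the whole infinite
# range and never return. Lazy: elements are pulled only as first(3)
# asks for them, so the chain stops after three matches.
evens = naturals.lazy.select(&:even?).first(3)
p evens # => [2, 4, 6]

# The same idea with a regexp filter on an endless stream of letters.
letters = ('a'..'e').cycle # Enumerator that repeats a..e forever
vowels = letters.lazy.grep(/[aeiou]/).first(4)
p vowels # => ["a", "e", "a", "e"]
```

The random character chain works the same way: `grep` only examines as many characters as `first(20)` needs.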

Now let's abstract this into a method so we can parameterize some things. Ruby gives us a neat way to return an Enumerator from a method that yields values: when no block is given, return the result of Object#enum_for (aka Object#to_enum), called with the method symbol and any other arguments passed to the method.

def random_utf8(mb=3)
  return enum_for(__callee__, mb) unless block_given?

  # determine the maximum codepoint based on the number of UTF-8 bytes
  max = [0x80, 0x800, 0x10000, 0x110000][mb.pred]

  loop do
    yield rand(max).chr('UTF-8') # note the `yield` here
  rescue RangeError
  end
end

We can use this method exactly the same way we used our Enumerator above, optionally passing in the number of UTF-8 bytes desired.
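For instance, the following sketch repeats the method definition so that it runs on its own, then chains the same lazy filter off it for two different byte widths. The variable names `one_byte` and `wide` are illustrative, and the block-level `rescue` assumes Ruby 2.5 or newer.

```ruby
def random_utf8(mb = 3)
  return enum_for(__callee__, mb) unless block_given?

  # exclusive upper bound on the code point for 1..4 UTF-8 bytes
  max = [0x80, 0x800, 0x10000, 0x110000][mb.pred]

  loop do
    yield rand(max).chr('UTF-8')
  rescue RangeError
    # surrogate code point: skip it and try again
  end
end

# Printable ASCII only (one-byte UTF-8)
one_byte = random_utf8(1).lazy.grep(/[[:print:]]/).first(10).join

# Anything up to four bytes, i.e. the whole Unicode range
wide = random_utf8(4).lazy.grep(/[[:print:]]/).first(10).join

puts one_byte
puts wide
```

Note that the argument is an upper limit on the encoded size, not an exact size: `random_utf8(4)` can still yield one-byte characters.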

This approach also gives us the option to call our method with a block instead of chaining operations off it:

random_utf8(2) do |char|
  next unless char.match?(/[[:print:]]/)

  puts "Got >#{char}<!"

  break # don't loop infinitely
end

Admittedly, that is not very useful in this particular case.

One additional note about the implementation of this solution: you could easily move the printable check into the method body, or move the RangeError exception handling out of the method body. You can also have the method return a lazy Enumerator by default. It's really up to you to design the method around your application's requirements.

def lazy_printable_random_utf8(mb=3)
  return enum_for(__callee__, mb).lazy unless block_given?

  # determine the maximum codepoint based on the number of UTF-8 bytes
  max = [0x80, 0x800, 0x10000, 0x110000][mb.pred]

  loop do
    char = rand(max).chr('UTF-8')

    yield char if char.match?(/[[:print:]]/)
  rescue RangeError
  end
end
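Tying this back to the original question, a string of a given length could then be built with a thin wrapper. This is only a sketch: `random_unicode_string` is a hypothetical helper name, and the method above is repeated so the example runs on its own.

```ruby
def lazy_printable_random_utf8(mb = 3)
  return enum_for(__callee__, mb).lazy unless block_given?

  # exclusive upper bound on the code point for 1..4 UTF-8 bytes
  max = [0x80, 0x800, 0x10000, 0x110000][mb.pred]

  loop do
    char = rand(max).chr('UTF-8')

    yield char if char.match?(/[[:print:]]/)
  rescue RangeError
    # surrogate code point: skip it and try again
  end
end

# Hypothetical helper: a random string of `length` printable characters.
def random_unicode_string(length, mb = 3)
  lazy_printable_random_utf8(mb).first(length).join
end

puts random_unicode_string(20)
puts random_unicode_string(8, 1) # printable ASCII only
```

Because the filtering already happens inside the method, the caller only has to say how many characters it wants.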
