
Using regex to strip all characters and punctuation from a string except apostrophe

I am trying to make this method call:

alternate_words(". . . .  don’t let this stop you")

return every other word in the string, with all punctuation removed except for the apostrophe (').

This is the method definition:

def alternate_words(sentence)
  # The block must open on the same line as with_index; on its own line it is a syntax error.
  sentence.gsub(/[^a-z0-9\s']/i, "").split(" ").delete_if.with_index { |word, index| index.odd? }
end

The result is:

["dont", "this", "you"]

The correct words are returned, but the apostrophe is missing. Changing the regex to:

/[^a-z0-9\s][']/i

returns

[".", ".", "don’t", "this", "you"]

Now it correctly keeps the apostrophe, but it incorrectly includes the periods. I don't understand why.
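The behaviour described above can be checked directly. The sample string as shown here contains a typographic apostrophe (’, U+2019), not the ASCII apostrophe (', U+0027) that both character classes mention. The sketch below is only an illustration: `stripped_codepoints` is a hypothetical helper, not part of the original code, and the third pattern is one possible adjustment rather than the accepted fix.

```ruby
# Hypothetical helper: list the code points of the distinct matches that
# gsub(pattern, "") would remove from the sentence.
def stripped_codepoints(sentence, pattern)
  sentence.scan(pattern).uniq.map { |match| format("U+%04X", match.ord) }
end

# The sample sentence, with the curly apostrophe written as an escape.
sentence = ". . . .  don\u2019t let this stop you"

# First regex: U+2019 is not ', so it is stripped together with the periods.
p stripped_codepoints(sentence, /[^a-z0-9\s']/i)       # => ["U+002E", "U+2019"]

# Second regex: needs a punctuation character followed by a literal ',
# which never occurs, so nothing is stripped and the periods survive.
p stripped_codepoints(sentence, /[^a-z0-9\s][']/i)     # => []

# Possible adjustment: list both apostrophes inside the negated class.
p stripped_codepoints(sentence, /[^a-z0-9\s'\u2019]/i) # => ["U+002E"]
```

In other words, the second pattern is a two-character sequence rather than a single character class, so `gsub` leaves the string unchanged and `split` keeps each period as its own "word".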

You can instead match the words themselves, including apostrophes and hyphens, with scan:

def alternate_words(sentence)
  sentence.scan(/[[:alnum:]]+(?:[’'-][[:alnum:]]+)*/).delete_if.with_index { |_,index| 
    index.odd? 
  }
end

p alternate_words(". . . . .  don’t let this stop you")
# => ["don’t", "this", "you"]

See a Ruby demo.

The [[:alnum:]]+(?:[’'-][[:alnum:]]+)* pattern may be wrapped in word boundaries (\b) if you only want to match whole words.
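As a rough illustration of what the word boundaries change, consider text containing an underscore, which counts as a word character for \b but is not matched by [[:alnum:]]. The method names `loose_words` and `whole_words` are hypothetical and exist only for this comparison.

```ruby
# Pattern from the answer, without word boundaries.
def loose_words(text)
  text.scan(/[[:alnum:]]+(?:[’'-][[:alnum:]]+)*/)
end

# Same pattern wrapped in \b on both sides.
def whole_words(text)
  text.scan(/\b[[:alnum:]]+(?:[’'-][[:alnum:]]+)*\b/)
end

# Without \b, the pieces around the underscore are returned separately.
p loose_words("snake_case word")  # => ["snake", "case", "word"]

# With \b, "snake" and "case" are rejected because there is no word
# boundary next to the underscore.
p whole_words("snake_case word")  # => ["word"]
```

Whether the stricter behaviour is desirable depends on the input; for ordinary prose the two versions usually return the same list.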

Details:

  • [[:alnum:]]+ - 1 or more alphanumeric symbols
  • (?:[’'-][[:alnum:]]+)* - zero or more (due to *, replace with another quantifier as per requirements) occurrences of:
    • [’'-] - a typographic apostrophe, a straight apostrophe or a hyphen (the list may be adjusted)
    • [[:alnum:]]+ - 1 or more alphanumeric symbols.
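The breakdown above can be tried on a few inputs. This is a minimal sketch: `words` and `words_one_join` are hypothetical names, and the second method swaps * for ? purely to show the effect of a different quantifier.

```ruby
# Pattern from the answer: alphanumeric runs joined by ’, ' or -.
def words(text)
  text.scan(/[[:alnum:]]+(?:[’'-][[:alnum:]]+)*/)
end

# Variation with ? instead of *: at most one joined part per match.
def words_one_join(text)
  text.scan(/[[:alnum:]]+(?:[’'-][[:alnum:]]+)?/)
end

p words("mother-in-law's")          # => ["mother-in-law's"]
# A joiner only counts when alphanumerics follow it directly.
p words("rock 'n' roll")            # => ["rock", "n", "roll"]
p words("trailing- hyphen")         # => ["trailing", "hyphen"]
# With ?, the match stops after the first joined part.
p words_one_join("mother-in-law")   # => ["mother-in", "law"]
```

Because a joiner is only consumed when it sits between two alphanumeric runs, stray quotes and dangling hyphens are dropped rather than attached to a word.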
