
gsub ASCII code characters from a string in ruby

I am using Nokogiri to screen-scrape some HTML. In some cases I get weird characters back. I tracked down the ASCII codes for these characters with the following code:

  @parser.leads[0].phone_numbers[0].each_byte  do |c|
    puts "char=#{c}"
  end

The characters in question have an ASCII code of 194 and 160.

I want to somehow strip these characters out while parsing.

I have tried the following code but it does not work.

@parser.leads[0].phone_numbers[0].gsub(/160.chr/,'').gsub(/194.chr/,'')

Can anyone tell me how to achieve this?

I found this question while trying to strip out invisible characters when "trimming" a string.

s.strip did not work for me and I found that the invisible character had the ord number 194

None of the methods above worked for me, but then I found the question "Convert non-breaking spaces to spaces in Ruby", which says:

Use /\u00a0/ to match non-breaking spaces: s.gsub(/\u00a0/, ' ') converts all non-breaking spaces to regular spaces.

Use /[[:space:]]/ to match all whitespace, including Unicode whitespace like non-breaking spaces. This is unlike /\s/, which matches only ASCII whitespace.

So glad I found that! Now I'm using:

s.gsub(/[[:space:]]/,'')

This doesn't answer the question of how to gsub specific character codes, but if you're just trying to remove whitespace it seems to work pretty well.
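For anyone who wants to see the two patterns side by side, here is a minimal sketch. The sample string and the helper names (sample_phone, nbsp_to_space, strip_all_space) are made up for illustration, and it assumes the scraped string is UTF-8. It also shows that the bytes 194 and 160 from the question are together the UTF-8 encoding of a single non-breaking space (U+00A0).

```ruby
# Hypothetical sample: a phone number followed by a non-breaking space (U+00A0).
def sample_phone
  "555-1234\u00A0"
end

# Replace only non-breaking spaces with regular spaces.
def nbsp_to_space(str)
  str.gsub(/\u00A0/, ' ')
end

# Remove every whitespace character, Unicode whitespace included.
def strip_all_space(str)
  str.gsub(/[[:space:]]/, '')
end

p sample_phone.bytes.last(2)     # => [194, 160]  (two bytes ...)
p sample_phone.codepoints.last   # => 160         (... but one character)
p nbsp_to_space(sample_phone)    # => "555-1234 "
p strip_all_space(sample_phone)  # => "555-1234"
```

Note that the second helper also removes ordinary spaces inside the string, so it only fits when no whitespace should survive.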

Your problem is that you want to do a method call but instead you're creating a Regexp. You're searching and replacing strings consisting of the string "160" followed by any character and then the string "chr", and then doing the same except with "160" replaced with "194".

Instead, do gsub(160.chr, '').
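A small sketch of the difference between the regexp literal and a real method call. The helper names (literal_pattern, strip_codepoint) are hypothetical. One assumption to flag: the sketch passes an explicit encoding to chr, because on a UTF-8 string the bare 160.chr yields a single binary byte rather than the non-breaking space character.

```ruby
# /160.chr/ is a regexp literal: "160", any one character, then "chr".
def literal_pattern
  /160.chr/
end

# Assumption: the input is UTF-8, so the code point is turned into a
# character with an explicit encoding before being passed to gsub.
def strip_codepoint(str, codepoint)
  str.gsub(codepoint.chr(Encoding::UTF_8), '')
end

p literal_pattern.match?("160xchr")         # => true
p literal_pattern.match?("555-1234\u00A0")  # => false
p strip_codepoint("555-1234\u00A0", 160)    # => "555-1234"
```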

Update (2018): This code does not work in current Ruby versions. Please refer to other answers.

You can also try

s.gsub(/\xA0|\xC2/, '')

or

s.delete 160.chr+194.chr

My first thought: should you be using gsub! instead of gsub?

gsub returns a new string, while gsub! performs the substitution in place.
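To illustrate, a quick sketch with made-up helper names (demo_gsub, demo_gsub_bang) and a made-up sample string:

```ruby
# gsub leaves the receiver alone and returns a modified copy.
def demo_gsub
  original = "555-1234\u00A0".dup
  cleaned  = original.gsub("\u00A0", '')
  [original, cleaned]
end

# gsub! modifies the receiver itself.
def demo_gsub_bang
  s = "555-1234\u00A0".dup
  s.gsub!("\u00A0", '')
  s
end

p demo_gsub                   # => ["555-1234 ", "555-1234"]  (first element still ends in U+00A0)
p demo_gsub_bang              # => "555-1234"
p "abc".dup.gsub!("x", '')    # => nil
```

The last line is the usual catch: gsub! returns nil when nothing was replaced, so don't chain on its return value.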

I was getting an "invalid multibyte escape" error while trying the above solution, but in a different situation. Google was returning \xA0 when the number was greater than 999, and I wanted to remove it. So I used return_value.gsub(/[\xA0]/n,"") instead, and it worked perfectly fine for me.
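If you want to go the byte-level route on a UTF-8 string, one way it might look is sketched below. This is not the exact code from the answer above: strip_nbsp_bytes is a hypothetical helper, and it assumes the input is valid UTF-8 in which the non-breaking space appears as the byte pair 0xC2 0xA0. The string is viewed as binary, matched with an /n (binary) regexp, and then tagged with its original encoding again.

```ruby
# Hypothetical helper: remove the UTF-8 byte pair C2 A0 at the byte level.
def strip_nbsp_bytes(str)
  str.b                                # binary (ASCII-8BIT) copy of the string
     .gsub(/\xC2\xA0/n, '')            # /n makes the regexp binary too
     .force_encoding(str.encoding)     # restore the original encoding tag
end

p strip_nbsp_bytes("1\u00A0000")  # => "1000"
```

Matching both bytes together avoids leaving half of a multibyte character behind.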
