I am using Nokogiri to screen-scrape some HTML. In some cases I get weird characters back. I have tracked down the byte values of these characters with the following code:
@parser.leads[0].phone_numbers[0].each_byte do |c|
  puts "char=#{c}"
end
The characters in question have byte values of 194 and 160.
I want to somehow strip these characters out while parsing.
I have tried the following code, but it does not work:
@parser.leads[0].phone_numbers[0].gsub(/160.chr/,'').gsub(/194.chr/,'')
Can anyone tell me how to achieve this?
I found this question while trying to strip out invisible characters when "trimming" a string. s.strip did not work for me, and I found that the invisible character had the ord number 194.
None of the methods above worked for me, but then I found the question "Convert non-breaking spaces to spaces in Ruby", which says:
Use /\u00a0/ to match non-breaking spaces: s.gsub(/\u00a0/, ' ') converts all non-breaking spaces to regular spaces.
Use /[[:space:]]/ to match all whitespace, including Unicode whitespace like non-breaking spaces. This is unlike /\s/, which matches only ASCII whitespace.
So glad I found that! Now I'm using:
s.gsub(/[[:space:]]/,'')
This doesn't answer the question of how to gsub specific character codes, but if you're just trying to remove whitespace it seems to work pretty well.
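To make the difference between these patterns concrete, here is a minimal sketch. It assumes the scraped string is UTF-8; the phone value is a made-up sample, not data from the question.

```ruby
# U+00A0 (non-breaking space) is encoded in UTF-8 as the two bytes 194 and 160.
nbsp  = "\u00A0"
phone = "555#{nbsp}1234#{nbsp}"            # hypothetical scraped value

nbsp_bytes  = nbsp.bytes                   # the byte values each_byte would print
stripped    = phone.strip                  # strip only knows ASCII whitespace
ascii_match = phone =~ /\s/                # \s is ASCII-only in Ruby regexps

with_spaces = phone.gsub(/\u00A0/, ' ')    # NBSP -> regular space
no_spaces   = phone.gsub(/[[:space:]]/, '') # remove all whitespace, Unicode included

puts nbsp_bytes.inspect
puts no_spaces
```

Note that /[[:space:]]/ also removes ordinary spaces, tabs, and newlines, so the /\u00A0/ form is the safer choice when only the non-breaking space should go.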
Your problem is that you want to do a method call but instead you're creating a Regexp. You're searching and replacing strings consisting of the string "160" followed by any character and then the string "chr", and then doing the same except with "160" replaced with "194".
Instead, do gsub(160.chr, '').
Update (2018): This code does not work in current Ruby versions. Please refer to other answers.
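For current Ruby versions, a variant of the same idea might look like the sketch below. It assumes the input is a UTF-8 string, in which case 194 and 160 are the two bytes of a single character (U+00A0) rather than two separate characters. The sample value is hypothetical.

```ruby
# Build the character from its code point, with an explicit encoding.
nbsp = 160.chr(Encoding::UTF_8)      # equivalent to "\u00A0"
raw  = "555\u00A01234"               # hypothetical scraped value

nbsp_bytes = nbsp.bytes              # the two bytes seen via each_byte
via_gsub   = raw.gsub(nbsp, '')      # plain string argument, no Regexp needed
via_delete = raw.delete(nbsp)        # String#delete works on characters too

puts via_gsub
```

Passing the encoding to chr matters here: the character is matched as a whole, so there is no need to remove byte 194 and byte 160 separately.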
You can also try
s.gsub(/\xA0|\xC2/, '')
or
s.delete 160.chr+194.chr
My first thought: should you be using gsub! instead of gsub?
gsub returns a new string, whereas gsub! performs the substitution in place.
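A small sketch of that difference, using a made-up sample string (the .dup is only there so the literal is safely mutable):

```ruby
number = "555\u00A01234".dup               # hypothetical value

cleaned     = number.gsub("\u00A0", '')    # returns a new string
still_dirty = number.include?("\u00A0")    # the receiver is unchanged

number.gsub!("\u00A0", '')                 # modifies number itself
second_call = number.gsub!("\u00A0", '')   # nil when nothing was replaced

puts cleaned
puts number
```

Because gsub! returns nil when no substitution happens, chaining calls on its result is risky; either use gsub and keep the return value, or call gsub! on its own line.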
I was getting an "invalid multibyte escape" error while trying the above solution, but in a different situation. Google was returning \xA0 when the number was greater than 999, and I wanted to remove it. So I used return_value.gsub(/[\xA0]/n,"") instead, and it worked perfectly fine for me.