
Removing backslash (escape character) from a string

I am trying to work on my own JSON parser. I have an input string that I want to tokenize:

input = "{ \"foo\": \"bar\", \"num\": 3}"

How do I remove the escape character \ so that it is not a part of my tokens?

Currently, my solution using delete works:

tokens = input.delete('\\"').split("")

=> ["{", " ", "f", "o", "o", ":", " ", "b", "a", "r", ",", " ", "n", "u", "m", ":", " ", "3", "}"]

However, when I try to use gsub, it fails to find any \".

tokens = input.gsub('\\"', '').split("")

=> ["{", " ", "\"", "f", "o", "o", "\"", ":", " ", "\"", "b", "a", "r", "\"", ",", " ", "\"", "n", "u", "m", "\"", ":", " ", "3", "}"]

I have two questions:

1. Why does gsub not work in this case?

2. How do I remove the backslash (escape) character? I currently have to remove the backslash character with the quotes to make this work.

When you write:

input = "{ \"foo\": \"bar\", \"num\": 3}"

The actual string stored in input is:

{ "foo": "bar", "num": 3}

The escape \" here is interpreted by the Ruby parser, so that it can distinguish between the boundaries of the string literal (the leftmost and rightmost ") and an ordinary " character inside the string (the escaped ones).

String#delete treats its first parameter as a set of characters, not as a pattern. Every character in that set is removed from the string. So by writing

input.delete('\\"')

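you get a string with every character in that set removed from input, not a string with every \" sequence removed. (Strictly, a backslash inside a delete argument just escapes the character after it, so the set here selects only the " character.) That is not what you want here, and it may cause unexpected behavior later.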

String#gsub, however, substitutes a pattern (either a regular expression or a plain string).

input.gsub('\\"', '')

means: find every \" (two characters in sequence) and replace it with an empty string. Since there is no \ in input, nothing gets replaced. What you need is actually:

input.gsub('"', '')
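The difference between the two methods can be checked with a short script. This is only a sketch: the `escaped` string is an illustrative value that is not part of the question, added so that there is one string that really contains backslashes, and `'":'` is an arbitrary character set chosen to show that delete removes each listed character independently.

```ruby
# The string from the question: it contains quotes but no backslashes.
input = "{ \"foo\": \"bar\", \"num\": 3}"

# Illustrative string (not from the question) that really contains
# backslashes: single quotes keep the \ before each ".
escaped = '{ \"foo\": \"bar\" }'

# delete works on a set of characters: every " and every : is removed.
set_removed = input.delete('":')

# gsub with a String pattern looks for that exact sequence of characters.
unchanged = input.gsub('\\"', '')    # input has no \" sequence, so no change
stripped  = escaped.gsub('\\"', '')  # escaped does contain \", so it matches

# Removing only the quotes, which is what the tokenizer needs.
no_quotes = input.gsub('"', '')

puts set_removed  # { foo bar, num 3}
puts unchanged    # { "foo": "bar", "num": 3}
puts stripped     # { foo: bar }
puts no_quotes    # { foo: bar, num: 3}
```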

You do not have backslashes in your string. You have quotes in your string, which need to be escaped when placed in a double-quoted string. Look:

input = "{ \"foo\": \"bar\", \"num\": 3}"
puts input
# => { "foo": "bar", "num": 3}

You are removing phantoms.

input.delete('\\"')

will delete every character selected by its argument. There are no backslashes in the string to delete, so what goes is every quote. Without quotes, the default display method (inspect) does not need to escape anything.

input.gsub('\\"', '')

will try to delete the sequence \", which does not exist, so gsub ends up doing nothing.

Make sure you know the difference between a string's representation (puts input.inspect) and its content (puts input), and note that the backslashes are artifacts of the representation.
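To see the two side by side, something like the following should do. The variable names `content` and `representation` are just labels chosen for this sketch.

```ruby
input = "{ \"foo\": \"bar\", \"num\": 3}"

content        = input          # the characters actually stored
representation = input.inspect  # a Ruby literal that would recreate them

puts content         # { "foo": "bar", "num": 3}
puts representation  # "{ \"foo\": \"bar\", \"num\": 3}"

# The backslashes exist only in the representation.
puts content.include?("\\")         # false
puts representation.include?("\\")  # true
```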

That said, I have to echo emaillenin: writing a correct JSON parser is not simple, and you can't do it with regular expressions (or at least, not with regular regular expressions; it might be possible with Oniguruma). It needs a proper parser like treetop or rex/racc, since it has a lot of corner cases that are easy to miss (chief among them being, ironically, escaped characters).
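For the tokenizing step alone (not the parsing), a rough sketch using StringScanner from Ruby's standard library might look like this. The `tokenize` method and its token format are made up for illustration, and it deliberately covers only punctuation, strings and integers; a real JSON tokenizer also needs floats, exponents, true/false/null, and decoding of escapes such as \n and \uXXXX.

```ruby
require 'strscan'

# Hypothetical helper: splits JSON-like text into [type, value] pairs.
# Incomplete on purpose; see the caveats above.
def tokenize(text)
  scanner = StringScanner.new(text)
  tokens = []
  until scanner.eos?
    if scanner.skip(/\s+/)
      next # whitespace between tokens is dropped
    elsif (punct = scanner.scan(/[{}\[\]:,]/))
      tokens << [:punct, punct]
    elsif (str = scanner.scan(/"(?:[^"\\]|\\.)*"/))
      # Keeps the raw text between the quotes; escapes are not decoded.
      tokens << [:string, str[1..-2]]
    elsif (num = scanner.scan(/-?\d+/))
      tokens << [:number, num.to_i]
    else
      raise ArgumentError, "unexpected character at position #{scanner.pos}"
    end
  end
  tokens
end

input = "{ \"foo\": \"bar\", \"num\": 3}"
p tokenize(input)
```

Because a string is consumed as one token, a quote that really is escaped inside the JSON text (a literal backslash followed by ") stays inside that token instead of ending it.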

Use a regex pattern:

input = "{ \"foo\": \"bar\", \"num\": 3}"
input.gsub(/"/,'').split("")

=> ["{", " ", "f", "o", "o", ":", " ", "b", "a", "r", ",", " ", "n", "u", "m", ":", " ", "3", "}"]

The string actually contains only a double quote. The backslash is there just to escape it inside the string literal.

input.gsub(/[\\"]/,"") also works.
