
Match a double-quoted string with double quotes inside

I have this python string:

string = '"/dev/null" "" "19/1333329478.9381399" 0 1 "cam-foo" 64 900.0 "Foo x rev scan of test" "/usr/bin/env ""PATH=/bin:/usr/bin:/sbin:/usr/sbin:/usr/local/bin:$PATH"" python app.py" 60.145855 2.034689'

I need a regex that gives me a list of every element in this string. An element is either a number or a string enclosed in double quotes. A quoted string can itself contain double quotes.

I've come up with this regex:

import re    
p = re.compile(r'"[^"]*"|[-\.\d]+')
p.findall(string)
['"/dev/null"', '""', '"19/1333329478.9381399"', '0', '1', '"cam-foo"', '64', '900.0', '"Foo x rev scan of test"', '"/usr/bin/env "', '"PATH=/bin:/usr/bin:/sbin:/usr/sbin:/usr/local/bin:$PATH"', '" python app.py"', '60.145855', '2.034689']

As you can see, an element that contains double quotes gets split apart. A double quote inside an element should not end it. I'd like to get this result:

['"/dev/null"', '""', '"19/1333329478.9381399"', '0', '1', '"cam-foo"', '64', '900.0', '"Foo x rev scan of test"', '"/usr/bin/env ""PATH=/bin:/usr/bin:/sbin:/usr/sbin:/usr/local/bin:$PATH"" python app.py"', '60.145855', '2.034689']

Instead of getting three (or more) elements

[..., '"/usr/bin/env "', '"PATH=/bin:/usr/bin:/sbin:/usr/sbin:/usr/local/bin:$PATH"', '" python app.py"', ...]

I'd like to have only one element:

'"/usr/bin/env ""PATH=/bin:/usr/bin:/sbin:/usr/sbin:/usr/local/bin:$PATH"" python app.py"'

Can anyone help me?

The first half of your regular expression currently matches a pair of double quotes surrounding zero or more non-double-quote characters.

r'"[^"]*"'

You can achieve your desired result by changing which strings you match inside the surrounding double quotes.

r'"(?:[^"]|"")*"'

This regular expression matches a pair of double quotes that surround zero or more items; each item is either one non-double-quote character or two consecutive double quotes. (The ?: marks the parenthesized part as a non-capturing group; with a capturing group, re.findall would return the group's contents instead of the whole match.)

Let's plug that into your complete regex:

% python
Python 2.7.2 (default, Mar 20 2012, 13:27:18) 
[GCC 4.2.1 Compatible Apple Clang 3.1 (tags/Apple/clang-318.0.54)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> s = '"/dev/null" "" "19/1333329478.9381399" 0 1 "cam-foo" 64 900.0 "Foo x rev scan of test" "/usr/bin/env ""PATH=/bin:/usr/bin:/sbin:/usr/sbin:/usr/local/bin:$PATH"" python app.py" 60.145855 2.034689'
>>> for el in re.findall(r'"(?:[^"]|"")*"|[-\.\d]+', s): print(el)
... 
"/dev/null"
""
"19/1333329478.9381399"
0
1
"cam-foo"
64
900.0
"Foo x rev scan of test"
"/usr/bin/env ""PATH=/bin:/usr/bin:/sbin:/usr/sbin:/usr/local/bin:$PATH"" python app.py"
60.145855
2.034689
>>>
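If you also want the values without their surrounding quotes, the matches can be post-processed. The following is only a sketch under the assumption that a doubled quote inside a field stands for one literal quote; the helper name tokenize and the shortened sample string are made up for illustration.

```python
import re

# Same pattern as above: a quoted field (where "" stands for an
# embedded quote) or a run of digits, dots and minus signs.
TOKEN = re.compile(r'"(?:[^"]|"")*"|[-.\d]+')

def tokenize(line):
    """Return the fields of line, unwrapping quoted fields."""
    fields = []
    for tok in TOKEN.findall(line):
        if tok.startswith('"'):
            # Drop the outer quotes, then collapse "" into a single ".
            tok = tok[1:-1].replace('""', '"')
        fields.append(tok)
    return fields

# Shortened version of the string from the question.
s = '"/dev/null" "" 0 900.0 "/usr/bin/env ""PATH=/bin:$PATH"" python app.py" 2.034689'
print(tokenize(s))
```

After unwrapping, a quoted "64" and a bare 64 look the same, so keep the raw matches if that distinction matters.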

You could use the csv module.

Example

>>> import csv
>>> from pprint import pprint
>>> pprint(list(csv.reader([string], delimiter=' ', quotechar='"')))
[['/dev/null',
  '',
  '19/1333329478.9381399',
  '0',
  '1',
  'cam-foo',
  '64',
  '900.0',
  'Foo x rev scan of test',
  '/usr/bin/env "PATH=/bin:/usr/bin:/sbin:/usr/sbin:/usr/local/bin:$PATH" python app.py',
  '60.145855',
  '2.034689']]
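The session above relies on the variable string from the question. A self-contained version might look like the sketch below; the sample string is shortened for illustration, and doublequote=True (already the default) is written out only to show which setting makes "" inside a quoted field read as one quote character.

```python
import csv

# Shortened version of the string from the question.
s = '"/dev/null" "" 0 900.0 "/usr/bin/env ""PATH=/bin:$PATH"" python app.py" 2.034689'

# csv.reader accepts any iterable of lines, so a one-element list works.
# Fields are separated by single spaces and quoted with double quotes.
reader = csv.reader([s], delimiter=' ', quotechar='"', doublequote=True)
row = next(reader)
print(row)
```

Unlike the regex, csv returns every field as a plain string without the surrounding quotes, and the doubled quotes come back as single ones.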

If all you need is to be able to split this exact case, you can use shlex.split():

>>> import shlex
>>> s = '"/dev/null" "" "19/1333329478.9381399" 0 1 "cam-foo" 64 900.0 "Foo x rev scan of test" "/usr/bin/env ""PATH=/bin:/usr/bin:/sbin:/usr/sbin:/usr/local/bin:$PATH"" python app.py" 60.145855 2.034689'
>>> shlex.split(s)
['/dev/null', '', '19/1333329478.9381399', '0', '1', 'cam-foo', '64', '900.0', 'Foo x rev scan of test', '/usr/bin/env PATH=/bin:/usr/bin:/sbin:/usr/sbin:/usr/local/bin:$PATH python app.py', '60.145855', '2.034689']
>>> shlex.split(s)[-3]
'/usr/bin/env PATH=/bin:/usr/bin:/sbin:/usr/sbin:/usr/local/bin:$PATH python app.py'

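It's not regex, but it splits this input into the right elements every time. Note that shlex follows shell quoting rules, so adjacent quoted parts are joined and the inner doubled quotes are dropped rather than kept, as the output above shows.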

Enclose the part of the pattern you want in (). re.findall will then return the captured groups for each match instead of the whole match, and you can pick the element you need. E.g.:

m = p.findall(string)

This returns a list in m. If the pattern contains one group, each element is that group's text; with several groups, each element is a tuple holding one string per group. This way you can retrieve exactly the part of the match that you want.
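To make the effect of groups on re.findall concrete, here is a small sketch. The sample string and the two-group variant of the pattern are illustrative assumptions, not part of the original question.

```python
import re

s = '"cam-foo" 64 "say ""hi"""'

# No capturing groups: each result is the whole match.
whole = re.findall(r'"(?:[^"]|"")*"|[-.\d]+', s)

# Two capturing groups: each result is a tuple (quoted_body, number).
# The group that did not take part in a match is an empty string.
pairs = re.findall(r'"((?:[^"]|"")*)"|([-.\d]+)', s)

print(whole)
print(pairs)
```

The tuple form can be handy for telling quoted fields from numbers, but an empty quoted field and a non-participating group both show up as '', so check the second item rather than the first.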
