
Regex: Given a string, find substring in double quotes and substring not in double quotes

For example:

If the string is '"normal" script', the output should show that the substring normal is in double quotes whereas the substring script is not.

To extract the double-quoted substrings from a string, I tried the regex:

r'"([^"]*)"'

We can use the split() method to get the substrings that are not in double quotes, but I'm looking for an efficient approach.

Below is the code I've tried; it returns a list of the substrings that are double quoted.

import re
def demo(text):
    matches = re.findall(r'"([^"]*)"', text)
    return matches

a = demo('"normal" string "is here"')
print(a)

Apart from finding double quoted substrings I'm also looking for substrings which are not double quoted.

For example, output of demo('"normal" string "is here"') should be:

double quoted: ['normal', 'is here']

non double quoted: ['string']

You can search for both quoted and unquoted strings in the same regular expression.

import re

def dequote(s):
    return re.findall(r'(?:"([^"]*)")|([^"]*)', s)

print(dequote('"normal" script'))
print(dequote('another "normal" script with "extra words in it"'))

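Notice that the returned list of tuples contains both quoted and non-quoted strings. The quoted strings are in the first element of each tuple, the non-quoted strings are in the second element. The group that did not take part in a match comes back as an empty string, and the final ('', '') is an empty match at the end of the input.

If you want the lists separate, it is a simple matter to separate them.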

result = dequote('another "normal" script with "extra words in it"')

result_quoted = [t[0].strip() for t in result if t[0]]
result_unquoted = [t[1].strip() for t in result if t[1]]

print("double quoted: {}\nnot double quoted: {}".format(
    result_quoted, result_unquoted))

The output of the entire program:

$ python x.py
[('normal', ''), ('', ' script'), ('', '')]
[('', 'another '), ('normal', ''), ('', ' script with '), ('extra words in it', ''), ('', '')]
double quoted: ['normal', 'extra words in it']
not double quoted: ['another', 'script with']

Note that you imply that a re-based solution will go faster than one based on str.split(). I'm not convinced of that. Consider these two solutions:

def dequote_re(s):
    result = re.findall(r'(?:"([^"]*)")|([^"]*)', s)
    result_quoted = [t[0].strip() for t in result if t[0]]
    result_unquoted = [t[1].strip() for t in result if t[1]]
    return result_quoted, result_unquoted

def dequote_split(s):
    result = s.split('"')
    result_unquoted = [item.strip() for item in result[0::2] if item]
    result_quoted = [item.strip() for item in result[1::2] if item]
    return result_quoted, result_unquoted

They give the same answers for input whose double quotes are balanced (they can differ when a quote is left unclosed). Perhaps you should run timeit to find which is faster for you.
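A minimal sketch of such a timeit comparison follows. The sample string and the call count are arbitrary choices for illustration, not from the question, and the timings will vary by machine and Python version, so no numbers are claimed here.

```python
import re
import timeit

def dequote_re(s):
    result = re.findall(r'(?:"([^"]*)")|([^"]*)', s)
    result_quoted = [t[0].strip() for t in result if t[0]]
    result_unquoted = [t[1].strip() for t in result if t[1]]
    return result_quoted, result_unquoted

def dequote_split(s):
    result = s.split('"')
    result_unquoted = [item.strip() for item in result[0::2] if item]
    result_quoted = [item.strip() for item in result[1::2] if item]
    return result_quoted, result_unquoted

# Assumed sample input; substitute a string typical of your own data.
sample = 'another "normal" script with "extra words in it"'

# Check that both functions agree on this input before timing them.
assert dequote_re(sample) == dequote_split(sample)

# Time each function over the same number of calls (10,000 is arbitrary).
for func in (dequote_re, dequote_split):
    seconds = timeit.timeit(lambda: func(sample), number=10_000)
    print("{}: {:.4f} s for 10,000 calls".format(func.__name__, seconds))
```

Measuring on strings of realistic length matters, since the relative cost of the two approaches may change with input size.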

With the third-party regex module (installed separately from the standard re module):

>>> import re, regex
>>> s='"normal" string "is here"'

>>> re.findall(r'"([^"]*)"', s)
['normal', 'is here']

# change \w to appropriate character class as needed
>>> regex.findall(r'"[^"]*"(*SKIP)(*F)|\w+', s)
['string']

# or a workaround, remove double quoted strings first
>>> re.findall(r'\w+', re.sub(r'"([^"]*)"', '', s))
['string']

See "Using (*SKIP)(*FAIL) to Exclude Unwanted Matches" for a detailed explanation. To put it simply, append (*SKIP)(*F) to the pattern you want to exclude, and use alternation to define the patterns you do want.
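If installing the regex module is not an option, a similar effect can be approximated with the standard re module: match the quoted spans in one branch of an alternation without capturing them, and capture only what you want in the other branch. This is a sketch of that idea, not an equivalent of (*SKIP)(*F); the function name unquoted_words is made up for illustration, and \w+ is the same placeholder character class as above.

```python
import re

def unquoted_words(text):
    # Left branch: consume a whole double-quoted span, capturing nothing.
    # Right branch: capture a word that lies outside any quoted span.
    captures = re.findall(r'"[^"]*"|(\w+)', text)
    # Quoted spans show up as empty strings in the result; drop them.
    return [word for word in captures if word]

print(unquoted_words('"normal" string "is here"'))  # ['string']
```

Because the quoted branch comes first and consumes the whole span, words inside the quotes are never offered to the capturing branch.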

I know that split() is the fastest and replace() is faster than regex, so:

output = '"normal" script'.replace('"', '').split()

Output: ['normal', 'script']

Execution time: 3.490e-05 seconds. Using regex you get times between 0.2e-04 and 0.3e-04 seconds.

If you have quite a big string, you may use a regex to locate the occurrences and break the string into smaller pieces (depending on what you expect to get and from where).

It seems your substrings are words. For the double-quoted or non-double-quoted strings, you can split by substrings and iterate over the result as a list.

Splitting by double-quoted or non-double-quoted substrings may require creating two lists.

Splitting by words, you can create a single list of words and check for the double quotation when outputting it.

Both cases may cost almost the same, depending on the size of the string you get.
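One way to read the single-list idea is sketched below: keep every token in one list, in order, together with a flag saying whether it was double quoted, and decide what to do with the flag only when printing. The function name tag_tokens and the (text, is_quoted) tuple layout are assumptions made for this illustration.

```python
import re

def tag_tokens(text):
    # Each match is either a double-quoted span (group 1) or a run of
    # characters that are neither quotes nor whitespace (group 2).
    tokens = []
    for m in re.finditer(r'"([^"]*)"|([^"\s]+)', text):
        if m.group(1) is not None:
            tokens.append((m.group(1), True))
        else:
            tokens.append((m.group(2), False))
    return tokens

for token, is_quoted in tag_tokens('"normal" string "is here"'):
    label = "double quoted" if is_quoted else "not double quoted"
    print("{}: {}".format(label, token))
```

The two separate lists can still be derived from this single list with a pair of list comprehensions if they are needed later.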

I recommend using https://regexr.com to try out patterns and capture as many of the pieces of the string you need to process as you can.

My Best.

