Regex: given a string, find the substrings that are in double quotes and the substrings that are not

Question

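For example:

If the string is '"normal" script', the output should show that the substring normal is enclosed in double quotes and the substring script is not.

To pick out the double-quoted substrings of a string, I tried this regular expression:

r'"([^"]*)"'

We can use the split() method to get the substrings that are not in double quotes, but I am looking for an efficient approach.

Below is the code I tried; it returns the list of double-quoted substrings.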

import re
def demo(text):      
    matches = re.findall(r'"([^"]*)"', text)
    return matches

a = demo('"normal" string "is here"')
print(a)

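Besides finding the double-quoted substrings, I also want the substrings that are not in double quotes.

For example, demo('"normal" string "is here"') should give:

Double quoted: ['normal', 'is here']

Not double quoted: ['string']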

Answer 1

You can search for both the quoted and the unquoted strings with a single regular expression.

import re

def dequote(s):
    return re.findall(r'(?:"([^"]*)")|([^"]*)', s)

print(dequote('"normal" script'))
print(dequote('another "normal" script with "extra words in it"'))

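Note that the returned list of tuples contains both the quoted and the unquoted strings. A quoted string sits in the first element of its tuple, an unquoted string in the second.

If you want separate lists, splitting them apart is simple.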

result = dequote('another "normal" script with "extra words in it"')

result_quoted = [t[0].strip() for t in result if t[0]]
result_unquoted = [t[1].strip() for t in result if t[1]]

print("double quoted: {}\nnot double quoted{}".format(
    result_quoted, result_unquoted))

Output of the whole program:

$ python x.py 
[('normal', ''), ('', ' script'), ('', '')]
[('', 'another '), ('normal', ''), ('', ' script with '), ('extra words in it', ''), ('', '')]
double quoted: ['normal', 'extra words in it']
not double quoted['another', 'script with']

Note that the question implies an re-based solution will be faster than one based on str.split(). I am not convinced of that. Consider these two solutions:

def dequote_re(s):
    result = re.findall(r'(?:"([^"]*)")|([^"]*)', s)
    result_quoted = [t[0].strip() for t in result if t[0]]
    result_unquoted = [t[1].strip() for t in result if t[1]]
    return result_quoted, result_unquoted

def dequote_split(s):
    result = s.split('"')
    result_unquoted = [item.strip() for item in result[0::2] if item]
    result_quoted = [item.strip() for item in result[1::2] if item]
    return result_quoted, result_unquoted

They give the same answer. You should probably run timeit to find out which one is faster for your inputs.
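A minimal sketch of such a timeit comparison follows. The sample string and the repetition count are assumptions chosen for illustration; absolute timings depend on the machine, the Python version, and the shape of the real input, so treat the numbers as relative only.

```python
import re
import timeit

def dequote_re(s):
    # Group 1 holds quoted text, group 2 holds unquoted text.
    result = re.findall(r'(?:"([^"]*)")|([^"]*)', s)
    result_quoted = [t[0].strip() for t in result if t[0]]
    result_unquoted = [t[1].strip() for t in result if t[1]]
    return result_quoted, result_unquoted

def dequote_split(s):
    # After splitting on '"', odd indices are inside quotes, even ones outside.
    result = s.split('"')
    result_unquoted = [item.strip() for item in result[0::2] if item]
    result_quoted = [item.strip() for item in result[1::2] if item]
    return result_quoted, result_unquoted

# Assumed sample input; substitute strings that resemble your real data.
sample = 'another "normal" script with "extra words in it"'

# Sanity check before timing: both functions must agree on the sample.
assert dequote_re(sample) == dequote_split(sample)

for func in (dequote_re, dequote_split):
    seconds = timeit.timeit(lambda: func(sample), number=20_000)
    print(f"{func.__name__}: {seconds:.4f} s for 20,000 calls")
```

Both functions assume balanced double quotes; with an unmatched quote they may disagree, so check that case separately if it can occur in your data.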

Answer 2

Using the third-party regex module:

>>> import re, regex
>>> s='"normal" string "is here"'

>>> re.findall(r'"([^"]*)"', s)
['normal', 'is here']

# change \w to appropriate character class as needed
>>> regex.findall(r'"[^"]*"(*SKIP)(*F)|\w+', s)
['string']

# or a workaround, remove double quoted strings first
>>> re.findall(r'\w+', re.sub(r'"([^"]*)"', '', s))
['string']

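For a detailed explanation, see "Using (*SKIP)(*FAIL) to exclude unwanted matches". In short, append (*SKIP)(*F) to the regex you want to exclude, and use alternation to define the regex you do want.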
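If installing the regex module is not an option, a similar effect can be approximated with the standard re module: put the pattern to exclude in one branch of the alternation without a capture group, and capture only in the branch you want. The sketch below is a stand-in for (*SKIP)(*F), not the same mechanism, and the helper name unquoted_words is hypothetical.

```python
import re

def unquoted_words(s):
    # Left branch consumes a whole quoted span without capturing it;
    # right branch captures a word outside quotes into group 1.
    matches = re.findall(r'"[^"]*"|(\w+)', s)
    # A quoted span yields an empty string for group 1, so drop the empties.
    return [m for m in matches if m]

print(unquoted_words('"normal" string "is here"'))  # ['string']
```

As with the regex version, change \w to whatever character class suits your data.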

Answer 3

I know that split() is the fastest and that replace() is faster than regex, so:

output = '"normal" script'.replace('"', '').split()

Output: ['normal', 'script']

Execution time: 3.490e-05 seconds. With regex, the time is between 0.2e-04 and 0.3e-04 seconds.

Answer 4

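If you have a very large string, you can use a regex to count the occurrences and break the string into smaller pieces (depending on what you expect to get and from where).

It seems that your substrings are words. For double-quoted or non-double-quoted strings, you can split by substring and iterate over the result as a list.

Splitting into double-quoted and non-double-quoted parts may require creating two lists.

Splitting by words lets you build a single list of words and check for double quotes when you output each one.

Both approaches cost almost the same, depending on the size of the string you get.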
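One way to read the single-list idea is to keep every piece in order and tag it with whether it was quoted, then decide at output time. The sketch below is only an interpretation of that suggestion: the names TOKEN and tokenize are hypothetical, and it keeps whole quoted spans together rather than splitting them into words.

```python
import re

# Either a quoted span (group 1) or a run of non-quote characters (group 2).
TOKEN = re.compile(r'"([^"]*)"|([^"]+)')

def tokenize(s):
    """Return (text, is_quoted) pairs in their original order."""
    tokens = []
    for m in TOKEN.finditer(s):
        if m.group(1) is not None:
            tokens.append((m.group(1), True))
        else:
            text = m.group(2).strip()
            if text:  # skip whitespace-only gaps between quoted spans
                tokens.append((text, False))
    return tokens

for text, is_quoted in tokenize('"normal" string "is here"'):
    label = "double quoted" if is_quoted else "not double quoted"
    print(f"{text}: {label}")
```

Two separate lists can still be derived from the tagged list with a pair of list comprehensions if they are needed later.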

I recommend using https://regexr.com and testing against as many of the strings you may have to handle as you can.

Best regards.

4 answers

Answer 1
1 Accepted 2018-03-01 14:38:26

Answer 2
1 2018-03-01 14:43:56

Answer 3
0 2018-03-01 14:36:57

Answer 4
0 2018-03-01 14:45:22
