
How to split a string by a string except when the string is in quotes in python?
I want to split the following string on the word 'and', except when the word 'and' is inside quotes.
string = "section_category_name = 'computer and equipment expense' and date >= 2015-01-01 and date <= 2015-03-31"
Desired result:
["section_category_name = 'computer and equipment expense'","date >= 2015-01-01","date <= 2015-03-31"]
I can't seem to find the right regex pattern to split the string so that 'computer and equipment expense' is not split.
Here is what I tried:
re.split('and',string)
Result:
[" section_category_name = 'computer "," equipment expense' ",' date >= 2015-01-01 ',' date <= 2015-03-31']
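As you can see, the result splits 'computer and equipment expense' into separate items in the list.
I also tried the following from this question: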
r = re.compile('(?! )[^[]+?(?= *\[)'
'|'
'\[.+?\]')
r.findall(s)
Result:
[]
I also tried the following from another question:
result = re.split(r"and+(?=[^()]*(?:\(|$))", string)
Result:
[" section_category_name = 'computer ",
" equipment expense' ",
' date >= 2015-01-01 ',
' date <= 2015-03-31']
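The challenge is that previous questions on this topic don't address how to split a string on a word while leaving quoted text alone; they only cover splitting on special characters or spaces.
I can get the desired result if I modify the string to the following: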
string = " section_category_name = (computer and equipment expense) and date >= 2015-01-01 and date <= 2015-03-31"
result = re.split(r"and+(?=[^()]*(?:\(|$))", string)
Desired result:
[' section_category_name = (computer and equipment expense) ',
' date >= 2015-01-01 ',
' date <= 2015-03-31']
But I need the function to avoid splitting on 'and' between apostrophes rather than between parentheses.
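You can use re.findall to generate a list of 2-tuples, where the first element is either a quoted string or empty, and the second element is either a run of non-whitespace characters or empty.
You can then use itertools.groupby to partition by the word "and" (when it is not inside a quoted string), and then rejoin the populated elements inside a list comprehension, e.g.: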
import re
from itertools import groupby
text = "section_category_name = 'computer and equipment expense' and date >= 2015-01-01 and date <= 2015-03-31 and blah = 'ooops'"
items = [
    ' '.join(el[0] or el[1] for el in g)
    for k, g in groupby(re.findall(r"('.*?')|(\S+)", text), lambda L: L[1] == 'and')
    if not k
]
Gives you:
["section_category_name = 'computer and equipment expense'",
'date >= 2015-01-01',
'date <= 2015-03-31',
"blah = 'ooops'"]
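Note that whitespace outside the quoted strings is also normalized, which may or may not be desirable.
Also note that this allows a bit of flexibility in the grouping, so you could change lambda L: L[1] == 'and' to lambda L: L[1] in ('and', 'or') to split on different words if needed.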
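As a rough sketch of that flexibility, the same findall/groupby idea can be wrapped in a function that takes the separator words as a parameter. The function name split_outside_quotes and its separators argument are made up for illustration and are not part of the answer above; it also assumes quoted strings use single quotes and are separated from other tokens by whitespace.

```python
import re
from itertools import groupby

def split_outside_quotes(text, separators=("and",)):
    """Split text on separator words, ignoring words inside single quotes."""
    # Each match is a 2-tuple: (quoted string or '', bare token or '').
    tokens = re.findall(r"('.*?')|(\S+)", text)
    # Only a bare (unquoted) token can be a separator; quoted tokens have tok[1] == ''.
    is_separator = lambda tok: tok[1] in separators
    return [
        " ".join(quoted or bare for quoted, bare in group)
        for is_sep, group in groupby(tokens, is_separator)
        if not is_sep
    ]

text = "name = 'computer and equipment expense' and date >= 2015-01-01 or date <= 2015-03-31"
print(split_outside_quotes(text, separators=("and", "or")))
```

Keep separators a tuple (or list or set) rather than a plain string, because the empty string that marks a quoted token would otherwise count as a substring of it.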
You can use the following regex with re.findall:
((?:(?!\band\b)[^'])*(?:'[^'\\]*(?:\\.[^'\\]*)*'(?:(?!\band\b)[^'])*)*)(?:and|$)
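See the regex demo.
The regex consists of an unrolled sequence of either anything other than ' up to the first and (using the tempered greedy token (?:(?!\band\b)[^'])*) or anything between and including single apostrophes, with support for escaped entities (using '[^'\\]*(?:\\.[^'\\]*)*', which is an unrolled version of ([^'\\]|\\.)*).
Python code demo: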
import re
p = re.compile(r'((?:(?!\band\b)[^\'])*(?:\'[^\'\\]*(?:\\.[^\'\\]*)*\'(?:(?!\band\b)[^\'])*)*)(?:and|$)')
s = "section_category_name = 'computer and equipment expense' and date >= 2015-01-01 and date <= 2015-03-31"
print([x for x in p.findall(s) if x])
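To illustrate the escaped-quote support mentioned above, here is a small sketch that reuses the same pattern on a string containing a backslash-escaped apostrophe and trims the whitespace left around each piece. The helper name split_with_unrolled_regex and the sample string are illustrative assumptions, not part of the original answer.

```python
import re

# Same pattern as above, written in a double-quoted raw string to avoid escaping '.
pattern = re.compile(
    r"((?:(?!\band\b)[^'])*(?:'[^'\\]*(?:\\.[^'\\]*)*'(?:(?!\band\b)[^'])*)*)(?:and|$)"
)

def split_with_unrolled_regex(text):
    # findall returns group 1 of every match; drop empty pieces and trim spaces.
    return [part.strip() for part in pattern.findall(text) if part.strip()]

# The apostrophe in it\'s is escaped, so the quoted string does not end there.
s = r"name = 'it\'s and more' and x = 1"
print(split_with_unrolled_regex(s))
```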
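If all of your strings follow the same pattern, you can use a regex that divides the match into 3 groups. The first group runs from the start to the last '. The next group is everything between the first and last "and". The last group is the rest of the text.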
import re
string = "section_category_name = 'computer and equipment expense' and date >= 2015-01-01 and date <= 2015-03-31"
output = re.match(r"(^.+['].+['])\sand\s(.+)\sand\s(.+)", string).groups()
print(output)
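Each group is defined by a pair of parentheses in the regex. Square brackets define a specific character to match. This example only works when section_category_name is set equal to something inside single quotes: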
section_category_name = 'something here' and ...
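To make that limitation concrete, here is a small check (a sketch, with a shortened second input invented for illustration) showing that the fixed three-group pattern matches only when exactly that shape is present. With just two conditions there is no second " and " outside the quotes, so re.match returns None.

```python
import re

# The fixed three-group pattern from the example above.
pattern = re.compile(r"(^.+['].+['])\sand\s(.+)\sand\s(.+)")

three = "section_category_name = 'computer and equipment expense' and date >= 2015-01-01 and date <= 2015-03-31"
two = "section_category_name = 'computer and equipment expense' and date >= 2015-01-01"

m3 = pattern.match(three)
m2 = pattern.match(two)
print(m3.groups() if m3 else None)  # three groups
print(m2.groups() if m2 else None)  # no match for the shorter string
```

So any caller would need to test for None before calling .groups().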
The following code will work, and it doesn't need a crazy regex to do it.
import re
# We create a "lexer" using regex. This will match strings surrounded by single quotes,
# words without any whitespace in them, and the end of the string. We then use finditer()
# to grab all non-overlapping tokens.
lexer = re.compile(r"'[^']*'|[^ ]+|$")
string = "section_category_name = 'computer and equipment expense' and date >= 2015-01-01 and date <= 2015-03-31"
results = []
buff = []
# Iterate through all the tokens our lexer identified and parse accordingly
for match in lexer.finditer(string):
    token = match.group(0)  # group 0 is the entire matching string
    if token in ('and', ''):
        # Once we reach 'and' or the end of the string '' (matched by $),
        # we join all previous tokens with a space and add them to our results.
        results.append(' '.join(buff))
        buff = []  # Reset for the next set of tokens
    else:
        buff.append(token)
print(results)
Edit: Here is a more concise version that uses itertools.groupby to effectively replace the for loop in the code above.
import re
from itertools import groupby
string = "section_category_name = 'computer and equipment expense' and date >= 2015-01-01 and date <= 2015-03-31"
lexer = re.compile(r"'[^']*'|[^\s']+")
grouping = groupby(lexer.findall(string), lambda x: x == 'and')
results = [ ' '.join(g) for k, g in grouping if not k ]
print(results)
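I would simply use the fact that re.split has this feature:
If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list.
Combining this with an alternation in which only the quoted-string branch is captured returns a list of strings separated by None. This keeps the regex simple, though it needs some post-merging.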
>>> tokens = re.split(r"('[^']*')|and", string)
# ['section_category_name = ', "'computer and equipment expense'", ' ', None, ' date >= 2015-01-01 ', None, ' date <= 2015-03-31']
>>> ''.join([t if t else '\0' for t in tokens]).split('\0')
["section_category_name = 'computer and equipment expense' ", ' date >= 2015-01-01 ', ' date <= 2015-03-31']
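Note that the 0x00 char is used there as a temporary separator, so this won't work properly if you need to process strings that contain null characters.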
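If the temporary separator is a concern, one possible variation (a sketch, not the answer's own code) is to merge the tokens in a loop and treat None itself as the split marker. The function name split_with_re_split is made up for illustration. Like the pattern above, it splits on any occurrence of "and" outside quotes, including inside longer words such as "band", and it keeps the surrounding whitespace.

```python
import re

def split_with_re_split(text):
    """Split on 'and' outside single quotes without a sentinel character."""
    parts, current = [], []
    # Quoted strings are captured and returned; a bare 'and' yields None for the group.
    for token in re.split(r"('[^']*')|and", text):
        if token is None:
            # None marks a split on 'and': flush the pieces collected so far.
            parts.append("".join(current))
            current = []
        else:
            current.append(token)
    parts.append("".join(current))
    return parts

string = "section_category_name = 'computer and equipment expense' and date >= 2015-01-01 and date <= 2015-03-31"
print(split_with_re_split(string))
```

Checking for None explicitly, rather than for any falsy token, also means an empty string at the start of the list (which re.split produces when the text begins with a quoted string) is not mistaken for a split point.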
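I don't know what you want to do about the whitespace around and, or about repeated ands in the string. If your string were 'hello and and bye' or 'hello andand bye', what would you want?
I haven't tested all the corner cases, and I strip the whitespace around 'and', which may or may not be what you want: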
string = "section_category_name = 'computer and equipment expense' and date >= 2015-01-01 and date <= 2015-03-31"
res = []
spl = 'and'
for idx, sub in enumerate(string.split("'")):
    if idx % 2 == 0:
        subsub = sub.split(spl)
        for jdx in range(1, len(subsub) - 1):
            subsub[jdx] = subsub[jdx].strip()
        if len(subsub) > 1:
            subsub[0] = subsub[0].rstrip()
            subsub[-1] = subsub[-1].lstrip()
        res += [i for i in subsub if i.strip()]
    else:
        quoted_str = "'" + sub + "'"
        if res:
            res[-1] += quoted_str
        else:
            res.append(quoted_str)
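A simpler solution, if you know that and will be surrounded by a space on both sides, won't be repeated, and you don't want to strip extra whitespace: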
string = "section_category_name = 'computer and equipment expense' and date >= 2015-01-01 and date <= 2015-03-31"
spl = 'and'
res = []
spaced_spl = ' ' + spl + ' '
for idx, sub in enumerate(string.split("'")):
    if idx % 2 == 0:
        res += [i for i in sub.split(spaced_spl) if i.strip()]
    else:
        quoted_str = "'" + sub + "'"
        if res:
            res[-1] += quoted_str
        else:
            res.append(quoted_str)
Output:
["section_category_name = 'computer and equipment expense'", 'date >= 2015-01-01', 'date <= 2015-03-31']