Python split string by spaces except when in quotes, but keep the quotes

I want to split the following string:

Quantity [*,'EXTRA 05',*]

The desired result is:

["Quantity", "[*,'EXTRA 05',*]"]

The closest I have found is shlex.split; however, it removes the internal quotes, giving the following result:

['Quantity', '[*,EXTRA 05,*]']
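
The shlex behaviour described above can be reproduced with a minimal sketch (standard library only; the variable names are just for illustration):

```python
import shlex

text = "Quantity [*,'EXTRA 05',*]"

# In its default POSIX mode, shlex.split treats the single quotes as shell
# quoting: the space inside them is protected, but the quotes are stripped.
tokens = shlex.split(text)
print(tokens)  # ['Quantity', '[*,EXTRA 05,*]']
```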

Any suggestions would be greatly appreciated.

EDIT:

This also needs to handle multiple splits, such as:

"Quantity [*,'EXTRA 05',*] [*,'EXTRA 09',*]"

To:

["Quantity", "[*,'EXTRA 05',*]", "[*,'EXTRA 09',*]"]

To process strings, the basic tool is regular expressions (module re).

Given the information you provided (which may be insufficient), the following code does the job:

import re

# Two alternatives: a run of non-bracket text, or one [...] group
r = re.compile(r'(?! )[^[]+?(?= *\[)'
               r'|'
               r'\[.+?\]')


s1 = "Quantity [*,'EXTRA 05',*] [*,'EXTRA 09',*]"
print(r.findall(s1))
print('---------------')

s2 = "'zug hug'Quantity boondoggle 'fish face monkey "\
     "dung' [*,'EXTRA 05',*] [*,'EXTRA 09',*]"
print(r.findall(s2))

Result:

['Quantity', "[*,'EXTRA 05',*]", "[*,'EXTRA 09',*]"]  
---------------
["'zug hug'Quantity boondoggle 'fish face monkey dung'", "[*,'EXTRA 05',*]", "[*,'EXTRA 09',*]"]

The regular expression pattern should be understood as follows:

'|' means OR.

So the pattern combines two partial REs:
(?! )[^[]+?(?= *\[)
and
\[.+?\]

The first partial RE:

The core is [^[]+
Brackets define a set of characters. Because the symbol ^ comes right after the opening bracket [ , the set is negated: it matches every character except the ones that follow the ^ .
Here [^[] means any character that isn't an opening bracket [ and, since a + follows this set, [^[]+ means a sequence of one or more characters, none of which is an opening bracket.

Next, there is a question mark after [^[]+ : it makes the + lazy (non-greedy), so the matched sequence stops as early as possible, that is, as soon as what follows the question mark can match.
Here, what follows the ? is (?= *\[) , a lookahead assertion made of (?=...) , which marks it as a positive lookahead, and of ' *\[' , the sequence in front of which the matched sequence must stop. ' *\[' means: zero, one or more blanks followed by an opening bracket (the backslash \ is needed to cancel the meaning of [ as the opening of a set of characters).

There's also (?! ) in front of the core. It's a negative lookahead assertion: it makes this partial RE match only sequences that do not begin with a blank, which avoids matching runs of blanks. Remove this (?! ) and you'll see the effect.

The second partial RE:

\[.+?\] means: the opening bracket character [ , then a sequence of characters matched lazily by .+? (the dot matches any character except \n ); this sequence must stop in front of the closing bracket character ] , which is the last character matched.
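
To see what each piece contributes, here is a small sketch that compares the pattern above with two variants, one without the (?! ) guard and one with a greedy .+ in the bracket part. The names full, no_guard and greedy are only for illustration:

```python
import re

s = "Quantity [*,'EXTRA 05',*] [*,'EXTRA 09',*]"

full = re.compile(r"(?! )[^[]+?(?= *\[)|\[.+?\]")
no_guard = re.compile(r"[^[]+?(?= *\[)|\[.+?\]")    # (?! ) removed
greedy = re.compile(r"(?! )[^[]+?(?= *\[)|\[.+\]")  # .+ instead of .+?

print(full.findall(s))
# ['Quantity', "[*,'EXTRA 05',*]", "[*,'EXTRA 09',*]"]

# Without the guard, each separating blank is matched as its own item.
print(no_guard.findall(s))
# ['Quantity', ' ', "[*,'EXTRA 05',*]", ' ', "[*,'EXTRA 09',*]"]

# A greedy .+ runs from the first [ to the last ], merging both groups.
print(greedy.findall(s))
# ['Quantity', "[*,'EXTRA 05',*] [*,'EXTRA 09',*]"]
```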

EDIT

import re

string = "Quantity [*,'EXTRA 05',*] [*,'EXTRA 09',*]"
# Split on each space that is immediately followed by an opening bracket
print(re.split(r' (?=\[)', string))

Result:

['Quantity', "[*,'EXTRA 05',*]", "[*,'EXTRA 09',*]"]
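
If the input may contain more than one space before a bracket, a slightly more tolerant variant can be sketched as follows. The helper name split_before_brackets is hypothetical, and the sketch still assumes that every field after the first begins with [ :

```python
import re

def split_before_brackets(text):
    """Split on runs of spaces that are immediately followed by '['."""
    return re.split(r" +(?=\[)", text)

print(split_before_brackets("Quantity [*,'EXTRA 05',*] [*,'EXTRA 09',*]"))
# ['Quantity', "[*,'EXTRA 05',*]", "[*,'EXTRA 09',*]"]

# Spaces that are not followed by '[' never split, so plain words stay together.
print(split_before_brackets("Total Quantity   [*,'EXTRA 05',*]"))
# ['Total Quantity', "[*,'EXTRA 05',*]"]
```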

Note for picky people: this algorithm WON'T correctly split every string you pass to it, only strings like:

"Quantity [*,'EXTRA 05',*] [*,'EXTRA 09',*]"

"Quantity [*,'EXTRA 05',*]"

"Quantity [*,'EXTRA 05',*] [*,'EXTRA 10',*] [*,'EXTRA 07',*] [*,'EXTRA 09',*]"

string = "Quantity [*,'EXTRA 05',*] [*,'EXTRA 09',*]"
parts = string.split(" ")
splitted_string = []

# This adds "Quantity" at position 0 of splitted_string
splitted_string.append(parts[0])

# The loop goes from 1 to the length of parts, increasing x by 2
# First iteration: x is 1 and x+1 is 2; second: x=3 and x+1=4, etc.
# The first iteration joins "[*,'EXTRA" and "05',*]" into one string
# The second iteration joins "[*,'EXTRA" and "09',*]" into one string
# Longer strings of the same shape work the same way
for x in range(1, len(parts), 2):
    splitted_string.append("%s %s" % (parts[x], parts[x + 1]))

When I execute the code, splitted_string at the end contains:

['Quantity', "[*,'EXTRA 05',*]", "[*,'EXTRA 09',*]"]
splitted_string[0] = 'Quantity'
splitted_string[1] = "[*,'EXTRA 05',*]"
splitted_string[2] = "[*,'EXTRA 09',*]"

I think that is exactly what you're looking for. If I'm wrong, or if you need some explanation of the code, let me know. I hope it helps.

Assuming you want a general solution for splitting at spaces, but not at spaces inside quotes: I don't know of any Python library that does this, but that doesn't mean there isn't one.

In the absence of a known ready-made solution, I would simply roll my own. It's relatively easy to scan a string looking for spaces and then use Python slicing to divide the string into the parts you want. To ignore spaces in quotes, you can include a flag that toggles whenever a quote symbol is encountered, switching the space detection on and off.

This is some code I knocked up to do this; it is not extensively tested:

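def spaceSplit(string):
    last = 0          # index where the current piece starts
    splits = []
    inQuote = None    # the quote character we are inside, or None
    for i, letter in enumerate(string):
        if inQuote:
            # Leave quoted mode only on the matching quote character
            if letter == inQuote:
                inQuote = None
        elif letter == '"' or letter == "'":
            inQuote = letter

        # Split only on spaces that are outside quotes
        if not inQuote and letter == ' ':
            splits.append(string[last:i])
            last = i + 1

    # Append whatever remains after the last split point
    if last < len(string):
        splits.append(string[last:])

    return splits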
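
For comparison, the same idea (split on whitespace, but treat quoted runs as part of the current piece) can also be sketched as a single regular expression. This is an alternative, not the code above: the names TOKEN and split_keep_quotes are hypothetical, runs of several spaces produce no empty pieces, and an unbalanced quote character is not handled gracefully (it is dropped and splits the piece around it).

```python
import re

# One token = one or more of: a character that is neither whitespace nor a
# quote, a complete '...' run, or a complete "..." run.
TOKEN = re.compile(r"""(?:[^\s'"]|'[^']*'|"[^"]*")+""")

def split_keep_quotes(text):
    """Return whitespace-separated pieces, keeping quoted runs intact."""
    return TOKEN.findall(text)

print(split_keep_quotes("Quantity [*,'EXTRA 05',*] [*,'EXTRA 09',*]"))
# ['Quantity', "[*,'EXTRA 05',*]", "[*,'EXTRA 09',*]"]
print(split_keep_quotes('This is "a test" to your split "string with quotes"'))
# ['This', 'is', '"a test"', 'to', 'your', 'split', '"string with quotes"']
```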

Try this:

def parseString(inputString):
    res = []
    temp = []  # words of the double-quoted phrase currently being collected
    for word in inputString.split():
        if temp:
            # Inside a quoted phrase: keep collecting until the closing quote
            temp.append(word)
            if word.endswith('"'):
                res.append(' '.join(temp))
                temp = []
        elif word.startswith('"') and not (len(word) > 1 and word.endswith('"')):
            # Opening quote whose closing quote is on a later word
            temp.append(word)
        else:
            # Plain word, or a single word that is quoted on both sides
            res.append(word)

    print(res)

Input:

parseString('This is "a test" to your split "string with quotes"')

Output: ['This', 'is', '"a test"', 'to', 'your', 'split', '"string with quotes"']
