
Split at double quotes in Python

import shlex
fil=open("./demoshlex.txt",'r')
line=fil.readline()
print line
print shlex.split(line)

Suppose my string is as below in a text file:

line1 :

asfdsafadfa "Tabvxc "avcx"sdasaf" sadasfdf. sdsadsaf '0000000000000000000000000000000'." is something

I want to split the line and form a list as follows:

[asfdsafadfa, "Tabvxc "avcx"sdasaf" sadasfdf. sdsadsaf '0000000000000000000000000000000'.", is something]

I tried using shlex.split, but it raised an exception. The code is above; the output and exception are below.

**Output:**
python basicshelx.py
asfdsafadfa "Tabvxc "avcx"sdasaf" sadasfdf. sdsadsaf '0000000000000000000000000000000'."

Traceback (most recent call last):
File "basicshelx.py", line 5, in <module>
print shlex.split(line)
File "/home/siddhant/sid/.local/lib/python2.7/shlex.py", line 279, in split
return list(lex)
File "/home/siddhant/sid/.local/lib/python2.7/shlex.py", line 269, in next
token = self.get_token()
File "/home/siddhant/sid/.local/lib/python2.7/shlex.py", line 96, in get_token
raw = self.read_token()
File "/home/siddhant/sid/.local/lib/python2.7/shlex.py", line 172, in read_token
raise ValueError, "No closing quotation"
ValueError: No closing quotation

It seems to me that you want to split only on the first occurrence of " and keep all the remaining " characters in the second element of your output list.

Here is an example using only built-in string methods, no import needed:

result = []
with open('test.txt', 'r') as openfile:
    for line in openfile:
        # strip spaces and \n from the line
        line = line.strip()
        # split the line on "
        my_list = line.split('"')
        # only append first element of the list to the result
        result.append(my_list[0].strip())
        # rebuild the second part, adding back in the "
        remainder = '"' + '"'.join(my_list[1:])
        # append the second part to the result
        result.append(remainder)
print(result)

output:

['asfdsafadfa', '"Tabvxc "avcx"sdasaf" sadasfdf. sdsadsaf \'0000000000000000000000000000000\'."']

or if you print the individual elements of the output list:

for e in result:
    print(e)

output:

asfdsafadfa
"Tabvxc "avcx"sdasaf" sadasfdf. sdsadsaf '0000000000000000000000000000000'."

[Edit based on comment]

As per the comments, you can use .split('"', 1). Example:

with open('test.txt', 'r') as openfile:
    for line in openfile:
        # strip spaces and \n from the line
        line = line.strip()
        # split the line on " but only the first one
        result = line.split('"', 1)
        # add in the " for the second element
        result[1] = '"' + result[1]
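
If a line contains no double quote at all, result[1] in the example above would raise an IndexError. A minimal sketch of one way around that, assuming str.partition is acceptable here; the helper name split_first_quote is made up for illustration and is not part of the original answer:

```python
def split_first_quote(line):
    """Split a line at its first double quote, keeping the quote in part two."""
    # partition returns (before, separator, after); the separator is ''
    # when the line contains no double quote
    before, quote, after = line.strip().partition('"')
    if not quote:
        # no double quote found: return the whole line as a single element
        return [before]
    return [before.strip(), quote + after]


print(split_first_quote('abc "def "ghi" jkl"'))
print(split_first_quote('no quotes here'))
```

Unlike .split('"', 1), partition always returns three values, so the missing-quote case can be checked without catching an exception.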

[Edit based on updated question and comment]

Comment from OP:

I want only the quoted part ie remove "is something" from that element of result List and make it [2] element

As the question was updated with a trailing "is something" string in the input, which needs to be omitted from the output, the example now becomes:

with open('test.txt', 'r') as openfile:
    for line in openfile:
        # strip spaces and \n from the line
        line = line.strip()
        # split the line on " but only the first one
        result = line.split('"', 1)
        # add in the " for the second element, remove trailing string
        result[1] = '"{}"'.format(result[1].rsplit('"', 1)[0])

However, a file is likely to contain multiple lines. In that case you need to build up a list of outputs, one for each line. The example then becomes:

result = []
with open('test.txt', 'r') as openfile:
    for line in openfile:
        if '"' in line:
            # we can split the line on "
            line = line.strip().split('"', 1)
            if line[1].endswith('"'):
                # no trailing string to remove
                # pre-fix second element with "
                line[1] = '"{}'.format(line[1])
            elif '"' in line[1]:
                # trailing string to be removed with .rsplit()[0]
                # post- and pre-fix " for second element 
                line[1] = '"{}"'.format(line[1].rsplit('"', 1)[0])
        else:
            # no " in line, return line as one element list
            line = [line.strip()]
        result.append(line)

# result is now a list of lists
for line in result:
    for e in line:
        print(e)
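
The question asks for the trailing text as a third list element rather than dropping it. A rough sketch of that variant, assuming the outermost pair of double quotes always delimits the quoted part; split_outer_quotes is a hypothetical helper name, and it uses str.find and str.rfind instead of split:

```python
def split_outer_quotes(line):
    """Return [before, quoted, after] around the outermost double quotes."""
    line = line.strip()
    first = line.find('"')
    last = line.rfind('"')
    if first == -1 or first == last:
        # fewer than two double quotes: nothing to split on
        return [line]
    before = line[:first].strip()
    quoted = line[first:last + 1]
    after = line[last + 1:].strip()
    # drop an empty leading or trailing part; the quoted part is never empty
    return [part for part in (before, quoted, after) if part]


sample = '''asfdsafadfa "Tabvxc "avcx"sdasaf" sadasfdf. sdsadsaf '0000000000000000000000000000000'." is something'''
for part in split_outer_quotes(sample):
    print(part)
```

For the sample line this prints the leading word, the quoted part including its outer quotes, and "is something" on three separate lines.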

The best way would be to use the re module:

import re

s = '''asfdsafadfa "Tabvxc "avcx"sdasaf" sadasfdf. sdsadsaf '0000000000000000000000000000000'." is something'''

pat = re.compile(
    r'''
    ^      # beginning of a line
    (.*?)  # first part. the *? means non-greedy
    (".*") # part between the outermost ", ("-included)
    (.*?)  # last part
    $      # end of a line
    ''', re.DOTALL|re.VERBOSE)
pat.match(s).groups()
('asfdsafadfa ',
 '"Tabvxc "avcx"sdasaf" sadasfdf. sdsadsaf \'0000000000000000000000000000000\'."',
 ' is something')
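
The pattern relies on the middle group being greedy, so it reaches from the first double quote to the last one. A small sketch of the difference between the greedy and non-greedy forms, using a shortened sample string invented for illustration:

```python
import re

s = 'before "outer "inner" text" after'

# greedy: .* runs to the last double quote in the string
greedy = re.search(r'".*"', s).group(0)
# non-greedy: .*? stops at the next double quote it meets
lazy = re.search(r'".*?"', s).group(0)

print(greedy)  # "outer "inner" text"
print(lazy)    # "outer "
```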

So in total this would become:

import re
from io import StringIO

test_str = '''asfdsafadfa "Tabvxc "avcx"sdasaf" sadasfdf. sdsadsaf '0000000000000000000000000000000'." is something
asfdsafadfa "Tabvxc "avcx"sdasaf" sadasfdf. sdsadsaf '0000000000000000000000000000000'."
asfdsafadfa Tabvxc avcxsdasaf sadasfdf. sdsadsaf '0000000000000000000000000000000'.
'''
def split_lines(filehandle):
    pat = re.compile(r'''^(.*?)(".*")(.*?)$''', re.DOTALL)
    for line in filehandle:
        match = pat.match(line)
        if match:
            yield match.groups()
        else:
            yield line

with StringIO(test_str) as openfile:
    for line in split_lines(openfile):
        print(line)

The generator iterates over the lines of the open file handle and tries to split each one. If the pattern matches, it yields a tuple with the three parts; otherwise it yields the original line unchanged.

In your actual program you can replace StringIO(test_str) with open(filename, 'r').

output:

('asfdsafadfa ', '"Tabvxc "avcx"sdasaf" sadasfdf. sdsadsaf \'0000000000000000000000000000000\'."', ' is something')
('asfdsafadfa ', '"Tabvxc "avcx"sdasaf" sadasfdf. sdsadsaf \'0000000000000000000000000000000\'."', '')
asfdsafadfa Tabvxc avcxsdasaf sadasfdf. sdsadsaf '0000000000000000000000000000000'.

Your original string seems badly quoted to start with. You can escape quotes by preceding them with a \ like so:

my_var = "Tabvxc \"avcx\"sdasaf\" sadasfdf. sdsadsaf '0000000000000000000000000000000'."

You can then proceed with splitting it like so:

my_var.split('"')
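
Note that escaping only matters when the string is written as a literal in source code; text read from a file needs no escaping. A quick sketch of what the split above produces, with the variable name parts added for illustration:

```python
my_var = "Tabvxc \"avcx\"sdasaf\" sadasfdf. sdsadsaf '0000000000000000000000000000000'."

# the backslashes exist only in the source code; the string itself
# contains three plain double-quote characters
parts = my_var.split('"')
print(len(parts))
for part in parts:
    print(repr(part))
```

Splitting on every double quote gives four pieces here, so the inner quotes are lost; this differs from the single quoted element asked for in the question.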
