
Python split string on space or sentence inside of parenthesis

I was wondering if it would be possible to split a string such as

string = 'hello world [Im nick][introduction]'

into an array such as

['hello', 'world', '[Im nick][introduction]']

It doesn't have to be efficient. I just need a way to split a sentence into words, except that anything inside brackets stays together as a single, unsplit item.

I need this because I have a markdown file with sentences such as

- What is the weather in [San antonio, texas][location]

I need "San antonio, texas" to stay together as a single item in the array. Would this be possible? The array would look like:

array = ['What', 'is', 'the', 'weather', 'in', '[San antonio, texas][location]']

Maybe this could work for you:

>>> s = 'What is the weather in [San antonio, texas][location]'
>>> i1 = s.index('[')
>>> i2 = s.index('[', i1 + 1)
>>> part_1 = s[:i1].split()    # everything before the first bracket
>>> part_2 = [s[i1:i2], ]      # first bracket pair
>>> part_3 = [s[i2:], ]        # second bracket pair
>>> parts = part_1 + part_2 + part_3
>>> s
'What is the weather in [San antonio, texas][location]'
>>> parts
['What', 'is', 'the', 'weather', 'in', '[San antonio, texas]', '[location]']

It searches for the left brackets and uses that as a reference before splitting by spaces.

This assumes:

  • that there is no other text between the first closing bracket and the second opening bracket;
  • that there is nothing after the second closing bracket.
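
As a rough sketch, the two assumptions listed above can be checked explicitly instead of being relied on silently. The helper name split_two_brackets and the choice of raising ValueError are my own assumptions, not part of the answer:

```python
def split_two_brackets(s):
    """Split s into words followed by its two trailing bracket groups.

    Hypothetical helper: raises ValueError when s does not end in the
    '[...][...]' shape that the index-based approach assumes.
    """
    i1 = s.index('[')              # ValueError if there is no '[' at all
    close1 = s.index(']', i1)      # end of the first bracket group
    i2 = s.index('[', i1 + 1)      # ValueError if there is no second group
    if i2 != close1 + 1:
        raise ValueError('text between the two bracket groups')
    if s.index(']', i2) != len(s) - 1:
        raise ValueError('text after the second bracket group')
    return s[:i1].split() + [s[i1:i2], s[i2:]]


print(split_two_brackets('What is the weather in [San antonio, texas][location]'))
```

Failing loudly on an unexpected line is usually easier to debug than a wrong split that goes unnoticed.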

Here is a more robust solution:

def do_split(s):
    parts = []

    while '[' in s:
        start = s.index('[')
        end = s.index(']', s.index(']')+1) + 1  # looks for second closing bracket
        parts.extend(s[:start].split())     # everything before the opening bracket
        parts.append(s[start:end])          # 2 pairs of brackets
        s = s[end:]                         # remove processed part of the string

    parts.extend(s.split())                 # add remainder

    return parts

This yields:

>>> do_split('What is the weather in [San antonio, texas][location] on [friday][date]?')
['What', 'is', 'the', 'weather', 'in', '[San antonio, texas][location]', 'on', '[friday][date]', '?']
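
Note that do_split expects every bracketed run to contain exactly two closing brackets. A lone group such as 'see [docs] now' makes s.index raise ValueError. The variant below is a sketch under the assumption that a run may hold any number of directly adjacent [...] groups, and that an unmatched '[' should be treated as plain text. The name do_split_any is hypothetical.

```python
def do_split_any(s):
    """Like do_split, but a run may hold any number of adjacent [...] groups."""
    parts = []
    while '[' in s:
        start = s.index('[')
        end = start
        # Extend over each directly adjacent, properly closed group.
        while end < len(s) and s[end] == '[':
            close = s.find(']', end)
            if close == -1:        # unmatched '[': stop extending
                break
            end = close + 1
        if end == start:           # nothing was closed; leave the rest as plain text
            break
        parts.extend(s[:start].split())   # words before the bracketed run
        parts.append(s[start:end])        # the whole bracketed run
        s = s[end:]                       # drop the processed part
    parts.extend(s.split())               # add the remainder
    return parts


print(do_split_any('see [docs] now'))   # ['see', '[docs]', 'now']
```

For the two-group input shown above it returns the same list as do_split.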

Maybe this short snippet can help you. Note that it only works if every entry in the file has the shape you described: plain words followed by one bracketed part at the end of the line.

s = 'What is the weather in [San antonio, texas][location]'

s = s.split(' [')
s[1] = '[' + s[1] # add back the split character

mod = s[0] # store in a variable 

mod = mod.split(' ') # split the first part on space

mod.append(s[1]) # attach back the right part

print(mod)

Outputs:

['What', 'is', 'the', 'weather', 'in', '[San antonio, texas][location]']

And for s = 'hello world [Im nick][introduction]' it outputs:

['hello', 'world', '[Im nick][introduction]']
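
The same idea can be sketched with str.partition, which splits only at the first ' [' and also copes with lines that contain no bracket at all. The function name split_before_bracket is an assumption made for illustration:

```python
def split_before_bracket(s):
    """Split the words before the first ' [' and keep the rest as one item."""
    head, sep, tail = s.partition(' [')
    words = head.split()
    if sep:                        # a ' [' was found: add the bracket back
        words.append('[' + tail)
    return words


print(split_before_bracket('hello world [Im nick][introduction]'))
# ['hello', 'world', '[Im nick][introduction]']
```

Like the snippet above, this still assumes the bracketed part comes last on the line.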

For a one-liner, use functional programming tools such as reduce from the functools module. Splitting on " [" strips the opening bracket, so it is added back:

from functools import reduce
reduce(lambda x, y: x + ['[' + y] if y.endswith("]") else x + y.split(), s.split(" ["), [])

or, slightly shorter, using only the built-ins map and sum:

sum(map(lambda x: ['[' + x] if x.endswith("]") else x.split(), s.split(" [")), [])

You can use re.split with a negative lookahead. Note that it is simpler to filter out empty entries afterwards, with filter or a list comprehension, than to avoid them in the regex:

import re
s = 'sss sss bbb [zss sss][zsss ss]  sss sss bbb [ss sss][sss ss]'
# split on a space unless the next bracket character after it is a closing "]"
[x for x in re.split(r" (?![^\[\]]*\])", s) if x]

The code below works with your example. Hope it helps :) I'm sure it can be improved, but I have to go now. Enjoy.

string = 'hello world [Im nick][introduction]'
words = string.split(' ')
final = []
buffer = []  # tokens of a bracketed part that is still open

for word in words:
    if buffer or (word.startswith('[') and not word.endswith(']')):
        buffer.append(word)
        if word.endswith(']'):      # bracketed part is complete
            final.append(' '.join(buffer))
            buffer = []
    elif word:                      # ordinary word (skip empty strings)
        final.append(word)

print(final)

Let me offer an alternative to the ones above:

import re
string = 'hello world [Im nick][introduction]'
re.findall(r'(\[.+\]|\w+)', string)

Produces:

['hello', 'world', '[Im nick][introduction]']
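
One caveat with the pattern above: .+ is greedy, so on a line with several bracketed parts it matches from the first '[' to the last ']', and \w+ drops punctuation such as a trailing '?'. The following is a sketch of a stricter pattern, assuming brackets are never nested. The names TOKEN and tokenize are hypothetical.

```python
import re

# One or more directly adjacent [...] groups, or a run of characters
# that are neither whitespace nor an opening bracket.
TOKEN = re.compile(r'(?:\[[^\]]*\])+|[^\s\[]+')


def tokenize(s):
    """Return words and bracketed runs in order of appearance."""
    return TOKEN.findall(s)


print(tokenize('What is the weather in [San antonio, texas][location] on [friday][date]?'))
# ['What', 'is', 'the', 'weather', 'in', '[San antonio, texas][location]', 'on', '[friday][date]', '?']
```

An unmatched '[' is silently skipped by this pattern, so it is best suited to input where the brackets are known to be balanced.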
