Simplifying extraction of specific string patterns with multiple if-else and split()

Question

Given a string like this:

>>> s = "X/NOUN/dobj>_hold/VERB/ROOT_<membership/NOUN/dobj_<with/ADP/prep_<Y/PROPN/pobj_>,/PUNCT/punct"

First I want to split the string on underscores, i.e.:

>>> s.split('_')
['X/NOUN/dobj>',
 'hold/VERB/ROOT',
 '<membership/NOUN/dobj',
 '<with/ADP/prep',
 '<Y/PROPN/pobj',
 '>,/PUNCT/punct']

We assume that the underscore is used solely as the delimiter and never exists as part of the substrings we want to extract.

Then I need to check whether each of these "nodes" in the split list above starts or ends with a '>' or '<', remove it, and put the appropriate bracket at the end of the sublist, something like:

result = []
nodes = s.split('_')
for node in nodes:
    if node.endswith('>'):
        result.append( node[:-1].split('/') + ['>'] )
    elif node.startswith('>'):
        result.append(  node[1:].split('/') + ['>'] )
    elif node.startswith('<'):
        result.append(  node[1:].split('/') + ['<'] )
    elif node.endswith('<'):
        result.append(  node[:-1].split('/') + ['<'] )
    else:
        result.append(  node.split('/') + ['-'] )

And if it doesn't start or end with an angle bracket, we append '-' to the end of the sublist.

[out]:

[['X', 'NOUN', 'dobj', '>'],
 ['hold', 'VERB', 'ROOT', '-'],
 ['membership', 'NOUN', 'dobj', '<'],
 ['with', 'ADP', 'prep', '<'],
 ['Y', 'PROPN', 'pobj', '<'],
 [',', 'PUNCT', 'punct', '>']]

Given the original input string, is there a less verbose way to get to the result? Maybe with regex and groups?

Answer 1

s = 'X/NOUN/dobj>_hold/VERB/ROOT_<membership/NOUN/dobj_<with/ADP/prep_<Y/PROPN/pobj_>,/PUNCT/punct'

def get_sentinal(node):
    if not node:
        return '-'
    # Assuming the node won't contain both '<' and '>' at the same time
    for index in [0, -1]:
        if node[index] in '<>':
            return node[index]
    return '-'

results = [
    node.strip('<>').split('/') + [get_sentinal(node)]
    for node in s.split('_')
]

print(results)

This does not make it significantly shorter, but personally I think it's a little bit cleaner.
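The comprehension above leans on how str.strip behaves when given '<>'. A minimal sketch of that behaviour, using made-up variable names and two sample strings that are not from the question ('<<a/b>>' and 'a<b/c'): the argument is treated as a set of characters to remove from both ends, not as a literal prefix or suffix.

```python
# str.strip(chars) removes any run of the listed characters from BOTH ends.
leading = '<Y/PROPN/pobj'.strip('<>')    # bracket at the start
trailing = 'X/NOUN/dobj>'.strip('<>')    # bracket at the end
doubled = '<<a/b>>'.strip('<>')          # hypothetical input: repeated brackets
interior = 'a<b/c'.strip('<>')           # hypothetical input: bracket in the middle

print(leading)   # Y/PROPN/pobj
print(trailing)  # X/NOUN/dobj
print(doubled)   # a/b
print(interior)  # a<b/c  (characters inside the string are left alone)
```

So the strip call is safe for the question's data, but it would also swallow repeated brackets at the edges if those ever appeared.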

Answer 2

Use this:

import re
s_split = "X/NOUN/dobj>_hold/VERB/ROOT_<membership/NOUN/dobj_<with/ADP/prep_<Y/PROPN/pobj_>,/PUNCT/punct".split('_')
for i, text in enumerate(s_split):
    Left, Mid, Right = re.search('^([<>]?)(.*?)([<>]?)$', text).groups()
    s_split[i] = Mid.split('/') + [Left+Right or '-']

print(s_split)

I can't find a shorter answer than this.

Use `or` as a shorthand ternary to shorten the code. Example: print(None or "a") prints a, because `or` returns its right-hand operand when the left one is falsy. Also use a regex to parse the occurrence of < and > easily.
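A small sketch of the `or` fallback in isolation, with illustrative variable names of my own (left, right, markers). It assumes, as the answer does, that at most one of the two edge groups is non-empty.

```python
# An empty string is falsy, so `left + right or '-'` falls back to '-'
# only when neither edge group captured a bracket.
# Note that `+` binds tighter than `or`: this is (left + right) or '-'.
cases = [('<', ''), ('', '>'), ('', '')]
markers = [left + right or '-' for left, right in cases]

print(markers)  # ['<', '>', '-']
```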

Answer 3

Yes, although it's not pretty:

s = "X/NOUN/dobj>_hold/VERB/ROOT_<membership/NOUN/dobj_<with/ADP/prep_<Y/PROPN/pobj_>,/PUNCT/punct"

import re

out = []
for part in s.split('_'):
    Left, Mid, Right = re.search('^([<>]|)(.*?)([<>]|)$', part).groups()
    tail = ['-'] if not Left+Right else [Left+Right]
    out.append(Mid.split('/') + tail)

print(out)

Try online: https://repl.it/Civg

It relies on two main things:

a regex pattern which always makes three groups ()()() where the edge groups only look for the characters <, > or nothing ([<>]|), and the middle matches everything (non-greedy) (.*?). The whole thing is anchored at the start (^) and end ($) of the string so it consumes the whole input string.
Assuming that you will never have angle brackets on both ends of the string, the combined string Left+Right is either the single bracket to put at the end (an empty string plus the bracket, one way round or the other), or a completely empty string, indicating that a dash is required.
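To make the two points above concrete, here is a short sketch that prints the three groups for one node of each kind. The names pattern, examples and groups are only for illustration; the pattern itself is the one from the answer.

```python
import re

# Same pattern as in the answer: optional bracket, lazy middle, optional bracket.
pattern = re.compile(r'^([<>]|)(.*?)([<>]|)$')

examples = ['X/NOUN/dobj>', 'hold/VERB/ROOT', '<membership/NOUN/dobj']
groups = [pattern.search(node).groups() for node in examples]

for node, (left, mid, right) in zip(examples, groups):
    # left + right is '' when there is no bracket, which is what triggers the dash.
    print(node, '->', (left, mid, right))
# X/NOUN/dobj> -> ('', 'X/NOUN/dobj', '>')
# hold/VERB/ROOT -> ('', 'hold/VERB/ROOT', '')
# <membership/NOUN/dobj -> ('<', 'membership/NOUN/dobj', '')
```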

Answer 4

Instead of my other answer with regexes, you can drop a lot of lines and a lot of slicing if you know that string.strip('<>') strips either character from both ends of the string in one move.

This code is about halfway between your original and my regex answer in line count, but is more readable for it.

s = "X/NOUN/dobj>_hold/VERB/ROOT_<membership/NOUN/dobj_<with/ADP/prep_<Y/PROPN/pobj_>,/PUNCT/punct"

result = []
for node in s.split('_'):
    if node.startswith('>') or node.startswith('<'):
        tail = node[0]
    elif node.endswith('>') or node.endswith('<'):
        tail = node[-1]
    else:
        tail = '-'
    result.append( node.strip('<>').split('/') + [tail])

print(result)

Try online: https://repl.it/Civr

Edit: how much less verbose do you want to get?

result = [node.strip('<>').split('/') + [(''.join(char for char in node if char in '<>') + '-')[0]] for node in s.split('_')]
print(result)

This is quite neat: you don't have to check which side the angle bracket is on, or whether it's there at all. One step strip()s the angle bracket from whichever side it's on; the next step filters only the angle brackets out of the string (whichever side they're on) and appends the dash character. The result is either a string starting with the angle bracket or a single dash. Take character 0 to get the right one.
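The one-liner is easier to follow when unrolled. This is only a sketch of the same steps with a hypothetical helper name (marker_of), not a replacement for it:

```python
def marker_of(node):
    """Return '<' or '>' if the node contains one, otherwise '-'."""
    # Step 1: keep only the angle brackets, wherever they are in the node.
    brackets = ''.join(char for char in node if char in '<>')
    # Step 2: append the dash so the string is never empty.
    padded = brackets + '-'
    # Step 3: the first character is the bracket if there was one, else the dash.
    return padded[0]

for node in ['X/NOUN/dobj>', 'hold/VERB/ROOT', '<with/ADP/prep']:
    print(node.strip('<>').split('/') + [marker_of(node)])
# ['X', 'NOUN', 'dobj', '>']
# ['hold', 'VERB', 'ROOT', '-']
# ['with', 'ADP', 'prep', '<']
```

One thing to keep in mind: the filter scans the whole node, so a bracket in the middle of a token would also be picked up as the marker. That never happens in the question's data, but it is an assumption.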

Answer 5

Even shorter, with a list comprehension and some regex magic:

import re    
s = "X/NOUN/dobj>_hold/VERB/ROOT_<membership/NOUN/dobj_<with/ADP/prep_<Y/PROPN/pobj_>,/PUNCT/punct"

rx = re.compile(r'([<>])|/')
items = [list(filter(None, match)) \
    for item in s.split('_') \
    for match in [rx.split(item)]]

print(items)
# [['X', 'NOUN', 'dobj', '>'], ['hold', 'VERB', 'ROOT'], ['<', 'membership', 'NOUN', 'dobj'], ['<', 'with', 'ADP', 'prep'], ['<', 'Y', 'PROPN', 'pobj'], ['>', ',', 'PUNCT', 'punct']]

Explanation: the code splits the string on _, splits each item again with the help of the regular expression rx, and filters out the empty elements at the end.

See a demo on ideone.com.
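Two details are worth spelling out. First, because rx has a capturing group, re.split keeps the captured bracket in the result and yields None wherever the plain / branch matched, which is why the filter is needed. Second, the output above keeps the bracket where it was found and adds no dash, so it is not quite the layout the question asks for. The sketch below shows the raw split and one possible post-processing step; normalise is a hypothetical helper of mine, not part of the answer.

```python
import re

s = "X/NOUN/dobj>_hold/VERB/ROOT_<membership/NOUN/dobj_<with/ADP/prep_<Y/PROPN/pobj_>,/PUNCT/punct"
rx = re.compile(r'([<>])|/')

# Raw split: '' at the edge, the captured bracket, and None for each '/' match.
raw = rx.split('<with/ADP/prep')
print(raw)  # ['', '<', 'with', None, 'ADP', None, 'prep']

def normalise(parts):
    """Move the bracket to the end, or append '-' when there is none."""
    brackets = [p for p in parts if p in ('<', '>')]
    fields = [p for p in parts if p not in ('<', '>')]
    return fields + (brackets or ['-'])

items = [normalise(list(filter(None, rx.split(item)))) for item in s.split('_')]
print(items)
```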

Answer 6

I did not use regex and groups, but this may work as a shorter solution.

>>> result = []
>>> nodes = ['X/NOUN/dobj>', 'hold/VERB/ROOT', '<membership/NOUN/dobj',
...          '<with/ADP/prep', '<Y/PROPN/pobj', '>,/PUNCT/punct']
>>> for node in nodes:
...     # Put a '/' next to the bracket so that split('/') isolates it.
...     nd = node.replace(">", "/>" if node.endswith(">") else ">/")
...     nc = nd.replace("<", "/<" if nd.endswith("<") else "</")
...     result.append(nc.split("/"))
...
>>> # nres flattens all the results into a single list; use result if you don't need that.
>>> nres = [inner for outer in result for inner in outer]

6 answers

Answer 1: 3 (accepted), 2016-08-03 05:21:32
Answer 2: 2, 2016-08-03 05:57:20
Answer 3: 1, 2016-08-03 05:46:42
Answer 4: 1, 2016-08-03 05:53:23
Answer 5: 1, 2016-08-03 09:37:09
Answer 6: 0, 2016-08-03 06:44:58
