
Simplifying the extraction of particular string patterns with multiple if-else and split()

Given a string like this:

>>> s = "X/NOUN/dobj>_hold/VERB/ROOT_<membership/NOUN/dobj_<with/ADP/prep_<Y/PROPN/pobj_>,/PUNCT/punct"

First I want to split the string by underscores, i.e.:

>>> s.split('_')
['X/NOUN/dobj>',
 'hold/VERB/ROOT',
 '<membership/NOUN/dobj',
 '<with/ADP/prep',
 '<Y/PROPN/pobj',
 '>,/PUNCT/punct']

We assume that the underscore is used solely as the delimiter and never appears in the substrings we want to extract.

Then I need to check whether each of these "nodes" in the split list above starts or ends with a '>' or '<', remove it, and put the appropriate bracket at the end of the sublist, something like:

result = []
nodes = s.split('_')
for node in nodes:
    if node.endswith('>'):
        result.append( node[:-1].split('/') + ['>'] )
    elif node.startswith('>'):
        result.append(  node[1:].split('/') + ['>'] )
    elif node.startswith('<'):
        result.append(  node[1:].split('/') + ['<'] )
    elif node.endswith('<'):
        result.append(  node[:-1].split('/') + ['<'] )
    else:
        result.append(  node.split('/') + ['-'] )

And if it doesn't start or end with an angle bracket, we append '-' to the end of the sublist.

[out]:

[['X', 'NOUN', 'dobj', '>'],
 ['hold', 'VERB', 'ROOT', '-'],
 ['membership', 'NOUN', 'dobj', '<'],
 ['with', 'ADP', 'prep', '<'],
 ['Y', 'PROPN', 'pobj', '<'],
 [',', 'PUNCT', 'punct', '>']]

Given the original input string, is there a less verbose way to get to the result? Maybe with regex and groups?

s = 'X/NOUN/dobj>_hold/VERB/ROOT_<membership/NOUN/dobj_<with/ADP/prep_<Y/PROPN/pobj_>,/PUNCT/punct'

def get_sentinal(node):
    if not node:
        return '-'
    # Assuming the node won't contain both '<' and '>' at a same time
    for index in [0, -1]:
        if node[index] in '<>':
            return node[index]
    return '-'

results = [
    node.strip('<>').split('/') + [get_sentinal(node)]
    for node in s.split('_')
]

print(results)

This doesn't make it significantly shorter, but personally I think it's a little bit cleaner.

Use this:

import re
s_split = "X/NOUN/dobj>_hold/VERB/ROOT_<membership/NOUN/dobj_<with/ADP/prep_<Y/PROPN/pobj_>,/PUNCT/punct".split('_')
for i, text in enumerate(s_split):
    Left, Mid, Right = re.search('^([<>]?)(.*?)([<>]?)$', text).groups()
    s_split[i] = Mid.split('/') + [Left+Right or '-']

print(s_split)

I couldn't find a shorter solution than this.

Use the or operator as a compact ternary to shorten the code. Example: print(None or "a") will print a. Also use a regex to parse the occurrence of <> easily.
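To make the or idiom concrete, here is a minimal sketch. It relies on the fact that an empty string is falsy in Python, so Left+Right or '-' yields the captured bracket when there is one and the dash otherwise. The helper name tail_for is hypothetical and only used for illustration.

```python
def tail_for(left, right):
    # `+` binds tighter than `or`, so this is (left + right) or '-'.
    # An empty string is falsy, so '-' is returned when no bracket was captured.
    return left + right or '-'

print(tail_for('', '>'))  # >
print(tail_for('<', ''))  # <
print(tail_for('', ''))   # -
```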

Yes, although it's not pretty:

s = "X/NOUN/dobj>_hold/VERB/ROOT_<membership/NOUN/dobj_<with/ADP/prep_<Y/PROPN/pobj_>,/PUNCT/punct"

import re

out = []
for part in s.split('_'):
    Left, Mid, Right = re.search('^([<>]|)(.*?)([<>]|)$', part).groups()
    tail = ['-'] if not Left+Right else [Left+Right]
    out.append(Mid.split('/') + tail)

print(out)

Try online: https://repl.it/Civg

It relies on two main things:

  1. A regex pattern which always makes three groups ()()() where the edge groups only look for the characters <, > or nothing ([<>]|), and the middle matches everything (non-greedy) (.*?). The whole thing is anchored at the start (^) and end ($) of the string so it consumes the whole input string.
  2. Assuming that you will never have angle brackets on both ends of the string, the combined string Left+Right will either be the single bracket character to put at the end (an empty string plus that character, one way or the other), or a completely empty string indicating that a dash is required.
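As a rough illustration of the two points above, the sketch below applies the same pattern to individual nodes and shows what the three groups contain. The function name split_node is hypothetical, not part of the answer.

```python
import re

# Same pattern as in the answer: optional bracket, lazy middle, optional bracket.
PATTERN = re.compile(r'^([<>]|)(.*?)([<>]|)$')

def split_node(node):
    # Returns the (left, mid, right) groups; at most one of left/right is
    # non-empty under the assumption that a node has a bracket on one end only.
    return PATTERN.search(node).groups()

print(split_node('X/NOUN/dobj>'))           # ('', 'X/NOUN/dobj', '>')
print(split_node('hold/VERB/ROOT'))         # ('', 'hold/VERB/ROOT', '')
print(split_node('<membership/NOUN/dobj'))  # ('<', 'membership/NOUN/dobj', '')
```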

Instead of my other answer with regexes, you can drop a lot of lines and a lot of slicing if you know that string.strip('<>') will strip either character from both ends of the string in one move.

This code is about halfway between your original and my regex answer in line count, but is more readable for it.

s = "X/NOUN/dobj>_hold/VERB/ROOT_<membership/NOUN/dobj_<with/ADP/prep_<Y/PROPN/pobj_>,/PUNCT/punct"

result = []
for node in s.split('_'):
    if node.startswith('>') or node.startswith('<'):
        tail = node[0]
    elif node.endswith('>') or node.endswith('<'):
        tail = node[-1]
    else:
        tail = '-'
    result.append( node.strip('<>').split('/') + [tail])

print(result)

Try online: https://repl.it/Civr


Edit: how much less verbose do you want to get?

result = [node.strip('<>').split('/') + [(''.join(char for char in node if char in '<>') + '-')[0]] for node in s.split('_')]
print(result)

This is quite neat: you don't have to check which side the <> is on, or whether it's there at all. One step strip()s either angle bracket, whichever side it's on; the next step filters only the angle brackets out of the string (whichever side they're on) and adds the dash character. The result is either a string starting with the angle bracket or a single dash. Take char 0 to get the right one.

Even shorter with a list comprehension and some regex magic:
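The one-liner can be easier to follow when the bracket-or-dash step is pulled out on its own. This is only a sketch of the same idea; the helper name bracket_or_dash is hypothetical.

```python
def bracket_or_dash(node):
    # Keep only the angle brackets, whichever end they are on.
    brackets = ''.join(char for char in node if char in '<>')
    # Appending '-' guarantees a non-empty string; its first character is
    # the bracket if there was one, otherwise the dash.
    return (brackets + '-')[0]

for node in ['X/NOUN/dobj>', 'hold/VERB/ROOT', '>,/PUNCT/punct']:
    print(node.strip('<>').split('/') + [bracket_or_dash(node)])
# ['X', 'NOUN', 'dobj', '>']
# ['hold', 'VERB', 'ROOT', '-']
# [',', 'PUNCT', 'punct', '>']
```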

import re    
s = "X/NOUN/dobj>_hold/VERB/ROOT_<membership/NOUN/dobj_<with/ADP/prep_<Y/PROPN/pobj_>,/PUNCT/punct"

rx = re.compile(r'([<>])|/')
items = [list(filter(None, match)) \
    for item in s.split('_') \
    for match in [rx.split(item)]]

print(items)
# [['X', 'NOUN', 'dobj', '>'], ['hold', 'VERB', 'ROOT'], ['<', 'membership', 'NOUN', 'dobj'], ['<', 'with', 'ADP', 'prep'], ['<', 'Y', 'PROPN', 'pobj'], ['>', ',', 'PUNCT', 'punct']]


Explanation: The code splits the string by _, splits each item again with the help of the regular expression rx, and filters out empty elements at the end.
See a demo on ideone.com.

I did not use regex and groups, but this could be a shorter solution.
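To see why the filtering step is needed, the sketch below prints the raw result of rx.split for one node. Note that this approach keeps the bracket where it occurred and adds no dash, so the output differs from the format requested in the question. The normalize helper is a hypothetical addition (not part of the answer) that moves the bracket to the end or appends '-'.

```python
import re

rx = re.compile(r'([<>])|/')

# Splitting on '/' leaves the unmatched capture group as None; a bracket at
# the start of the node leaves an empty string before it. filter(None, ...)
# drops both kinds of falsy entry.
raw = rx.split('<with/ADP/prep')
print(raw)  # ['', '<', 'with', None, 'ADP', None, 'prep']

parts = list(filter(None, raw))
print(parts)  # ['<', 'with', 'ADP', 'prep']

def normalize(parts):
    # Hypothetical helper: put the bracket last, or append '-' if there is none.
    brackets = [p for p in parts if p in ('<', '>')]
    rest = [p for p in parts if p not in ('<', '>')]
    return rest + (brackets[:1] or ['-'])

print(normalize(parts))  # ['with', 'ADP', 'prep', '<']
```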

>>> result = []
>>> nodes = ['X/NOUN/dobj>', 'hold/VERB/ROOT', '<membership/NOUN/dobj',
...          '<with/ADP/prep', '<Y/PROPN/pobj', '>,/PUNCT/punct']
>>> for node in nodes:
...     nd = node.replace(">", "/>" if node.endswith(">") else ">/")
...     nc = nd.replace("<", "/<" if nd.endswith("<") else "</")
...     result.append(nc.split("/"))
...
>>> # nres flattens all results into a single list; use result if you don't need that.
>>> nres = [inner for outer in result for inner in outer]
