Simplifying the extraction of particular string patterns with multiple if-else and split()
Given a string like this:
>>> s = "X/NOUN/dobj>_hold/VERB/ROOT_<membership/NOUN/dobj_<with/ADP/prep_<Y/PROPN/pobj_>,/PUNCT/punct"
First, I want to split the string by underscores, i.e.:
>>> s.split('_')
['X/NOUN/dobj>',
'hold/VERB/ROOT',
'<membership/NOUN/dobj',
'<with/ADP/prep',
'<Y/PROPN/pobj',
'>,/PUNCT/punct']
We assume that the underscore is used solely as the delimiter and never exists as part of the substrings we want to extract.
Then I need to check whether each of these "nodes" in the split list above starts or ends with a '>' or '<', remove it, and put the appropriate bracket at the end of the sublist, something like:
result = []
nodes = s.split('_')
for node in nodes:
    if node.endswith('>'):
        result.append(node[:-1].split('/') + ['>'])
    elif node.startswith('>'):
        result.append(node[1:].split('/') + ['>'])
    elif node.startswith('<'):
        result.append(node[1:].split('/') + ['<'])
    elif node.endswith('<'):
        result.append(node[:-1].split('/') + ['<'])
    else:
        result.append(node.split('/') + ['-'])
And if it doesn't start or end with an angle bracket, then we append - to the end of the sublist.
[out]:
[['X', 'NOUN', 'dobj', '>'],
['hold', 'VERB', 'ROOT', '-'],
['membership', 'NOUN', 'dobj', '<'],
['with', 'ADP', 'prep', '<'],
['Y', 'PROPN', 'pobj', '<'],
[',', 'PUNCT', 'punct', '>']]
Given the original input string, is there a less verbose way to get to the result? Maybe with regex and groups?
s = 'X/NOUN/dobj>_hold/VERB/ROOT_<membership/NOUN/dobj_<with/ADP/prep_<Y/PROPN/pobj_>,/PUNCT/punct'

def get_sentinal(node):
    if not node:
        return '-'
    # Assuming the node won't contain both '<' and '>' at the same time
    for index in [0, -1]:
        if node[index] in '<>':
            return node[index]
    return '-'

results = [
    node.strip('<>').split('/') + [get_sentinal(node)]
    for node in s.split('_')
]
print(results)
This does not make it significantly shorter, but personally I think it's a little bit cleaner.
Use this:
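import re

s_split = "X/NOUN/dobj>_hold/VERB/ROOT_<membership/NOUN/dobj_<with/ADP/prep_<Y/PROPN/pobj_>,/PUNCT/punct".split('_')
for i, text in enumerate(s_split):
    # Optional leading bracket, lazy middle, optional trailing bracket
    Left, Mid, Right = re.search(r'^([<>]?)(.*?)([<>]?)$', text).groups()
    # Left+Right is the bracket if one was found, otherwise '' and the dash is used
    s_split[i] = Mid.split('/') + [Left+Right or '-']
print(s_split)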
I can't find a shorter solution than this.
Use or as a compact ternary to shorten the code. For example, print(None or "a") will print a. Also use a regex to parse the occurrence of <> easily.
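As a rough illustration of the or idiom mentioned above, here is a minimal sketch. The helper name bracket_or_dash is hypothetical and not part of the answer; it only assumes that each argument is '<', '>' or an empty string.

```python
# `a or b` evaluates to a when a is truthy, otherwise to b.
# An empty string is falsy, so '' or '-' gives '-'.
def bracket_or_dash(left, right):
    # Hypothetical helper: left and right are each '<', '>' or ''.
    # Concatenation binds tighter than `or`, so this is (left + right) or '-'.
    return left + right or '-'

print(bracket_or_dash('', '>'))  # bracket found at the end
print(bracket_or_dash('<', ''))  # bracket found at the start
print(bracket_or_dash('', ''))   # no bracket, falls back to the dash
```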
Yes, although it's not pretty:
s = "X/NOUN/dobj>_hold/VERB/ROOT_<membership/NOUN/dobj_<with/ADP/prep_<Y/PROPN/pobj_>,/PUNCT/punct"

import re

out = []
for part in s.split('_'):
    Left, Mid, Right = re.search('^([<>]|)(.*?)([<>]|)$', part).groups()
    tail = ['-'] if not Left+Right else [Left+Right]
    out.append(Mid.split('/') + tail)
print(out)
Try online: https://repl.it/Civg
It relies on two main things:
A regex with three groups, ()()(), where the edge groups only look for the characters <, > or nothing, ([<>]|), and the middle group matches everything (non-greedy), (.*?). The whole thing is anchored at the start (^) and end ($) of the string so it consumes the whole input string.
Left+Right will either be an empty string plus the character to put at the end, one way or the other, or a completely empty string indicating that a dash is required.
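To make the three groups concrete, the sketch below applies the same pattern to a few nodes from the question and prints what each group captures. The names PATTERN and split_node are illustrative only.

```python
import re

# Same pattern as in the answer: optional bracket, lazy middle, optional bracket,
# anchored at both ends so the whole node is consumed.
PATTERN = re.compile(r'^([<>]|)(.*?)([<>]|)$')

def split_node(node):
    # Returns the tuple (left, mid, right); left and right are '<', '>' or ''.
    return PATTERN.search(node).groups()

for node in ['X/NOUN/dobj>', 'hold/VERB/ROOT', '<with/ADP/prep']:
    print(node, '->', split_node(node))
```

At most one of the edge groups is non-empty for the inputs in the question, which is why concatenating them yields either the bracket or an empty string.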
将是一个空字符串加上要放在末尾的字符,一种方式或另一种方式,或者一个完全空的字符串表示一个破折号是必须的。 Instead of my other answer with regexes, you can drop a lot of lines and a lot of slicing, if you know that string.strip('<>')
will strip either character from both ends of the string, in one move. 如果你知道
string.strip('<>')
将在一次移动中从字符串的两端 string.strip('<>')
任一字符,而不是我的正则表达式的其他答案,你可以删除很多行和大量的切片。
This code is about halfway between your original and my regex answer in line count, but is more readable for it.
s = "X/NOUN/dobj>_hold/VERB/ROOT_<membership/NOUN/dobj_<with/ADP/prep_<Y/PROPN/pobj_>,/PUNCT/punct"

result = []
for node in s.split('_'):
    if node.startswith('>') or node.startswith('<'):
        tail = node[0]
    elif node.endswith('>') or node.endswith('<'):
        tail = node[-1]
    else:
        tail = '-'
    # strip('<>') removes either bracket from either end in one call
    result.append(node.strip('<>').split('/') + [tail])
print(result)
Try online: https://repl.it/Civr
Edit: how much less verbose do you want to get?
result = [node.strip('<>').split('/') + [(''.join(char for char in node if char in '<>') + '-')[0]] for node in s.split('_')]
print(result)
This is quite neat: you don't have to check which side the <> is on, or whether it's there at all. One step strip()s either angle bracket, whichever side it's on. The next step filters only the angle brackets out of the string (whichever side they're on) and adds the dash character. The result is either a string starting with an angle bracket from either side, or a single dash. Take char 0 to get the right one.
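The one-liner can be unpacked into named steps, roughly as follows. This is only a sketch of the same logic; the function name tail_for is made up for illustration.

```python
def tail_for(node):
    # Keep only the angle brackets found anywhere in the node.
    brackets = ''.join(char for char in node if char in '<>')
    # Append the fallback dash, then take the first character:
    # a bracket if one was present, otherwise the dash itself.
    return (brackets + '-')[0]

node = '<membership/NOUN/dobj'
# strip('<>') drops the bracket from whichever end it is on.
fields = node.strip('<>').split('/')
print(fields + [tail_for(node)])
```

Note that the filter looks at every character, so this relies on the brackets never appearing in the middle of a node.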
Even shorter with a list comprehension and some regex magic:
import re
s = "X/NOUN/dobj>_hold/VERB/ROOT_<membership/NOUN/dobj_<with/ADP/prep_<Y/PROPN/pobj_>,/PUNCT/punct"
rx = re.compile(r'([<>])|/')
items = [list(filter(None, match)) \
for item in s.split('_') \
for match in [rx.split(item)]]
print(items)
# [['X', 'NOUN', 'dobj', '>'], ['hold', 'VERB', 'ROOT'], ['<', 'membership', 'NOUN', 'dobj'], ['<', 'with', 'ADP', 'prep'], ['<', 'Y', 'PROPN', 'pobj'], ['>', ',', 'PUNCT', 'punct']]
This splits the items by _, splits each one again with the help of the regular expression rx, and filters out the empty elements at the end.
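A small sketch of why the filter step is needed: when a pattern passed to re.split contains a capturing group, the captured text is kept in the result, and a group that did not take part in a match shows up as None. The variable names raw and cleaned are illustrative.

```python
import re

# Same pattern as above: capture an angle bracket, or split on a slash.
rx = re.compile(r'([<>])|/')

# Splitting on '/' leaves None where the bracket group did not participate;
# a bracket at the very end leaves a trailing empty string.
raw = rx.split('X/NOUN/dobj>')
print(raw)

# filter(None, ...) drops both the None entries and the empty strings.
cleaned = list(filter(None, raw))
print(cleaned)
```

As the output comment in the answer shows, this variant keeps each bracket in the position where it occurred and adds no '-' placeholder, so it differs slightly from the output requested in the question.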
I did not use regex and groups, but this can be a shorter solution.
>>> result = []
>>> nodes = ['X/NOUN/dobj>', 'hold/VERB/ROOT', '<membership/NOUN/dobj',
...          '<with/ADP/prep', '<Y/PROPN/pobj', '>,/PUNCT/punct']
>>> for node in nodes:
...     nd = node.replace(">", "/>" if node.endswith(">") else ">/")
...     nc = nd.replace("<", "/<" if nd.endswith("<") else "</")
...     result.append(nc.split("/"))
...
>>> # nres flattens all the results into a single list; if you don't need a single list, use result directly.
>>> nres = [inner for outer in result for inner in outer]