Separate a Python string character by character while keeping inline tags intact

Question

I'm trying to make a custom tokenizer in Python that works with inline tags. The goal is to take a string input like this:

'This is *tag1* a test *tag2*.'

and have it output a list separated by tag and character:

['T', 'h', 'i', 's', ' ', 'i', 's', ' ', '*tag1*', ' ',  'a', ' ', 't', 'e', 's', 't', ' ', '*tag2*', '.']

Without the tags, I would just use list(), and I think I found a solution for dealing with a single tag type, but there are multiple. There are also other multi-character segments, such as ellipses, that are supposed to be encoded as a single feature.
One thing I tried is replacing each tag with a single unused character using a regex and then calling list() on the string:

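import re

text = 'This is *tag1* a test *tag2*.'
tags = re.findall(r'\*.*?\*', text)  # collect the tags so they can be restored later
text = re.sub(r'\*.*?\*', '#', text)  # replace each tag with a placeholder character
text = list(text)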

Then I would iterate over the list and replace each '#' with the extracted tags. However, I have multiple different features I am trying to extract, and repeating the process with a different placeholder character for each one before splitting the string seems like poor practice. Is there an easier way to do something like this? I'm still quite new to this, so there are a lot of common methods I am unaware of. I guess I could also use a larger regular expression that encompasses all of the features I'm trying to extract, but that still feels hacky, and I would prefer something more modular that can be used to find other features without writing a new expression every time.

Answer 1

You can use the following regex with re.findall:

\*[^*]*\*|.

The re.S (or re.DOTALL) flag can be used with this pattern so that . also matches line break characters, which it does not match by default.

Details

\*[^*]*\* - a * char, followed by zero or more chars other than *, and then a * char
| - or
. - any one char (including line breaks when re.S is set).

Python demo:

import re
s = 'This is *tag1* a test *tag2*.'
print( re.findall(r'\*[^*]*\*|.', s, re.S) )
# => ['T', 'h', 'i', 's', ' ', 'i', 's', ' ', '*tag1*', ' ', 'a', ' ', 't', 'e', 's', 't', ' ', '*tag2*', '.']
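
The question also mentions other multi-character segments, such as ellipses, and asks for something modular. One possible way to get that with the same alternation idea is to keep the feature patterns in a list and join them into a single regex. This is only a sketch: the names FEATURE_PATTERNS, build_tokenizer and tokenize are hypothetical, and the ellipsis pattern assumes an ellipsis is written as three literal dots.

```python
import re

# Hypothetical feature patterns. Order matters: earlier alternatives are tried first.
FEATURE_PATTERNS = [
    r'\*[^*]*\*',  # inline tags such as *tag1*
    r'\.{3}',      # an ellipsis written as three dots (assumption)
]

def build_tokenizer(patterns):
    """Compile one regex that tries each feature pattern, then any single character."""
    combined = '|'.join(f'(?:{p})' for p in patterns) + '|.'
    return re.compile(combined, re.S)

def tokenize(text, tokenizer):
    """Return the features and the remaining single characters, in order."""
    return tokenizer.findall(text)

TOKENIZER = build_tokenizer(FEATURE_PATTERNS)
tokens = tokenize('Wait... *tag1* ok', TOKENIZER)
print(tokens)
# => ['W', 'a', 'i', 't', '...', ' ', '*tag1*', ' ', 'o', 'k']
```

Adding a new feature then means appending one pattern to the list. Each pattern is wrapped in a non-capturing group so that re.findall keeps returning whole matches rather than group contents.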

Answer 2

I'm not sure exactly what would be best for you, but you should be able to use the split() method or the .format() method showcased below to get what you want.

# you can use this to get what you need
txt = 'This is *tag1* a test *tag2*.'
x = txt.split("*")  # splits at each *: ['This is ', 'tag1', ' a test ', 'tag2', '.']
print(x)
y = txt.split()  # splits at whitespace: ['This', 'is', '*tag1*', 'a', 'test', '*tag2*.']
print(y)

# also, you may be looking for something like this to format a string
mystring = 'This is {} a test {}.'.format('*tag1*', '*tag2*')
print(mystring)


# using split to get ['T', 'h', 'i', 's', ' ', 'i', 's', ' ', '*tag1*', ' ', 'a', ' ', 't', 'e', 's', 't', ' ', '*tag2*', '.']
txt = 'This is *tag1* a test *tag2*.'
parts = txt.split("*")  # ['This is ', 'tag1', ' a test ', 'tag2', '.']

finallist = []  # initialize the list
for i, part in enumerate(parts):
    if i % 2 == 1:
        # odd-indexed pieces sat between a pair of asterisks, so they are tags;
        # split() removed the asterisks, so put them back
        finallist.append('*' + part + '*')
    else:
        # everything else is plain text: add it one character at a time
        finallist.extend(part)

print(finallist)
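
The split approach can also be written with re.split, which keeps the delimiters in the result when the pattern contains a capturing group. The following is a small sketch of that variant, not part of the original answer; the variable names pieces and result are only illustrative.

```python
import re

txt = 'This is *tag1* a test *tag2*.'

# The capturing group makes re.split keep each matched tag in the output,
# so the pieces alternate: plain text, tag, plain text, tag, ...
pieces = re.split(r'(\*[^*]*\*)', txt)

result = []
for i, piece in enumerate(pieces):
    if i % 2 == 1:
        result.append(piece)   # a whole tag, asterisks included
    else:
        result.extend(piece)   # plain text, one character at a time

print(result)
# => ['T', 'h', 'i', 's', ' ', 'i', 's', ' ', '*tag1*', ' ', 'a', ' ', 't', 'e', 's', 't', ' ', '*tag2*', '.']
```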
