
Separating a Python string by character while keeping inline tags intact

I'm trying to make a custom tokenizer in Python that works with inline tags. The goal is to take a string input like this:

'This is *tag1* a test *tag2*.'

and have it output a list separated by tag and character:

['T', 'h', 'i', 's', ' ', 'i', 's', ' ', '*tag1*', ' ',  'a', ' ', 't', 'e', 's', 't', ' ', '*tag2*', '.']

Without the tags, I would just use list(), and I think I found a solution for dealing with a single tag type, but there are multiple. There are also other multi-character segments, such as ellipses, that are supposed to be encoded as a single feature.
One thing I tried is replacing the tag with a single unused character using a regex and then calling list() on the string:

import re

text = 'This is *tag1* a test *tag2*.'
# re.findall collects every tag; re.match would only look at the start of the string
tags = re.findall(r'\*.*?\*', text)
text = re.sub(r'\*.*?\*', '#', text)
text = list(text)

Then I would iterate over it and replace each '#' with the extracted tags, but I have multiple different features I am trying to extract, and repeating the process several times with different placeholder characters before splitting the string seems like poor practice. Is there an easier way to do something like this? I'm still quite new to this, so there are a lot of common methods I am unaware of. I guess I could also use a larger regular expression that encompasses all of the features I'm trying to extract, but that still feels hacky, and I would prefer something more modular that can be used to find other features without writing a new expression every time.

You can use the following regex with re.findall:

\*[^*]*\*|.

See the regex demo. The re.S (or re.DOTALL) flag can be used with this pattern so that . also matches line break characters, which it does not match by default.

Details

  • \*[^*]*\* - a * char, followed by zero or more chars other than *, and then a *
  • | - or
  • . - any one char (including line breaks when re.S is set).
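
To make the effect of the flag concrete, here is a small sketch that runs the same pattern with and without re.S. The sample string and the variable names are made up for illustration.

```python
import re

pattern = r'\*[^*]*\*|.'
s = 'a\n*tag1*'  # hypothetical input containing a line break

# Without re.S, '.' cannot match '\n', so the newline is silently dropped.
without_flag = re.findall(pattern, s)

# With re.S, '.' matches '\n' as well, so every character is kept.
with_flag = re.findall(pattern, s, re.S)

print(without_flag)  # ['a', '*tag1*']
print(with_flag)     # ['a', '\n', '*tag1*']
```

If the tokens must cover the whole input with nothing lost, passing re.S is probably the safer default.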

See the Python demo:

import re
s = 'This is *tag1* a test *tag2*.'
print( re.findall(r'\*[^*]*\*|.', s, re.S) )
# => ['T', 'h', 'i', 's', ' ', 'i', 's', ' ', '*tag1*', ' ', 'a', ' ', 't', 'e', 's', 't', ' ', '*tag2*', '.']
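
Since the question asks for something more modular, one possible way to extend this idea is to keep each multi-character feature as its own small pattern and join them into a single alternation. This is only a sketch: FEATURES, build_tokenizer, and the ellipsis pattern are hypothetical names and choices, not part of the original answer.

```python
import re

# Hypothetical feature table: one regex per multi-character token type.
# Earlier entries win when two patterns could match at the same position.
FEATURES = {
    'tag': r'\*[^*]*\*',
    'ellipsis': r'\.\.\.',
}

def build_tokenizer(features):
    """Compile one pattern: each feature in turn, then any single character."""
    # Non-capturing groups keep findall returning whole matches.
    alternatives = [f'(?:{p})' for p in features.values()]
    alternatives.append('.')  # fallback: one character at a time
    return re.compile('|'.join(alternatives), re.S)

tokenizer = build_tokenizer(FEATURES)
tokens = tokenizer.findall('Wait... *tag1* ok')
print(tokens)
# ['W', 'a', 'i', 't', '...', ' ', '*tag1*', ' ', 'o', 'k']
```

Adding a new feature then means adding one entry to the table. The feature patterns should not contain capturing groups of their own, because re.findall would return the group contents instead of the full match.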

I'm not sure exactly what would be best for you, but you should be able to use the split() method or the .format() method showcased below to get what you want.

# you can use this to get what you need
txt = 'This is *tag1* a test *tag2*.'
x = txt.split("*")  # splits at each *
print(x)
y = txt.split()  # splits the words at whitespace
print(y)

# also, you may be looking for something like this to format a string
mystring = 'This is {} a test {}.'.format('*tag1*', '*tag2*')
print(mystring)


# using split to get ['T', 'h', 'i', 's', ' ', 'i', 's', ' ', '*tag1*', ' ', 'a', ' ', 't', 'e', 's', 't', ' ', '*tag2*', '.']
txt = 'This is *tag1* a test *tag2*.'
pieces = txt.split("*")  # ['This is ', 'tag1', ' a test ', 'tag2', '.']

finallist = []  # initialize the list
for i, piece in enumerate(pieces):
    if i % 2 == 1:
        # odd-indexed pieces sat between a pair of asterisks, so they are tags;
        # split() removed the asterisks, so put them back
        # (this assumes the asterisks in the input are balanced)
        finallist.append('*' + piece + '*')
    else:
        # everything else is plain text: add it one character at a time
        finallist.extend(piece)

print(finallist)
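
A related option, sketched here under the assumption that tags always look like *name*, is re.split with a capturing group. Unlike str.split, it keeps the delimiters in the result, so the tags do not need to be rebuilt by hand. The variable names are made up for illustration.

```python
import re

txt = 'This is *tag1* a test *tag2*.'

# The capturing group makes re.split return the tags along with the text,
# alternating: text, tag, text, tag, text.
pieces = re.split(r'(\*[^*]*\*)', txt)
# pieces == ['This is ', '*tag1*', ' a test ', '*tag2*', '.']

tokens = []
for i, piece in enumerate(pieces):
    if i % 2 == 1:
        tokens.append(piece)   # odd positions hold the captured tags
    else:
        tokens.extend(piece)   # even positions hold plain text, split per character

print(tokens)
```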

