Splitting a Python string using re on Unicode and text

I'm trying to split a long string on both regular-text and Unicode (Chinese) punctuation. How do I do this?

def split1(s):
    temp1 = re.split(r"(;|:|•|。|;|:)", s)
    temp = re.split(u"([\u3002|\uFF01|\uFF1F])", temp1)
    i = iter(temp)

UPDATE: I'm hoping to split the string s on both regular-text and Unicode punctuation characters.

You may use

import re

def split1(s):
    return re.split(ur"([\u3002\uFF01\uFF1F;:•。;:])", s)

There is no need for two separate patterns: their only purpose is to tokenize the string into the characters that match the regex and the text that does not, so a single character class does the job. Note also that re.split returns a list, so its result cannot be passed straight to a second re.split call.
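As a rough illustration of the point above, here is a minimal sketch, assuming Python 3 (where plain string literals are already Unicode). The sample string and the variable names are made up for the example; the punctuation set is a subset of the one in the answer.

```python
import re

s = "a;b\u3002c"  # hypothetical sample: ASCII ';' and ideographic full stop

# The first pass returns a list, not a string.
first = re.split(r"([;:\u2022])", s)

# Passing that list to re.split a second time raises TypeError.
try:
    re.split(r"([\u3002\uFF01\uFF1F])", first)
    failed = False
except TypeError:
    failed = True

# One character class handles both punctuation sets in a single pass.
combined = re.split(r"([\u3002\uFF01\uFF1F;:\u2022])", s)
```

Here `first` still contains the unsplit piece `"b\u3002c"`, while `combined` separates every delimiter.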

The matched delimiters will also be part of the resulting list, since the whole pattern is wrapped in a capturing group; see the re.split docs:

If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list
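The effect of the capturing group can be seen in a small sketch, again assuming Python 3; the sample text and variable names are invented for illustration.

```python
import re

# Hypothetical sample ending in a fullwidth exclamation mark (U+FF01).
text = "你好\u3002再见\uFF01"

# With a capturing group, the delimiters are kept in the result.
with_group = re.split(r"([\u3002\uFF01])", text)

# Without the group, the delimiters are dropped.
without_group = re.split(r"[\u3002\uFF01]", text)
```

In both cases the list ends with an empty string, because the text ends with a delimiter.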

Note the u prefix, too: it tells Python 2.x to treat the pattern as a Unicode string and handle the \uXXXX escapes correctly.
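For readers on Python 3, a possible port is sketched below. It is an assumption on my part rather than part of the original answer: in Python 3 every str is Unicode and the ur"" prefix is a syntax error, so a plain raw string is used and the re module interprets the \uXXXX escapes itself. The PUNCTUATION name is hypothetical.

```python
import re

# Python 3 sketch: no u prefix needed, str is already Unicode.
# \u2022 is the bullet character from the original pattern.
PUNCTUATION = re.compile(r"([\u3002\uFF01\uFF1F;:\u2022])")

def split1(s):
    # Split on any listed punctuation mark, keeping the marks in the result.
    return PUNCTUATION.split(s)
```

Calling `split1("x\uFF1Fy")` would split around the fullwidth question mark and keep it as its own list element.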
