
Regex tokenizer: period vs. ellipsis

I want to split strings at sentence-ending punctuation marks (".", "?", "!") but not at an ellipsis ("..."). Note that the ellipses in the text I'm dealing with are three periods, not a dedicated Unicode character.

Currently what I do is:

tokenizer = nltk.RegexpTokenizer(r"[?.!]+", gaps=True)

But this still splits the string at "...". I do want to keep splitting at "!!!" or "??", just not at multiple consecutive instances of ".". What's the easiest way to distinguish between the ellipsis and the period if I want to use RegexpTokenizer?

Something like this could work: [?!]+|(?<!\.)\.{1,2}(?!\.)

We match either any non-zero number of "?" and "!" characters, or 1 or 2 dots that are neither preceded nor followed by another dot.
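To see how that pattern behaves, here is a minimal sketch using only the standard `re` module. It assumes the pattern is meant to mark the gaps between sentences, which is how `nltk.RegexpTokenizer(pattern, gaps=True)` would use it; the helper name `split_sentences` and the sample text are made up for illustration.

```python
import re

# Runs of "?" / "!", or one or two dots that are neither preceded
# nor followed by another dot, so "..." is never treated as a separator.
SENTENCE_END = re.compile(r"[?!]+|(?<!\.)\.{1,2}(?!\.)")

def split_sentences(text):
    """Split text at sentence-ending punctuation, leaving ellipses intact."""
    # Hypothetical helper: split on the separator pattern, then drop
    # empty pieces and trim surrounding whitespace.
    return [piece.strip() for piece in SENTENCE_END.split(text) if piece.strip()]

sample = "Wait... what happened?? I don't know. Run!!!"
print(split_sentences(sample))
```

The pattern contains no capturing groups, so the split result holds only the text between separators. Passing the same pattern string to RegexpTokenizer with gaps=True should give comparable pieces, though whitespace is not trimmed there.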

Note, though, that lookbehind and lookahead can hurt performance.
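If the lookarounds turn out to be a concern, one possible alternative is to match the tokens themselves instead of the gaps. The sketch below is an assumption, not part of the original answer: it treats a run of three or more dots as part of a token and everything else in ".?!" as a separator. The pattern and the helper name `find_sentences` are hypothetical, and whether this is actually faster would need to be measured on real data.

```python
import re

# Hypothetical lookaround-free pattern: a token is a run of
# non-terminator characters and/or ellipses made of 3 or more dots.
TOKEN = re.compile(r"(?:[^.?!]|\.{3,})+")

def find_sentences(text):
    """Return sentence-like tokens, keeping runs of 3+ dots inside a token."""
    # findall returns whole matches here because the group is non-capturing.
    return [m.strip() for m in TOKEN.findall(text) if m.strip()]

print(find_sentences("Wait... what happened?? I don't know. Run!!!"))
```

With RegexpTokenizer, a token-matching pattern like this would be used with the default gaps=False rather than gaps=True.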

BTW, I found this site, https://pythex.org, handy for checking Python regexes.
