
Regex tokenizer: period vs. ellipsis

Asked 2022-01-28 21:35:34. Tags: python, nltk

I want to split strings at sentence-ending punctuation marks (".", "?", "!"), excluding the ellipsis ("..."). Note that the ellipses in the text I'm dealing with are three periods, not a dedicated Unicode character.

Currently, what I do is:

import nltk
tokenizer = nltk.RegexpTokenizer(r"[?.!]+", gaps=True)

But this still splits the string at "...". However, I want to keep splitting at "!!!" or "??", just not at multiple consecutive instances of ".". What's the easiest way to distinguish between the ellipsis and the period if I want to use RegexpTokenizer?
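To make the problem concrete, here is a minimal sketch that reproduces it with the standard re module instead of nltk. It assumes that a gaps=True tokenizer behaves like re.split with empty pieces dropped, and that the character class is meant to contain "?", "." and "!". The helper split_on and the sample sentence are hypothetical, made up for illustration.

```python
import re

# Assumed form of the question's pattern: any run of ?, . or ! is a gap.
SPLIT_ANYWHERE = re.compile(r"[?.!]+")


def split_on(pattern, text):
    """Split text at every match of pattern and drop empty pieces."""
    return [piece for piece in pattern.split(text) if piece]


sample = "Wait... what?? No way!!! Fine."
pieces = split_on(SPLIT_ANYWHERE, sample)
print(pieces)
# ['Wait', ' what', ' No way', ' Fine']
```

The ellipsis after "Wait" is treated like any other run of periods, so the first sentence is cut in two, which is the behavior the question wants to avoid.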

1 Answer

Something like this could work: [?!]+|(?<!\.)\.{1,2}(?!\.)

We match either any non-zero number of ? and ! characters, or 1 or 2 dots that are neither preceded nor followed by a dot. A lone period (or a pair) therefore still splits, while a run of three or more periods, such as an ellipsis, never matches.
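The description above can be checked with a short sketch. It reads the pattern as [?!]+|(?<!\.)\.{1,2}(?!\.), which is what the explanation describes, and again uses re.split as a stand-in for a gaps=True tokenizer (an assumption). The function name split_sentences and the sample text are hypothetical.

```python
import re

# Runs of ? or !, or one or two dots that have no dot on either side.
SENTENCE_END = re.compile(r"[?!]+|(?<!\.)\.{1,2}(?!\.)")


def split_sentences(text):
    """Split text at sentence-ending punctuation, leaving '...' intact."""
    return [piece for piece in SENTENCE_END.split(text) if piece]


sample = "Wait... what?? No way!!! Fine."
print(split_sentences(sample))
# ['Wait... what', ' No way', ' Fine']
```

With nltk installed, the same pattern string should be usable as nltk.RegexpTokenizer(pattern, gaps=True). Leading spaces are kept in the pieces here; strip them afterwards if they are unwanted.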

But lookbehind and lookahead assertions can slow matching down.
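If the lookarounds are a concern, one possible alternative is to match the sentences themselves rather than the gaps between them. This is a swapped-in technique, not something the answer proposes: a token is any run of characters that are not ".", "?" or "!", where a run of three or more dots is also allowed inside the token. The names TOKEN and find_sentences are hypothetical, and whether this is actually faster has not been measured here.

```python
import re

# A token is a run of non-terminators; '...' (3+ dots) may appear inside it.
TOKEN = re.compile(r"(?:[^.?!]|\.{3,})+")


def find_sentences(text):
    """Return the sentence-like tokens, skipping the terminators."""
    return TOKEN.findall(text)


sample = "Wait... what?? No way!!! Fine."
print(find_sentences(sample))
# ['Wait... what', ' No way', ' Fine']
```

With RegexpTokenizer, a token-matching pattern like this would be used with the default gaps=False rather than gaps=True.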

BTW, I found this site, https://pythex.org, for checking Python regexes.


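Notice: technical posts on this site are licensed under CC BY-SA 4.0. If you reproduce this post, please cite this site's URL or the original address. For any questions, contact yoyou2525@163.com.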

