Regex for extracting words before punctuation

I'm trying to extract phrases that occur before a punctuation mark, where each phrase consists of capitalized words.

Abstract Algebra. the area of modern mathematics that considers algebraic structures to be sets with operations defined on them, and extends algebraic concepts usually associated with the real number system to other more general systems, such as groups, rings, fields, modules and vector spaces.

Algebra. a branch of mathematics that uses symbols or letters to represent variables, values or numbers, which can then be used to express operations and relationships and to solve equations.

Algebraic Expression. a combination of numbers and letters equivalent to a phrase in language, eg x2 + 3x - 4.

Analytic (Cartesian) Geometry: the study of geometry using a coordinate system and the principles of algebra and analysis, thus defining geometrical shapes in a numerical way and extracting numerical information from that representation.

Inductive reasoning or logic: a type of reasoning that involves moving from a set of specific facts to a general conclusion, indicating some degree of support for the conclusion without actually ensuring its truth.

Currently I'm using the following regex:

(([? ])([A-Z][a-z\s]+)?([A-Z][a-z\s]+?[.:]))

I have two issues with this.

  1. I think this is not the optimal way of writing it.
  2. It's not capturing the phrases that contain more than two words.

Try ^[A-Z][^.,:';]+

Explanation:

^ - beginning of a line

[A-Z] - a single uppercase character

[^.,:';]+ - one or more characters other than . , : ' ;
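
As a quick check of this pattern, here is a minimal Python sketch. The sample text is abridged from the question, and the names TEXT and LINE_START_PHRASE are made up for illustration. It assumes Python's re module with the re.MULTILINE flag, so that ^ anchors at the start of each line rather than only at the start of the string.

```python
import re

# Sample lines abridged from the question (definitions shortened).
TEXT = """\
Abstract Algebra. the area of modern mathematics
Algebra. a branch of mathematics
Analytic (Cartesian) Geometry: the study of geometry
Inductive reasoning or logic: a type of reasoning"""

# re.MULTILINE makes ^ match at the start of every line.
LINE_START_PHRASE = re.compile(r"^[A-Z][^.,:';]+", re.MULTILINE)

phrases = LINE_START_PHRASE.findall(TEXT)
for phrase in phrases:
    print(phrase)
```

Note that this pattern only checks the first character, so it also returns "Inductive reasoning or logic", where the later words are not capitalized. The negated class matches newlines too, so a line without any of the listed punctuation would run on into the next line.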


One reason the pattern does not match more than one word for the current data is that it starts with [? ], which must match either a space or a question mark. A phrase at the very start of a line has neither before it, so the match can only begin at a later space.

You could also drop some of the capturing groups and use a single one. Note that you don't have to make [a-z\s]+?[.:] non-greedy with ? because the character class does not contain . or :, so the quantifier cannot run past the punctuation.
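
Both points can be checked with a short Python sketch. The names ORIGINAL, LAZY, GREEDY and line are illustrative only, and the sample line is abridged from the question.

```python
import re

# The pattern from the question, unchanged.
ORIGINAL = re.compile(r"(([? ])([A-Z][a-z\s]+)?([A-Z][a-z\s]+?[.:]))")

line = "Abstract Algebra. the area of modern mathematics"

# The leading [? ] forces the match to begin at a space, so "Abstract"
# at the start of the line is skipped.
first = ORIGINAL.search(line)
print(repr(first.group(1)))

# The lazy and greedy forms of the last part find the same text, because
# [a-z\s] can never consume "." or ":".
LAZY = re.compile(r"[A-Z][a-z\s]+?[.:]")
GREEDY = re.compile(r"[A-Z][a-z\s]+[.:]")
print(LAZY.findall(line) == GREEDY.findall(line))
```

The first print shows that only " Algebra." (with its leading space) is matched, which is the behaviour described in the question.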

To get the capitalized words followed by either . or : you could use:

\b([A-Z][a-z]+(?:\s+[A-Z][a-z]+)*)[.:]

Explanation

  • \b Word boundary
  • ( Capture group 1
    • [A-Z][a-z]+ Match an uppercase char A-Z followed by 1+ lowercase chars a-z
    • (?:\s+[A-Z][a-z]+)* Repeat 0+ times: 1+ whitespace chars, then an uppercase char A-Z and 1+ lowercase chars a-z
  • ) Close group
  • [.:] Match either . or :
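
A minimal Python sketch of this pattern applied to abridged sample lines from the question follows. CAPITALIZED_PHRASE and TEXT are hypothetical names chosen for the example. Because the pattern has exactly one capture group, re.findall returns the contents of group 1, that is, the phrase without the trailing punctuation.

```python
import re

# Capitalized words separated by whitespace, followed by "." or ":".
CAPITALIZED_PHRASE = re.compile(r"\b([A-Z][a-z]+(?:\s+[A-Z][a-z]+)*)[.:]")

# Sample lines abridged from the question (definitions shortened).
TEXT = """\
Abstract Algebra. the area of modern mathematics
Algebra. a branch of mathematics
Algebraic Expression. a combination of numbers and letters
Analytic (Cartesian) Geometry: the study of geometry
Inductive reasoning or logic: a type of reasoning"""

# With a single capture group, findall returns group 1 for each match.
phrases = CAPITALIZED_PHRASE.findall(TEXT)
for phrase in phrases:
    print(phrase)
```

On this sample, "Analytic (Cartesian) Geometry" yields only "Geometry" because the parenthesized word interrupts the run of capitalized words, and "Inductive reasoning or logic" is not matched at all because its later words are lowercase.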


If you also want to match words surrounded by ( and ) you might use an alternation.

\b((?:\([A-Z][a-z]+\)|[A-Z][a-z]+)(?:\s+(?:\([A-Z][a-z]+\)|[A-Z][a-z]+))*)[.:]
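
To make the structure of this longer pattern easier to see, the sketch below builds it from a reusable WORD fragment in Python. The names WORD and PHRASE_WITH_PARENS are assumptions for illustration, and the assembled string is the same pattern as above.

```python
import re

# One "word": either a parenthesized capitalized word or a bare one.
WORD = r"(?:\([A-Z][a-z]+\)|[A-Z][a-z]+)"

# A word, then 0+ further words separated by whitespace, then "." or ":".
PHRASE_WITH_PARENS = re.compile(rf"\b({WORD}(?:\s+{WORD})*)[.:]")

samples = [
    "Analytic (Cartesian) Geometry: the study of geometry",
    "Abstract Algebra. the area of modern mathematics",
]

results = [PHRASE_WITH_PARENS.findall(s) for s in samples]
for found in results:
    print(found)
```

One caveat: because of the leading \b, a phrase that itself begins with a parenthesized word, such as "(Cartesian) Geometry:", is not matched from the opening parenthesis, and only "Geometry" is captured.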


