
Regex to select the words that match brackets

I need to extract the full forms of acronyms with a regex in JavaScript.

I have tried with

(\w+\s*[^a-z^A-Z]*){3}\s*\([A-Z]*\)

but the extraction fails when there are full forms like this one:

Most common mis'take (MCM) (only the bold part is selected)
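The failure can be reproduced by running the attempted pattern against that phrase alone. This is a minimal sketch; the variable names are only illustrative. The apostrophe is not a word character, so mis'take is consumed as two repetitions of the group, which leaves room for only one earlier word.

```javascript
// The pattern from the question, unchanged.
const attempted = /(\w+\s*[^a-z^A-Z]*){3}\s*\([A-Z]*\)/;

const phrase = "Most common mis'take (MCM)";
const found = phrase.match(attempted);

// "mis'" and "take " each use up one repetition of the group,
// so only "common " fits before them and "Most" is left out.
console.log(found[0]); // "common mis'take (MCM)"
```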

Below is the text for testing:

The task of automatically extracting acronymdefinition pairs from biomedical literature has

Most common mis'take (MCM) been studied, almost exclusively for English, over the past few decades using technologies from Natural Language Processing (NLP). This section 167 presents a few approaches and techniques that were applied to the acronym identification task. Taghva and Gilbreth (1999) present the Acronyms 7'- $ **** Finding Program (AFP)

, based on pattern matching. Their program seeks for acronym candidates which appear as upper case words. They calculate a heuristic score for each competing definition by classifying words into: (1) stop words (”the”, ”of”, ”and”), (2) hyphenated words (3) normal words (words that don't fall into any of the above categories) and (4) the acronyms themselves (since an acronym can sometimes be a part of the definition). The AFP utilizes the Longest Common Subsequence (LCS) algorithm (Hunt and Szymanski, 1977) to find all possible alignments of the acronym to the text, followed by simple scoring rules which are based on matches. The performance reported from their experiment are: recall of 86% at precision of 98%

Instead of repeating the group 3 times, you could use 3 capturing groups, each capturing the first letter of a word, and then use backreferences to those groups between the parentheses.

\b(\w)[\w']*[^a-zA-Z()]* (\w)[\w']*[^a-zA-Z()]* (\w)[\w']*[^a-zA-Z()]* \(\1\2\3\)
  • \b Word boundary
  • (\w) Capture a single word char in group 1
  • [\w']* Match 0+ times a word char or '
  • [^a-zA-Z()]* Match 0+ times any char except the listed, then match a space
  • (\w)[\w']*[^a-zA-Z()]* Same as above, capturing in group 2
  • (\w)[\w']*[^a-zA-Z()]* Same as above, capturing in group 3
  • \(\1\2\3\) Between literal parentheses, match the 3 backreferences to the capturing groups
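As a rough sketch of how this pattern could be used from JavaScript: the helper name extractFullForms and the g and i flags are assumptions, not part of the pattern itself. The i flag matters here because the captured initials of "common" and "mis'take" are lowercase while the acronym is uppercase, and in JavaScript the backreferences only compare case-insensitively when that flag is set.

```javascript
// Pattern from the answer; the g and i flags are assumptions of this sketch.
// With i, the backreferences \1\2\3 also compare the initials case-insensitively.
const backrefPattern =
  /\b(\w)[\w']*[^a-zA-Z()]* (\w)[\w']*[^a-zA-Z()]* (\w)[\w']*[^a-zA-Z()]* \(\1\2\3\)/gi;

// Hypothetical helper: returns every "full form (ACRONYM)" match in the text.
function extractFullForms(text) {
  return text.match(backrefPattern) ?? [];
}

const sampleText =
  "Most common mis'take (MCM) been studied using Natural Language Processing (NLP).";

console.log(extractFullForms(sampleText));
// [ "Most common mis'take (MCM)", "Natural Language Processing (NLP)" ]
```

Because the acronym must spell out the three captured initials, a phrase such as "one two three (XYZ)" is not matched by this version.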



You could also update your own pattern by adding ' to the character class and repeating it 0+ times: [\w']*

You can extend the character class with any other characters you want to allow.

\b(?:\w[\w']* [^a-zA-Z]*){3} ?\([A-Z]{3}\)
  • \b Word boundary
  • (?: Non capture group
    • \w[\w']* Match a word char, then 0+ times a word char or ', then match a space
    • [^a-zA-Z]* Match 0+ times any char except a-zA-Z
  • ){3} ? Close the group, repeat it 3 times, then match an optional space
  • \([A-Z]{3}\) Match 3 uppercase chars A-Z between parentheses
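A small sketch of this second pattern in use, assuming the g flag and some illustrative input strings that are not from the question. It shows the trade-off of this variant: it only requires three words followed by three uppercase letters in parentheses, so it does not check that the letters are the words' initials.

```javascript
// Pattern from the answer; the g flag is an assumption of this sketch.
const threeWordPattern = /\b(?:\w[\w']* [^a-zA-Z]*){3} ?\([A-Z]{3}\)/g;

const inputs = [
  "Most common mis'take (MCM) been studied",
  "technologies from Natural Language Processing (NLP).",
  "one two three (XYZ)", // accepted even though the initials differ
];

// One array of matches per input string.
const results = inputs.map((s) => s.match(threeWordPattern) ?? []);
console.log(results);
// [ [ "Most common mis'take (MCM)" ],
//   [ "Natural Language Processing (NLP)" ],
//   [ "one two three (XYZ)" ] ]
```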

