
How to deal with compound words in regex

I am writing regexes that extract the definitions of abbreviations from a text. I have handled a number of cases, but I cannot find a solution for the case where the abbreviation has a different number of letters than its definition has words, for example because one of the words is a compound, as below.

string = 'CRC comes from the words colorectal cancer'

I would like to get 'colorectal cancer' based on its short form. Do you have any advice on what steps I should take? I thought of splitting compound words, but that would lead to other problems.

In CRC, the first word should begin with C. The next word could begin with either R or C. If the second word begins with R, the third word should begin with C, or there is no third word at all. At the same time, you should check whether the second word begins with C; if so, you don't need to check for a third word. An OR condition (alternation) in the regex may help here. Without more data samples, I cannot pinpoint exactly how.
