Regex: Extract and Match Specific Words Between Two Characters

I need to extract from a string every phrase that contains one of the words way, road, str or street, together with every word before and after it, up to the nearest comma ',' or a number in front.

Sample strings:
1. Yeet Road, Off Mandy Plant Way, Mando GRA.
2. 3A, Sleek Drive, Off Tremble Rake Street.
3. 57 Radish Slist Road Ikoyi

Result should be as close as possible to:

  1. Yeet Road
  2. Mandy Plant Way
  3. Tremble Rake Street
  4. Radish Slist Road Ikoyi

Based on some Stack Overflow answers, this is what I currently have:
(?<=\,)(.*Way|Road|Str|Street?)(?=\,)

Any help would be appreciated.

You can try something like this (with the ignore-case flag):

\b(?:(?!off\b)[a-z]+[^\w,\n]+)*?\b(?:way|road|str(?:eet)?)\b(?:[^\w,\n]+[a-z]+)*

demo
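As a quick check, the pattern can be run against the three sample strings from the question. The following is only a minimal sketch in Python, using the standard `re` module with `re.IGNORECASE` as the ignore-case flag; the answer does not prescribe a language, and the names `STREET_RE` and `extract_streets` are made up for illustration.

```python
import re

# The pattern from the answer, split over three lines for readability.
STREET_RE = re.compile(
    r"\b(?:(?!off\b)[a-z]+[^\w,\n]+)*?"  # optional leading words, never "off"
    r"\b(?:way|road|str(?:eet)?)\b"       # one of the keywords as a whole word
    r"(?:[^\w,\n]+[a-z]+)*",              # optional trailing words, stopping at a comma
    re.IGNORECASE,
)


def extract_streets(text):
    """Return every street-like phrase found in text (hypothetical helper)."""
    return [m.group(0) for m in STREET_RE.finditer(text)]


samples = [
    "Yeet Road, Off Mandy Plant Way, Mando GRA.",
    "3A, Sleek Drive, Off Tremble Rake Street.",
    "57 Radish Slist Road Ikoyi",
]

for sample in samples:
    print(extract_streets(sample))
```

On these three samples the sketch yields the four phrases listed in the question. Leading tokens such as `57` or `3A` are left out because `[a-z]+` only accepts letters.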

However, patterns of this kind, which have to describe an undefined substring of undefined length before the literal parts of the pattern (the keywords), are not efficient. This doesn't matter for small strings, but they become impractical on a large string.

To exclude particular words you can change (?!off\b) to (?!off\b|word1\b|word2\b|...)
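If the list of excluded words changes often, the lookahead can be generated rather than edited by hand. A minimal sketch in Python follows; `build_street_regex` and its `excluded` parameter are hypothetical names, and `(?!(?:off|opposite)\b)` is used as an equivalent shorthand for `(?!off\b|opposite\b)`.

```python
import re


def build_street_regex(excluded=("off",)):
    """Compile the pattern with a negative lookahead built from `excluded`."""
    # Escape each word so it is matched literally, then join into one alternation.
    banned = "|".join(re.escape(word) for word in excluded)
    # With no excluded words, drop the lookahead entirely.
    lookahead = rf"(?!(?:{banned})\b)" if excluded else ""
    return re.compile(
        rf"\b(?:{lookahead}[a-z]+[^\w,\n]+)*?"
        r"\b(?:way|road|str(?:eet)?)\b"
        r"(?:[^\w,\n]+[a-z]+)*",
        re.IGNORECASE,
    )


text = "Opposite Mandy Plant Way, Off Yeet Road"
default_re = build_street_regex()
stricter_re = build_street_regex(("off", "opposite"))

print([m.group(0) for m in default_re.finditer(text)])
print([m.group(0) for m in stricter_re.finditer(text)])
```

With only `off` excluded the first phrase keeps its leading "Opposite"; adding `opposite` to the list trims it to "Mandy Plant Way".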

Also, you need to be more precise about which characters are allowed between words and which are not.

You may consider using the following (it relies on the (*SKIP)(*F) verbs, so it needs a PCRE-compatible engine):

^\d+\s*(*SKIP)(*F)|\b[^,]*\b(?:way|r(?:oa)?d|str(?:eet)?)\b[^,]*\b

See the regex demo

Details:

  • ^\d+\s*(*SKIP)(*F) - matches 1 or more digits followed by 0+ whitespace characters at the start of the string, then discards that match and skips past it
  • | - or matches...
  • \b[^,]*\b(?:way|r(?:oa)?d|str(?:eet)?)\b[^,]*\b - any 0+ chars other than a comma, then any of the alternatives in the non-capturing group as a whole word, and then again 0+ chars other than a comma; the whole subpattern is wrapped in word boundaries to avoid matching leading/trailing punctuation or whitespace.
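Python's standard `re` module does not support the `(*SKIP)(*F)` verbs, so the pattern cannot be used there as written. A common substitute, sketched below, is to match the unwanted prefix in one alternative and capture the wanted text in another, then keep only the captured group. This is a swapped-in technique rather than the answer's own method, and `ADDRESS_RE` and `extract_addresses` are hypothetical names.

```python
import re

ADDRESS_RE = re.compile(
    r"^\d+\s*"  # leading number and spaces: matched but not captured
    r"|(\b[^,]*\b(?:way|r(?:oa)?d|str(?:eet)?)\b[^,]*\b)",  # wanted phrase: group 1
    re.IGNORECASE,
)


def extract_addresses(text):
    """Return the comma-delimited chunks that contain a street keyword."""
    # Matches of the first alternative leave group 1 unset, so they are dropped.
    return [m.group(1) for m in ADDRESS_RE.finditer(text) if m.group(1)]


samples = [
    "Yeet Road, Off Mandy Plant Way, Mando GRA.",
    "3A, Sleek Drive, Off Tremble Rake Street.",
    "57 Radish Slist Road Ikoyi",
]

for sample in samples:
    print(extract_addresses(sample))
```

Unlike the lookahead-based pattern, this one has no rule for "Off", so that word stays in the extracted phrases ("Off Mandy Plant Way", "Off Tremble Rake Street").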
