Regular expression with connected words

I have been working on Python code that uses a regex to extract document IDs from text documents, where an ID can appear on any line of the text.

A document ID consists of four letters, followed by a hyphen, followed by three digits, optionally ending in a letter. For example, each of the following is a valid document ID:

  1. ABCD-123
  2. ABCD-123V
  3. XKCD-999
  4. COMP-200

I have tried the following regular expression to find all IDs:

import re
ids = re.findall(r"([A-Z]{4})(-)([0-9]{3})([A-Z](?![A-Za-z]))?", text.read())  # result renamed so it no longer shadows the re module

This expression works in most cases, but it fails when an ID is directly connected to a word. For example, XKCD-999James returns XKCD-999, which is correct, but XKCD-999KEight also returns XKCD-999, whereas the correct answer is XKCD-999K.
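The behaviour described above can be reproduced with a short script. This is only a sketch: the pattern is the one from the question, while the helper name `find_ids` and the use of `finditer` with `group(0)` (to get the whole match instead of the tuple of groups that `findall` returns) are my own additions for illustration.

```python
import re

# Pattern from the question, unchanged.
PATTERN = re.compile(r"([A-Z]{4})(-)([0-9]{3})([A-Z](?![A-Za-z]))?")

def find_ids(text):
    """Return the full text of every match (hypothetical helper)."""
    # group(0) is the whole match; findall would return tuples of groups.
    return [m.group(0) for m in PATTERN.finditer(text)]

for sample in ["ABCD-123V", "XKCD-999James", "XKCD-999KEight"]:
    print(sample, "->", find_ids(sample))
```

The last sample shows the problem: the trailing K is dropped because the negative lookahead rejects any letter that follows it, including the uppercase E.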


So basically I need a way to separate an ID's optional trailing letter from a word attached directly to the ID.

What is the correct approach to this problem?
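One possible direction, offered as a sketch rather than a definitive answer, is to narrow the negative lookahead so that it rejects only a following lowercase letter. This relies on an assumption that is not stated in the question: any word attached to an ID is capitalized, meaning an uppercase letter followed by lowercase letters, as in James or Eight. The names `ID_PATTERN` and `extract_ids` are hypothetical.

```python
import re

# Assumption: an attached word looks like "James" or "Eight"
# (one uppercase letter, then lowercase letters).
# The optional trailing letter is accepted unless a lowercase letter follows it,
# in which case it is treated as the first letter of the attached word.
ID_PATTERN = re.compile(r"[A-Z]{4}-[0-9]{3}(?:[A-Z](?![a-z]))?")

def extract_ids(text):
    """Return all document IDs found in text (hypothetical helper)."""
    # The pattern has no capturing groups, so findall returns whole matches.
    return ID_PATTERN.findall(text)

print(extract_ids("XKCD-999James and XKCD-999KEight, ABCD-123V COMP-200"))
```

Under that assumption, XKCD-999James yields XKCD-999 and XKCD-999KEight yields XKCD-999K. An ID followed by an all-uppercase word (for example XKCD-999EIGHT) stays ambiguous, and this pattern would treat the first letter of that word as part of the ID.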
