Regular expression with connected words

Question

I have been working on Python code that uses regex to extract document IDs from text documents, where an ID can appear on any line of the text.

A document ID consists of four letters followed by a hyphen, followed by three digits, optionally ending in a letter. For example, each of the following is a valid document ID:

ABCD-123
ABCD-123V
XKCD-999
COMP-200

I have tried the following regular expression to find all IDs:

import re
ids = re.findall(r"([A-Z]{4})(-)([0-9]{3})([A-Z](?![A-Za-z]))?", text.read())

This expression works in most cases, but I have a problem when an ID is directly attached to a word. For example, XKCD-999James returns XKCD-999, which is correct, but XKCD-999KEight also returns XKCD-999, while the correct answer is XKCD-999K.

So basically I need a way to separate an ID from any letters of a word attached to it.

What is the correct approach to this problem?

Answer 1
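
One possible approach, sketched under an assumption the question only implies: a word attached to an ID starts with a capital letter followed by lowercase letters (as in "James" or "Eight"). In that case the optional suffix letter only needs to be rejected when the next character is a lowercase letter, so the lookahead can be narrowed from (?![A-Za-z]) to (?![a-z]). The names DOC_ID and find_doc_ids are made up for this illustration.

```python
import re

# Assumption: a word attached to an ID looks like "James" or "Eight"
# (one capital letter, then lowercase letters). The optional suffix letter
# is therefore accepted unless it is immediately followed by a lowercase letter.
DOC_ID = re.compile(r"[A-Z]{4}-[0-9]{3}(?:[A-Z](?![a-z]))?")


def find_doc_ids(text):
    """Return every document ID found in text, in order of appearance."""
    # The pattern has no capturing groups, so findall returns whole matches.
    return DOC_ID.findall(text)


sample = "XKCD-999James ABCD-123V XKCD-999KEight COMP-200"
print(find_doc_ids(sample))
# ['XKCD-999', 'ABCD-123V', 'XKCD-999K', 'COMP-200']
```

This cannot resolve every case. If an ID is followed by an all-uppercase word (for example XKCD-999KING), the first letter of that word would be taken as the suffix, and nothing in the text alone says whether that is right.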


Solution 1
0 2021-03-29 17:31:07
