Detect specific pattern regex python

I want to find all the occurrences of a specific term (and its variations) in a Word document.

  1. Extract the text from the Word document
  2. Find the pattern via regex

The pattern consists of words that start with DOC- followed by 9 digits after the dash.

I have tried the following without success:

The document variable is the text extracted with the following function:

import re
import docx

def getText(filename):
    doc = docx.Document(filename)
    fullText = []
    for para in doc.paragraphs:
        fullText.append(para.text)
    return '\n'.join(fullText)

  1. pattern = re.compile('^DOC.\\d{9}$')
  2. pattern.findall(document)

Can someone help me?

Thanks in advance.

You can combine a word boundary on the left with a numeric boundary on the right.

Also, you say there must be a dash after DOC, but you use a . in the pattern, which matches any character. I believe you also wanted to match any en- or em-dash, so I'd suggest using a more precise pattern, like [-–—]. Note there are other ways to match any Unicode dash char, see "Searching for all Unicode variation of hyphens in Python".
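As an illustration of the "other ways" remark, here is a minimal sketch that builds the dash character class from the Unicode "Pd" (dash punctuation) category instead of listing three characters by hand. This is one possible approach and not necessarily the one described in the linked question; the names dashes and pattern are only illustrative.

```python
import re
import sys
import unicodedata

# Collect every code point in the Unicode "Pd" (dash punctuation) category.
# This includes the ASCII hyphen-minus, the en-dash and the em-dash.
dashes = ''.join(
    chr(cp) for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)) == 'Pd'
)

# re.escape keeps "-" literal inside the character class.
pattern = re.compile(r'\bDOC[' + re.escape(dashes) + r']\d{9}(?!\d)')

print(pattern.findall('DOC\u2014123456789 DOC-000000001 DOC_123456789'))
```

The underscore variant is not matched, because "_" belongs to a different Unicode category. Scanning the full code point range takes a moment, so the class is best built once and reused.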

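import re
import docx

def getText(filename):
    # Join the text of every paragraph in the .docx file, one paragraph per line.
    doc = docx.Document(filename)
    fullText = []
    for para in doc.paragraphs:
        fullText.append(para.text)
    return '\n'.join(fullText)

# filename is the path to the Word document.
print(re.findall(r'\bDOC[-–—]\d{9}(?!\d)', getText(filename)))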

Details:

  • \b - a word boundary
  • DOC - the DOC substring
  • [-–—] - a dash symbol (hyphen, en-dash or em-dash)
  • \d{9} - nine digits
  • (?!\d) - a negative lookahead: immediately to the right of the current location, there must be no digit.
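To see these pieces at work without a .docx file, here is a small self-contained check. The sample string is made up and stands in for the output of getText(); the pattern from the question is included for comparison.

```python
import re

# Pattern from the answer: DOC, one dash variant, exactly nine digits.
PATTERN = re.compile(r'\bDOC[-–—]\d{9}(?!\d)')

# Pattern from the question, kept for comparison.
ORIGINAL = re.compile('^DOC.\\d{9}$')

# Hypothetical sample text standing in for the output of getText().
sample = (
    "See DOC-123456789 and DOC–987654321 for details.\n"
    "DOC-1234567890 has ten digits, XDOC-111111111 has a prefix."
)

matches = PATTERN.findall(sample)
print(matches)                    # the two valid codes
print(ORIGINAL.findall(sample))   # empty list
```

The first call returns the hyphen and en-dash codes only: the ten-digit number is rejected by the lookahead and XDOC by the word boundary. The original pattern finds nothing because, without re.MULTILINE, ^ and $ anchor to the start and end of the whole text, so it could only match if the entire document were a single code.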
