How do I extract specific lines from a string, starting at one keyword and ending at a different keyword, in Python?

The goal of my code is to take the text of a Word document and, for every occurrence of a keyword, extract the text from that keyword up to the associated part number. For example:

The processor 204 performs one or more of detecting, by a component in a transport, that another component has been removed 244C, detecting, by the component,

Would become:

detecting, by a component in a transport, that another component has been removed 244C

In addition to this, I need to take that text and center it within an image that I've created with my code. Here is my code:

import re
import time
import textwrap
from docx import Document
from PIL import Image, ImageFont, ImageDraw

doc = Document('PatentDocument.docx')
docText = ''.join(paragraph.text for paragraph in doc.paragraphs)
print(docText)

for i, p in enumerate(docText):
    W, H = 300, 300
    body = Image.new('RGB', (W, H), (255, 255, 255))
    border = Image.new('RGB', (W + 2, H + 2), (0, 0, 0))
    border.save('border.png')
    body.save('body.png')
    patent = Image.open('border.png')
    patent.paste(body, (1, 1))
    draw = ImageDraw.Draw(patent)
    font = ImageFont.load_default()

    current_h, pad = 60, 20
    keywords = ['responsive', 'detecting', 'providing', 'Responsive', 'Detecting', 'Providing']
    pattern = re.compile('|'.join(keywords))
    parts = re.findall("\d{1,3}[C]", docText)
    print(parts)
    for keywords in textwrap.wrap(docText, width=50):
        line = keywords.encode('utf-8')
        w, h = draw.textsize(line, font=font)
        draw.text(((W-w)/2, current_h), line, (0, 0, 0), font=font)
        current_h += h + pad

    patent.save(f'patent_{i+1}_{time.strftime("%Y%m%d%H%M%S")}.png')

What my code currently does is print the string containing the entire text of the Word document, and output an image of the entire text 500+ times, which is the character count of the string. Here is an example of one of my outputs:

(Image: example output)

This output is repeated 500+ times. In addition to that, the following is printed in the run window:

[0054] The processor 204 performs one or more of detecting, by a component in a transport, that another component has been removed 244C, detecting, by the component, that a replacement component has been added in the transport 246C, providing, by the component, data to the replacement component, wherein the data attempts to subvert an authorized functionality of the replacement component 248C, and responsive to a non-subversion of the authorized functionality, permitting, by the component, use of the authorized functionality of the replacement component 249C.
['244C', '246C', '248C', '249C']

Except that the array following the paragraph is repeated 500+ times as well.
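A likely explanation for the repetition, sketched below: `docText` is a single string, so `enumerate(docText)` yields one item per character and the loop body runs once per character. Looping over the regex matches instead would run the body once per part number. The sample string is a shortened stand-in for the document text, not the real file.

```python
import re

# Shortened stand-in for docText (assumption: the real text comes from the .docx file).
doc_text = "detecting that a part was removed 244C, providing data 248C."

# Iterating over a string visits every character, so a loop written as
# `for i, p in enumerate(doc_text)` runs len(doc_text) times.
iterations_over_chars = sum(1 for _ in enumerate(doc_text))

# Iterating over the matches runs once per part number instead.
parts = re.findall(r"\d{1,3}C", doc_text)

print(iterations_over_chars == len(doc_text))
print(parts)
```

Moving the image creation inside a loop over `parts` (or over the matched phrases) would therefore produce one image per part number rather than one per character.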

This is the Word document that I'm reading from and converting into a single string:

[0054] The processor 204 performs one or more of detecting, by a component in a transport, that another component has been removed 244C, detecting, by the component, that a replacement component has been added in the transport 246C, providing, by the component, data to the replacement component, wherein the data attempts to subvert an authorized functionality of the replacement component 248C, and responsive to a non-subversion of the authorized functionality, permitting, by the component, use of the authorized functionality of the replacement component 249C.

I currently want to know how to extract the specific lines from the string I made. The output should look like this (ignoring the boxes and the centering; I'm only looking to output those lines from the paragraph I gave):

(Image: desired output)

Some pseudocode for this would be something like:

for keyword in docText:
     print({keyword, part number})

My current implementation uses docx, PIL, and re, though I'm happy to use anything that will accomplish my goals. Anything helps!
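One way the pseudocode above might be turned into working Python is sketched here. The function name `keyword_part_pairs` and the sample string are made up for illustration; the keywords and the part-number pattern (`\d{1,3}C`) are taken from the question. It assumes a part number always follows its keyword somewhere later in the same string.

```python
import re

KEYWORDS = ("responsive", "detecting", "providing")  # from the question's keyword list


def keyword_part_pairs(text, keywords=KEYWORDS):
    """Return (keyword, part_number, phrase) for each keyword-to-part-number span."""
    alternation = "|".join(re.escape(k) for k in keywords)
    # (?i:...) makes only the keyword case-insensitive; the lazy .*? stops at
    # the first part number that follows the keyword.
    pattern = re.compile(rf"\b((?i:{alternation}))\b.*?(\d{{1,3}}C)")
    return [(m.group(1), m.group(2), m.group(0)) for m in pattern.finditer(text)]


# Shortened sample taken from the paragraph in the question.
sample = (
    "The processor 204 performs one or more of detecting, by a component in a "
    "transport, that another component has been removed 244C, detecting, by the "
    "component, that a replacement component has been added in the transport 246C"
)

pairs = keyword_part_pairs(sample)
for keyword, part, phrase in pairs:
    print(keyword, "->", part)
    print(phrase)
```

Each tuple carries the keyword, the part number, and the full phrase between them, so the phrase can be printed or drawn on its own.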

So, after some help from an outside source, I managed to get it all sorted out. Leaving aside the code for outputting to images with centered text, this is the code that solves my main issue:

import re

from docx import Document

# Read every paragraph of the Word document into one string.
doc = Document('PatentDocument.docx')
docText = ''.join(paragraph.text for paragraph in doc.paragraphs)
print(docText)


def get(source, begin, end):
    # Return the text between the first `begin` and the next `end`,
    # or an empty string if either marker is missing.
    try:
        start = source.index(begin) + len(begin)
        finish = source.index(end, start)
        return source[start:finish]
    except ValueError:
        return ""


def create_regex(keywords=('responsive', 'providing', 'detecting')):
    # For the default keywords this builds the equivalent of:
    #   ([Rr]esponsive|[Pp]roviding|[Dd]etecting).*?(\d{1,3}C)
    regex = (
        "("
        + "|".join((f"[{k[0].upper()}{k[0].lower()}]{k[1:]}" for k in keywords))
        + ")"
        + ".*?(\\d{1,3}C)"
    )
    return re.compile(regex)


def find_matches(text, keywords):
    # Each match runs from a keyword to the first part number after it.
    return [m.group() for m in re.finditer(create_regex(keywords), text)]


for match in find_matches(
    text=docText, keywords=("responsive", "detecting", "providing")
):
    print(match)

So, from the source document:

[0054] The processor 204 performs one or more of detecting, by a component in a transport, that another component has been removed 244C, detecting, by the component, that a replacement component has been added in the transport 246C, providing, by the component, data to the replacement component, wherein the data attempts to subvert an authorized functionality of the replacement component 248C, and responsive to a non-subversion of the authorized functionality, permitting, by the component, use of the authorized functionality of the replacement component 249C.

I get the following output:

[0054] The processor 204 performs one or more of detecting, by a component in a transport, that another component has been removed 244C, detecting, by the component, that a replacement component has been added in the transport 246C, providing, by the component, data to the replacement component, wherein the data attempts to subvert an authorized functionality of the replacement component 248C, and responsive to a non-subversion of the authorized functionality, permitting, by the component, use of the authorized functionality of the replacement component 249C.

detecting, by a component in a transport, that another component has been removed 244C

detecting, by the component, that a replacement component has been added in the transport 246C

providing, by the component, data to the replacement component, wherein the data attempts to subvert an authorized functionality of the replacement component 248C

responsive to a non-subversion of the authorized functionality, permitting, by the component, use of the authorized functionality of the replacement component 249C

In the actual output there are no blank lines between the printed document string and the keyword strings; I've separated them here for ease of reading. Hope this can help someone else out!
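The image step that the answer leaves out could be approached along the following lines. This is only a sketch of the centering arithmetic from the question's code: `layout_centered` and `fake_measure` are hypothetical names, and the measuring function is passed in so the example runs without Pillow. With Pillow, the measurement would probably come from `ImageDraw.textbbox`, since `ImageDraw.textsize` is no longer available in recent Pillow releases.

```python
import textwrap


def layout_centered(text, measure, canvas_width, start_y=60, pad=20, wrap_at=50):
    """Return [(x, y, line), ...] with each wrapped line centered horizontally.

    `measure` is a caller-supplied function returning (width, height) in
    pixels for one line. With Pillow it might look like:
        left, top, right, bottom = draw.textbbox((0, 0), line, font=font)
        return right - left, bottom - top
    """
    placed = []
    y = start_y
    for line in textwrap.wrap(text, width=wrap_at):
        w, h = measure(line)
        placed.append(((canvas_width - w) / 2, y, line))
        y += h + pad
    return placed


def fake_measure(line):
    # Stand-in measurement for the demo: 6 px per character, 11 px tall (an assumption).
    return (6 * len(line), 11)


phrase = "detecting, by a component in a transport, that another component has been removed 244C"
for x, y, line in layout_centered(phrase, fake_measure, canvas_width=300):
    print(x, y, line)
```

Each returned tuple can be passed to `draw.text((x, y), line, ...)`, calling the function once per matched phrase so that every image holds a single keyword-to-part-number line.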

