How do I extract specific lines from a string starting from a keyword and ending at a different keyword in Python?
The goal of my code is to take text from a Word document and, for every instance of a keyword, extract the text from that keyword up to its associated part number. For example:
The processor 204 performs one or more of detecting, by a component in a transport, that another component has been removed 244C, detecting, by the component,
Would become:
detecting, by a component in a transport, that another component has been removed 244C
In addition to this, I need to take that text and center it within an image that I've created with my code. Here is my code:
import re
import time
import textwrap
from docx import Document
from PIL import Image, ImageFont, ImageDraw
doc = Document('PatentDocument.docx')
docText = ''.join(paragraph.text for paragraph in doc.paragraphs)
print(docText)
for i, p in enumerate(docText):
    W, H = 300, 300
    body = Image.new('RGB', (W, H), (255, 255, 255))
    border = Image.new('RGB', (W + 2, H + 2), (0, 0, 0))
    border.save('border.png')
    body.save('body.png')
    patent = Image.open('border.png')
    patent.paste(body, (1, 1))
    draw = ImageDraw.Draw(patent)
    font = ImageFont.load_default()
    current_h, pad = 60, 20
    keywords = ['responsive', 'detecting', 'providing', 'Responsive', 'Detecting', 'Providing']
    pattern = re.compile('|'.join(keywords))
    parts = re.findall("\d{1,3}[C]", docText)
    print(parts)
    for keywords in textwrap.wrap(docText, width=50):
        line = keywords.encode('utf-8')
        w, h = draw.textsize(line, font=font)
        draw.text(((W-w)/2, current_h), line, (0, 0, 0), font=font)
        current_h += h + pad
    patent.save(f'patent_{i+1}_{time.strftime("%Y%m%d%H%M%S")}.png')
What my code currently does is print the string containing the entire text of the Word document, and output an image of the entire text 500+ times, which is the character count of the string. Here is an example of one of my outputs:
This output is repeated 500+ times. In addition to that, these get output in the run window:
[0054] The processor 204 performs one or more of detecting, by a component in a transport, that another component has been removed 244C, detecting, by the component, that a replacement component has been added in the transport 246C, providing, by the component, data to the replacement component, wherein the data attempts to subvert an authorized functionality of the replacement component 248C, and responsive to a non-subversion of the authorized functionality, permitting, by the component, use of the authorized functionality of the replacement component 249C.
['244C', '246C', '248C', '249C']
Except that the array following the paragraph is repeated 500+ times as well.
This is the Word document that I'm reading from and converting into a single string:
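The repetition is consistent with how the outer loop is written: enumerate() over a string yields one (index, character) pair per character, so the loop body runs once for every character in docText. A minimal sketch with a short stand-in string (doc_text is hypothetical, not the real document text):

```python
import re

# Hypothetical stand-in for the text read from the Word document.
doc_text = "detecting 244C"

# enumerate() over a str yields (index, character) pairs, not lines or
# sentences, so a loop over it runs len(doc_text) times.
iterations = [(i, ch) for i, ch in enumerate(doc_text)]
print(len(iterations) == len(doc_text))
print(iterations[:3])

# Looping over the regex matches instead runs once per part number.
parts = re.findall(r"\d{1,3}C", doc_text)
print(parts)
```

Iterating over the matches (or over doc.paragraphs) rather than over the string itself would probably give one image per extracted line instead of one per character.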
[0054] The processor 204 performs one or more of detecting, by a component in a transport, that another component has been removed 244C, detecting, by the component, that a replacement component has been added in the transport 246C, providing, by the component, data to the replacement component, wherein the data attempts to subvert an authorized functionality of the replacement component 248C, and responsive to a non-subversion of the authorized functionality, permitting, by the component, use of the authorized functionality of the replacement component 249C.
I currently want to know how to extract the specific lines from the string I made. The output should look like this (ignoring the boxes and the centering); I'm only looking to output those lines from the paragraph I gave:
Some pseudocode for this would be something like:
for keyword in docText:
    print({keyword, part number})
My current implementation uses docx, PIL and re, though I'm happy to use anything that will accomplish my goals. Anything helps!
So, after some help from an outside source I managed to get it all sorted out. Minus the code for outputting to images with centered text and all that, this is the code that solves my main issue:
import re

from docx import Document
from PIL import Image, ImageFont, ImageDraw

doc = Document('PatentDocument.docx')
docText = ''.join(paragraph.text for paragraph in doc.paragraphs)
print(docText)


def get(source, begin, end):
    # Return the text between the first `begin` and the next `end`,
    # or an empty string if either marker is missing.
    try:
        start = source.index(begin) + len(begin)
        finish = source.index(end, start)
        return source[start:finish]
    except ValueError:
        return ""


def create_regex(keywords=('responsive', 'providing', 'detecting')):
    # Builds a pattern such as:
    # ([Rr]esponsive|[Pp]roviding|[Dd]etecting).*?(\d{1,3}C)
    regex = (
        "("
        + "|".join(f"[{k[0].upper()}{k[0].lower()}]{k[1:]}" for k in keywords)
        + ")"
        + r".*?(\d{1,3}C)"
    )
    return re.compile(regex)


def find_matches(text, keywords):
    # Each match runs from a keyword to the nearest following part number.
    return [m.group() for m in re.finditer(create_regex(keywords), text)]


for match in find_matches(
    text=docText, keywords=("responsive", "detecting", "providing")
):
    print(match)
So, from the source document:
[0054] The processor 204 performs one or more of detecting, by a component in a transport, that another component has been removed 244C, detecting, by the component, that a replacement component has been added in the transport 246C, providing, by the component, data to the replacement component, wherein the data attempts to subvert an authorized functionality of the replacement component 248C, and responsive to a non-subversion of the authorized functionality, permitting, by the component, use of the authorized functionality of the replacement component 249C.
I get the following output:
[0054] The processor 204 performs one or more of detecting, by a component in a transport, that another component has been removed 244C, detecting, by the component, that a replacement component has been added in the transport 246C, providing, by the component, data to the replacement component, wherein the data attempts to subvert an authorized functionality of the replacement component 248C, and responsive to a non-subversion of the authorized functionality, permitting, by the component, use of the authorized functionality of the replacement component 249C.
detecting, by a component in a transport, that another component has been removed 244C
detecting, by the component, that a replacement component has been added in the transport 246C
providing, by the component, data to the replacement component, wherein the data attempts to subvert an authorized functionality of the replacement component 248C
responsive to a non-subversion of the authorized functionality, permitting, by the component, use of the authorized functionality of the replacement component 249C
The printed string and the keyword strings that follow it have no spaces between them in the actual output, but for ease of reading I've separated them here. Hope this can help someone else out!
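For anyone who wants to try the extraction without a Word file, here is a small self-contained variant of the same idea. It is only a sketch: extract_steps and sample are names made up for this example, and it swaps the hand-built [Rr]-style character classes for re.IGNORECASE plus re.escape, which is an assumption about what is acceptable here rather than what the code above does. It also returns the keyword and the part number separately, closer to the pseudocode in the question.

```python
import re


def extract_steps(text, keywords=("responsive", "detecting", "providing")):
    """Return (keyword, part_number, full_span) for each keyword-to-part span."""
    # re.escape guards against keywords containing regex metacharacters;
    # re.IGNORECASE covers "Detecting" as well as "detecting".
    alternation = "|".join(re.escape(k) for k in keywords)
    pattern = re.compile(rf"\b({alternation})\b.*?(\d{{1,3}}C)", re.IGNORECASE)
    return [(m.group(1), m.group(2), m.group(0)) for m in pattern.finditer(text)]


# The paragraph from the question, used in place of the .docx contents.
sample = (
    "[0054] The processor 204 performs one or more of detecting, by a component "
    "in a transport, that another component has been removed 244C, detecting, "
    "by the component, that a replacement component has been added in the "
    "transport 246C, providing, by the component, data to the replacement "
    "component, wherein the data attempts to subvert an authorized functionality "
    "of the replacement component 248C, and responsive to a non-subversion of the "
    "authorized functionality, permitting, by the component, use of the "
    "authorized functionality of the replacement component 249C."
)

steps = extract_steps(sample)
for keyword, part, span in steps:
    print(f"{part}: {span}")
```

Because .*? is non-greedy, each span stops at the first part number after its keyword. Note that . does not match newlines by default, so this relies on the paragraphs being joined into one line, as in the code above.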