Extracting Text from "Pseudo" HTML

I am trying to rebuild work orders from a Manufacturing Execution System (MES) SQL database as .pdf files so that they can be printed in bulk rather than one at a time (one at a time is the only option the MES offers).

I am stuck on the work instructions that contain links and similar markup (the pseudo-HTML; I am not sure what else to call it). I run the SQL query for the data I need and load it into a pandas DataFrame. The following is an example of the "Text" column (the work instructions) in the DataFrame:

 "DWG/TECH DATA: ALL TASK WITHIN THIS WORK ORDER ARE TO BE ACCOMPLISHED IAW: <#Tab><UT=""LinkInvoke(Slide(OBJECT_ID=OBJECTID,@GlyphName=@InlineText,@Classification=General,@RenderDescText=True,'@Desc=| Description: PANEL, |',@Caption=DWG 123456-123,REF_ID=REFID))""><#Tab> MOA DWG: <#Tab><UT=""LinkInvoke(Slide(OBJECT_ID=OBJECT ID,@GlyphName=@InlineText,@Classification=General,@RenderDescText=True,'@Desc=| Description: FACEPLATES |',@Caption=DWG 98765 Plate,REF_ID=REFID))""> <#Tab><UT=""LinkInvoke(Slide(OBJECT_ID=OBJID,@GlyphName=@InlineText,@Classification=General,@RenderDescText=True,'@Desc=| Description: ARTWORK |',@Caption=DWG 9999-8888 ARTWORK,REF_ID=REFID))""><#Tab>"

The data I am trying to return should look something like this:

DWG/TECH DATA: ALL TASK WITHIN THIS WORK ORDER ARE TO BE ACCOMPLISHED IAW:

DWG 123456-123

MOA DWG:

DWG 98765 Plate
DWG 9999-8888 ARTWORK

The text tends to contain a lot of copy-pasted material, so finding patterns proved too difficult for my regular expression skills. Essentially, I think it would work if everything between a "<" and a ">" were deleted, except for the text between "@Caption=" and the "," that follows it.

I also tried to extract the text with BeautifulSoup, but the caption never came out.

Any advice or help would be greatly appreciated.

With string manipulation (not regex), something along these lines works:

work = '''DWG/TECH DATA: ALL TASK WITHIN THIS WORK ORDER ARE TO BE ACCOMPLISHED IAW:
<#Tab><UT=""LinkInvoke(Slide(OBJECT_ID=OBJECTID,@GlyphName=@InlineText,@Classification=General,@RenderDescText=True,'@Desc=| Description: PANEL, |',@Caption=DWG 123456-123 ,REF_ID=REFID))""><#Tab>
MOA DWG:
<#Tab><UT=""LinkInvoke(Slide(OBJECT_ID=OBJECT ID,@GlyphName=@InlineText,@Classification=General,@RenderDescText=True,'@Desc=| Description: FACEPLATES |',@Caption=DWG 98765 Plate,REF_ID=REFID))"">
<#Tab><UT=""LinkInvoke(Slide(OBJECT_ID=OBJID,@GlyphName=@InlineText,@Classification=General,@RenderDescText=True,'@Desc=| Description: ARTWORK |',@Caption=DWG 9999-8888 ARTWORK ,REF_ID=REFID))""><#Tab>"
'''

# Assumes at most one link (one "@Caption=") per line, as in the sample above.
for line in work.splitlines():
    if "@Caption=" in line:
        # Keep only the text between "@Caption=" and the next comma.
        caption = line.split("@Caption=")[1].split(",")[0]
        print(caption.strip())
    else:
        # Plain text line: print it unchanged.
        print(line)

Output:

DWG/TECH DATA: ALL TASK WITHIN THIS WORK ORDER ARE TO BE ACCOMPLISHED IAW:
DWG 123456-123
MOA DWG:
DWG 98765 Plate
DWG 9999-8888 ARTWORK
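
The rule described in the question (drop everything between "<" and ">", but keep the text between "@Caption=" and the next comma) can also be sketched with the standard `re` module, which works on the original single-line string without splitting it by hand first. This is only a sketch under assumptions: the names `TAG_RE`, `CAPTION_RE`, and `extract_text` are made up for illustration, the sample string is a shortened version of the one in the question, and it assumes that no tag contains a ">" and no caption contains a ",".

```python
import re

# Assumed tag shape: "<", any run of characters other than ">", then ">".
TAG_RE = re.compile(r"<[^>]*>")
# Caption text runs from "@Caption=" up to the next comma (or the tag end).
CAPTION_RE = re.compile(r"@Caption=([^,>]*)")


def extract_text(raw):
    """Return the visible lines of one work-instruction string."""

    def replace_tag(match):
        caption = CAPTION_RE.search(match.group(0))
        if caption is None:
            # Tags without a caption (for example <#Tab>) are dropped.
            return ""
        # Put each caption on its own line.
        return "\n" + caption.group(1).strip() + "\n"

    cleaned = TAG_RE.sub(replace_tag, raw)
    # Trim whitespace and discard the empty lines left by removed tags.
    return [line.strip() for line in cleaned.splitlines() if line.strip()]


# Shortened sample in the same shape as the "Text" column in the question.
sample = (
    'IAW: <#Tab><UT=""LinkInvoke(Slide(OBJECT_ID=OBJECTID,'
    "'@Desc=| Description: PANEL, |',@Caption=DWG 123456-123,REF_ID=REFID))"
    '""><#Tab> MOA DWG: <#Tab><UT=""LinkInvoke(Slide(OBJECT_ID=OBJID,'
    "'@Desc=| Description: ARTWORK |',@Caption=DWG 9999-8888 ARTWORK,REF_ID=REFID))"
    '""><#Tab>'
)

for line in extract_text(sample):
    print(line)
# IAW:
# DWG 123456-123
# MOA DWG:
# DWG 9999-8888 ARTWORK
```

If the strings live in a DataFrame column, something like `df["Text"].apply(extract_text)` should apply the same function row by row, where `df` is assumed to be the DataFrame from the question. A caption that itself contains a comma would be cut short, so the pattern may need adjusting for real data.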
