Extracting Text from “Pseudo” HTML

I am trying to rebuild work orders from a Manufacturing Execution System (MES) SQL database in .pdf form so that they can be printed en masse rather than one at a time (which is the only option the MES offers).

I am stuck on the work instructions that contain links and similar markup (the pseudo-HTML; I'm not sure what else to call it). I run the SQL query for the data I need and load it into a Pandas dataframe. The following is an example of the "Text" column (the work instructions) in the dataframe:

 "DWG/TECH DATA: ALL TASK WITHIN THIS WORK ORDER ARE TO BE ACCOMPLISHED IAW: <#Tab><UT=""LinkInvoke(Slide(OBJECT_ID=OBJECTID,@GlyphName=@InlineText,@Classification=General,@RenderDescText=True,'@Desc=| Description: PANEL, |',@Caption=DWG 123456-123,REF_ID=REFID))""><#Tab> MOA DWG: <#Tab><UT=""LinkInvoke(Slide(OBJECT_ID=OBJECT ID,@GlyphName=@InlineText,@Classification=General,@RenderDescText=True,'@Desc=| Description: FACEPLATES |',@Caption=DWG 98765 Plate,REF_ID=REFID))""> <#Tab><UT=""LinkInvoke(Slide(OBJECT_ID=OBJID,@GlyphName=@InlineText,@Classification=General,@RenderDescText=True,'@Desc=| Description: ARTWORK |',@Caption=DWG 9999-8888 ARTWORK,REF_ID=REFID))""><#Tab>"

The data I am trying to return should look something like this:

DWG/TECH DATA: ALL TASK WITHIN THIS WORK ORDER ARE TO BE ACCOMPLISHED IAW:

DWG 123456-123

MOA DWG:

DWG 98765 Plate
DWG 9999-8888 ARTWORK

The text tends to contain a lot of copy-pasted content, so finding patterns proved too difficult for my regular expression skills. Essentially, I think it would work if everything between a "<" and a ">" were deleted, except for whatever sits between "@Caption=" and the following ",".

I also tried to extract the text with BeautifulSoup, but the caption never came out.

Any advice or help would be greatly appreciated.

With string manipulation (not regex), something along these lines works, provided each link sits on its own line as in the sample below:

work = '''DWG/TECH DATA: ALL TASK WITHIN THIS WORK ORDER ARE TO BE ACCOMPLISHED IAW:
<#Tab><UT=""LinkInvoke(Slide(OBJECT_ID=OBJECTID,@GlyphName=@InlineText,@Classification=General,@RenderDescText=True,'@Desc=| Description: PANEL, |',@Caption=DWG 123456-123 ,REF_ID=REFID))""><#Tab>
MOA DWG:
<#Tab><UT=""LinkInvoke(Slide(OBJECT_ID=OBJECT ID,@GlyphName=@InlineText,@Classification=General,@RenderDescText=True,'@Desc=| Description: FACEPLATES |',@Caption=DWG 98765 Plate,REF_ID=REFID))"">
<#Tab><UT=""LinkInvoke(Slide(OBJECT_ID=OBJID,@GlyphName=@InlineText,@Classification=General,@RenderDescText=True,'@Desc=| Description: ARTWORK |',@Caption=DWG 9999-8888 ARTWORK ,REF_ID=REFID))""><#Tab>"
'''

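# Each link looks like ...'@Desc=| Description: X |',@Caption=NAME,REF_ID=...
# so splitting a link line on '|' leaves the caption in the third piece.
for line in work.splitlines():
    parts = line.split('|')
    if len(parts) > 1:
        # parts[2] is e.g. "',@Caption=DWG 123456-123 ,REF_ID=REFID))..."
        # Take the text after the first '=' and before the next ','.
        print(parts[2].split('=')[1].split(',')[0].strip())
    else:
        # Plain text line with no link markup: print it unchanged.
        print(line)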

Output:

DWG/TECH DATA: ALL TASK WITHIN THIS WORK ORDER ARE TO BE ACCOMPLISHED IAW:
DWG 123456-123
MOA DWG:
DWG 98765 Plate
DWG 9999-8888 ARTWORK
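
If the text arrives as one long line, as in the dataframe example in the question, the rule the question describes (drop everything between "<" and ">" but keep what follows "@Caption=" up to the next ",") can also be sketched with the standard re module. This is only a rough sketch: the names TAG, CAPTION and extract_text are made up for illustration, and it assumes no tag contains a ">" and no caption contains a ",".

```python
import re

# Any <...> tag, e.g. <#Tab> or <UT=""LinkInvoke(...)"">.
# Assumption: a tag never contains '>' before its closing bracket.
TAG = re.compile(r"<[^>]*>")
# Caption text inside a link tag. Assumption: captions contain no commas.
CAPTION = re.compile(r"@Caption=([^,]*),")


def _replace_tag(match):
    """Turn a link tag into its caption; turn any other tag into a line break."""
    caption = CAPTION.search(match.group(0))
    if caption:
        return "\n" + caption.group(1).strip() + "\n"
    return "\n"


def extract_text(raw):
    """Strip the pseudo-HTML from raw, keeping plain text and link captions."""
    text = TAG.sub(_replace_tag, raw)
    lines = (line.strip() for line in text.splitlines())
    # Drop the empty lines left behind by removed tags.
    return "\n".join(line for line in lines if line)


# Sample from the question, written as one logical line (backslash-newline
# inside the literal joins the pieces without adding characters).
sample = '''DWG/TECH DATA: ALL TASK WITHIN THIS WORK ORDER ARE TO BE ACCOMPLISHED IAW: \
<#Tab><UT=""LinkInvoke(Slide(OBJECT_ID=OBJECTID,@GlyphName=@InlineText,\
@Classification=General,@RenderDescText=True,'@Desc=| Description: PANEL, |',\
@Caption=DWG 123456-123,REF_ID=REFID))""><#Tab> MOA DWG: \
<#Tab><UT=""LinkInvoke(Slide(OBJECT_ID=OBJECT ID,@GlyphName=@InlineText,\
@Classification=General,@RenderDescText=True,'@Desc=| Description: FACEPLATES |',\
@Caption=DWG 98765 Plate,REF_ID=REFID))""> \
<#Tab><UT=""LinkInvoke(Slide(OBJECT_ID=OBJID,@GlyphName=@InlineText,\
@Classification=General,@RenderDescText=True,'@Desc=| Description: ARTWORK |',\
@Caption=DWG 9999-8888 ARTWORK,REF_ID=REFID))""><#Tab>'''

print(extract_text(sample))
```

For the dataframe itself, something like df["Text"].apply(extract_text) should give the cleaned column, assuming df is the dataframe from the question and every value in "Text" is a string. Unlike the desired output in the question, this sketch drops the blank lines between sections; if those matter, the filtering step in extract_text would need to change.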
