Parse “quoted-printable” encoded text

I have some text encoded with quoted-printable, in which soft-breaks are made with the = symbol. I'm looking to parse (not decode) this text. Is there any way I can read the following,

<span style=3D"text-decoration: line-through; color: rgb(156, 163, 173);">8=
/23/2017-&nbsp;&nbsp;Lorem ipsum dolor sit amet, fastidii sad.Vim graece&nb=
sp; tractatos

As this:

8/23/2017-        Lorem ipsum dolor sit amet, fastidii sad.Vim graece    tractatos

This should be simple enough with the re module (untested, from memory):

import re

test_str = """<span style=3D"text-decoration: line-through; color: rgb(156, 163, 173);">8=
/23/2017-&nbsp;&nbsp;Lorem ipsum dolor sit amet, fastidii sad.Vim graece&nb=
sp; tractatos"""

# A trailing '=' marks a soft line break: drop it together with the newline
print(re.sub(r'=\r?\n', '', test_str))

But since you are asking to parse it, what would you like to retrieve? Parsing usually means extracting structured data, so your input should follow some grammar (it seems to):

  • first field is a date (in a certain format)
  • second field a message
  • third field (looks like there's a third field): category
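
As a rough illustration of that field split, here is a minimal sketch that assumes the text has already been decoded and stripped of tags, and that every entry looks like "M/D/YYYY-" followed by whitespace and the message. The names ENTRY_RE and parse_entry are hypothetical, and the category field is left out because the sample does not show where it would start.

```python
import re
from datetime import datetime

# Assumed layout (not confirmed by the question): "<M/D/YYYY>-<spaces><message>"
ENTRY_RE = re.compile(
    r'^(?P<date>\d{1,2}/\d{1,2}/\d{4})-\s*(?P<message>.*)$',
    re.DOTALL,
)


def parse_entry(text):
    """Split one decoded entry into a (date, message) tuple."""
    match = ENTRY_RE.match(text.strip())
    if match is None:
        raise ValueError("entry does not match the assumed layout: %r" % text)
    date = datetime.strptime(match.group("date"), "%m/%d/%Y").date()
    return date, match.group("message")


print(parse_entry("8/23/2017-  Lorem ipsum dolor sit amet"))
```

In Python 3, \s also matches the non-breaking space that &nbsp; decodes to, so the separator is consumed whether or not it has been normalised to plain spaces.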

EDIT:

The simplest form:

import html
import quopri

test_str = """<span style=3D"text-decoration: line-through; color: rgb(156, 163, 173);">8=
/23/2017-&nbsp;&nbsp;Lorem ipsum dolor sit amet, fastidii sad.Vim graece&nb=
sp; tractatos"""

# quopri works on bytes in Python 3; decode the result before unescaping HTML entities
decoded = quopri.decodestring(test_str.encode("ascii")).decode("utf-8")
print(html.unescape(decoded))
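
The snippet above decodes the text but leaves the <span> tag in place. If only the visible text is wanted, one option is a small html.parser subclass from the standard library. This is a sketch for Python 3, not a complete HTML sanitiser: the names TextExtractor and qp_html_to_text are made up for this example, and the UTF-8 default charset is an assumption.

```python
import quopri
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects only the text content, dropping all tags."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &nbsp; into characters
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def qp_html_to_text(raw, charset="utf-8"):
    # Undo quoted-printable: joins soft line breaks and decodes =XX escapes
    decoded = quopri.decodestring(raw.encode("ascii")).decode(charset)
    extractor = TextExtractor()
    extractor.feed(decoded)
    extractor.close()
    # &nbsp; becomes U+00A0; map it to a plain space for readability
    return "".join(extractor.chunks).replace("\xa0", " ")


test_str = """<span style=3D"text-decoration: line-through; color: rgb(156, 163, 173);">8=
/23/2017-&nbsp;&nbsp;Lorem ipsum dolor sit amet, fastidii sad.Vim graece&nb=
sp; tractatos"""

print(qp_html_to_text(test_str))
```

Drop the final replace call if the non-breaking spaces should be preserved.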

A parser might be overkill for this problem, but pyparsing is an easy parsing library to handle some of the trickier rules. Also, it comes with some HTML tag expressions already built in:

import pyparsing as pp

sample = """\
<span style=3D"text-decoration: line-through; color: rgb(156, 163, 173);">8=
/23/2017-&nbsp;&nbsp;Lorem ipsum dolor sit amet, fastidii sad.Vim graece&nb=
sp; tractatos"""

# remove soft line breaks (a trailing '=' followed by a newline)
sample = sample.replace("=\n", "")

# convert =XX to chr(int(XX, 16)), like =3D -> '='
hex_escape = pp.Regex(r'=[0-9a-fA-F]{2}')
hex_escape.setParseAction(lambda t: chr(int(t[0][1:], 16)))
sample = hex_escape.transformString(sample)

# convert HTML entities like &nbsp; and suppress all opening and closing HTML tags
pp.commonHTMLEntity.setParseAction(pp.replaceHTMLEntity)
stripper = pp.anyOpenTag.suppress() | pp.anyCloseTag.suppress() | pp.commonHTMLEntity

Use the stripper to transform your input string:

stripped = stripper.transformString(sample)
print(stripped)

Prints

8/23/2017-  Lorem ipsum dolor sit amet, fastidii sad.Vim graece  tractatos
