简体   繁体   中英

Extracting string from html text

I am getting html with curl and need to extract only the second table statement . Mind that the curled html is a single string and not formated. For better explaination see the following: (... stands for more html)

...
<table width="100%" cellpadding="0" cellspacing="0" class="table">
...
</table>
...
#I need to extract the following table
#from here
<table width="100%" cellpadding="4">
...
</table> #to this
...

I tried multiple SED lines so far, also I think that trying to match the second table like this is not the smooth way:

sed -n '/<table width="100%" cellpadding="4"/,/table>/p'

An html parser would be better, but you can use awk like this:

awk '/<table width="100%" cellpadding="4">/ {f=1} f; /<\/table>/ {f=0}' file
<table width="100%" cellpadding="4">
...
</table> #to this
  • /<table width="100%" cellpadding="4">/ {f=1} when start is found set flag f to true
  • f; if flage f is true, do default action, print line.
  • /<\/table>/ {f=0} when end is found, clear flag f to stop print.

This could also be used, but like the flag control better:

awk '/<table width="100%" cellpadding="4">/,/<\/table>/' file
<table width="100%" cellpadding="4">
...
</table> #to this

Save the script below as script.py and run it like this:

python3 script.py input.html

This script parses the HTML and checks for the attributes ( width and cellpadding ). The advantage of this approach is that if you change the formatting of the HTML file it will still work because the script parses the HTML instead of relying on exact string matching.

from html.parser import HTMLParser
import sys

def print_tag(tag, attrs, end=False):
    line = "<" 
    if end:
        line += "/"
    line += tag
    for attr, value in attrs:
        line += " " + attr + '="' + value + '"'
    print(line + ">", end="")

if len(sys.argv) < 2:
    print("ERROR: expected argument - filename")
    sys.exit(1)

with open(sys.argv[1], 'r', encoding='cp1252') as content_file:
    content = content_file.read()

do_print = False

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        global do_print
        if tag == "table":
            if ("width", "100%") in attrs and ("cellpadding", "4") in attrs:
                do_print = True
        if do_print:
            print_tag(tag, attrs)

    def handle_endtag(self, tag):
        global do_print
        if do_print:
            print_tag(tag, attrs=(), end=True)
            if tag == "table":
                do_print = False

    def handle_data(self, data):
        global do_print
        if do_print:
            print(data, end="")

parser = MyHTMLParser()
parser.feed(content)

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM