
Tokenize and label HTML source code using Python

I have some annotated HTML source code, where the code is similar to what you would get using requests, and the annotations are labels with the character indices where each labelled item starts and ends.

For example, the source code could be:

<body><text>Hello world!</text><text>This is my code. And this is a number 42</text></body>

and the labels could be for example:

[{'label':'salutation', 'start':12, 'end':25},
 {'label':'verb', 'start':42, 'end':45},
 {'label':'size', 'start':75, 'end':78}]

These refer to the words 'Hello world', 'is' and '42', respectively. We know in advance that the labels do not overlap.

I want to process the source code and the annotations to produce a list of tokens appropriate for HTML, where each tag is a single token.

For example, it could produce here something like this:

['<body>', '<text>', 'hello', 'world', '</text>', '<text>', 'this', 'is', 'my', 'code', 'and', 'this', 'is', 'a', 'number', '[NUMBER]', '</text>', '</body>']

Furthermore, it must map the annotations onto the tokenization, producing a sequence of labels with the same length as the token list, for example:

['NONE', 'NONE', 'salutation', 'salutation', 'NONE', 'NONE', 'NONE', 'verb', 'NONE', 'NONE', 'NONE', 'NONE', 'NONE', 'NONE', 'NONE', 'size', 'NONE', 'NONE']

What is the easiest way of accomplishing this in Python?

You can use recursion with BeautifulSoup to produce a list of all tags and content, which can then be used to match the labels:

from bs4 import BeautifulSoup as soup
import re
content = '<body><text>Hello world!</text><text>This is my code. And this is a number 42</text></body>'
def tokenize(d):
    # Emit the opening tag, recurse into child tags, split any text
    # nodes on whitespace, then emit the closing tag
    yield f'<{d.name}>'
    for i in d.contents:
        if not isinstance(i, str):
            yield from tokenize(i)
        else:
            yield from i.split()
    yield f'</{d.name}>'

data = list(tokenize(soup(content, 'html.parser').body))

Output:

['<body>', '<text>', 'Hello', 'world!', '</text>', '<text>', 'This', 'is', 'my', 'code.', 'And', 'this', 'is', 'a', 'number', '42', '</text>', '</body>']

Then, to match the labels:

labels = [{'label':'salutation', 'start':12, 'end':25}, {'label':'verb', 'start':42, 'end':45}, {'label':'size', 'start':75, 'end':78}]
# Attach the raw source substring to each label (the given end offsets are one past the exclusive end, hence the -1)
tokens = [{**i, 'word':content[i['start']:i['end']-1].split()} for i in labels]
# For each distinct token, an iterator over its [start, end] character offsets in the source; repeated tokens ('is', 'this') consume successive matches
indices = {i:iter([[c, c+len(i)+1] for c in range(len(content)) if re.findall(r'^\W'+re.escape(i), content[c-1:])]) for i in data}
new_data = [[i, next(indices[i], None)] for i in data]
# A token is labelled when its character span falls inside a label's span
result = [(lambda x:'NONE' if not x else x[0])([c['label'] for c in tokens if b and c['start'] <= b[0] and b[-1] <= c['end']]) for a, b in new_data]

Output:

['NONE', 'NONE', 'salutation', 'salutation', 'NONE', 'NONE', 'NONE', 'verb', 'NONE', 'NONE', 'NONE', 'NONE', 'NONE', 'NONE', 'NONE', 'size', 'NONE', 'NONE']
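
Note that the tokens produced above keep their original casing and punctuation ('Hello', 'world!', '42'), whereas the desired token list is lowercased, punctuation-free, and replaces numbers with '[NUMBER]'. A minimal post-processing sketch, assuming any all-digit token should become the placeholder (normalize is a hypothetical helper; result stays aligned because tokens are only rewritten, never removed):

def normalize(token):
    # leave markup tokens such as '<body>' untouched
    if token.startswith('<') and token.endswith('>'):
        return token
    # drop punctuation and lowercase, e.g. 'world!' -> 'world'
    word = ''.join(ch for ch in token if ch.isalnum()).lower()
    # assumption: every all-digit token becomes the '[NUMBER]' placeholder
    return '[NUMBER]' if word.isdigit() else word

norm_tokens = [normalize(t) for t in data]

Applied to data above, this yields exactly the token list from the question.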

For the time being I have made this work using HTMLParser:

from html.parser import HTMLParser
from tensorflow.keras.preprocessing.text import text_to_word_sequence

class HTML_tokenizer_labeller(HTMLParser):
  def __init__(self, annotations, *args, **kwargs):
    super().__init__(*args, **kwargs)
    self.tokens = []
    self.labels = []
    self.annotations = annotations

  def handle_starttag(self, tag, attrs):
    self.tokens.append(f'<{tag}>')
    self.labels.append('NONE')

  def handle_endtag(self, tag):
    self.tokens.append(f'</{tag}>')
    self.labels.append('NONE')

  def handle_data(self, data):
    tokens = text_to_word_sequence(data)

    # getpos() returns (line, column); with single-line input the column
    # equals the absolute character offset of this data block
    pos = self.getpos()[1]
    for annotation in self.annotations:
      if annotation['start'] <= pos <= annotation['end']:
        label = annotation['label']
        break
    else:
      label = 'NONE'

    # Note: this assigns one label to the whole data block rather than
    # to individual tokens within it
    self.tokens += tokens
    self.labels += [label] * len(tokens)
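
To drive the parser, instantiate it with the annotations and feed it the source. A minimal usage sketch with the example data from the question (feed() triggers the handle_* callbacks; note that text_to_word_sequence already lowercases and strips punctuation):

annotations = [{'label':'salutation', 'start':12, 'end':25},
               {'label':'verb', 'start':42, 'end':45},
               {'label':'size', 'start':75, 'end':78}]
content = '<body><text>Hello world!</text><text>This is my code. And this is a number 42</text></body>'

parser = HTML_tokenizer_labeller(annotations)
parser.feed(content)   # runs the handle_* callbacks over the source
parser.close()
print(parser.tokens)
print(parser.labels)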
