
Tokenize and label HTML source code using Python

I have some annotated HTML source code, where the code is similar to what you would get using requests, and the annotations are labels carrying the character indices where each labelled item starts and ends.

For example, the source code could be:

<body><text>Hello world!</text><text>This is my code. And this is a number 42</text></body>

and the labels could be, for example:

[{'label':'salutation', 'start':12, 'end':25},
 {'label':'verb', 'start':42, 'end':45},
 {'label':'size', 'start':75, 'end':78}]

These refer to the words 'Hello world', 'is', and '42', respectively. We know in advance that the labels do not overlap.

I want to process the source code and the annotations to produce a list of tokens appropriate for the HTML format.

For example, it could produce something like this:

['<body>', '<text>', 'hello', 'world', '</text>', '<text>', 'this', 'is', 'my', 'code', 'and', 'this', 'is', 'a', 'number', '[NUMBER]', '</text>', '</body>']

Furthermore, it must map the annotations onto the tokenization, producing a sequence of labels of the same length as the token list, such as:

['NONE', 'NONE', 'salutation', 'salutation', 'NONE', 'NONE', 'NONE', 'verb', 'NONE', 'NONE', 'NONE', 'NONE', 'NONE', 'NONE', 'NONE', 'size', 'NONE', 'NONE']

What is the easiest way of accomplishing this in Python?

You can use recursion with BeautifulSoup to produce a list of all tags and content, which can then be used to match the labels:

from bs4 import BeautifulSoup as soup
import re
content = '<body><text>Hello world!</text><text>This is my code. And this is a number 42</text></body>'
def tokenize(d):
  # Open the current tag
  yield f'<{d.name}>'
  for i in d.contents:
    if not isinstance(i, str):
      # Recurse into nested tags
      yield from tokenize(i)
    else:
      # Split raw text into whitespace-delimited tokens
      yield from i.split()
  # Close the current tag
  yield f'</{d.name}>'

data = list(tokenize(soup(content, 'html.parser').body))

Output:

['<body>', '<text>', 'Hello', 'world!', '</text>', '<text>', 'This', 'is', 'my', 'code.', 'And', 'this', 'is', 'a', 'number', '42', '</text>', '</body>']
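Note that this output still carries the original casing and punctuation ('Hello', 'world!', 'code.', '42'), while the tokenization asked for in the question is lowercased, stripped of punctuation, and replaces numeric tokens with [NUMBER]. A minimal post-processing sketch (the normalization rules are my reading of the desired output, not part of the answer above):

```python
import re

def normalize(token):
    # Tags like '<body>' pass through unchanged
    if token.startswith('<') and token.endswith('>'):
        return token
    # Replace purely numeric tokens with a placeholder (assumed convention)
    if token.strip('.,!?').isdigit():
        return '[NUMBER]'
    # Lowercase and drop punctuation characters
    return re.sub(r'[^\w]', '', token).lower()

data = ['<body>', '<text>', 'Hello', 'world!', '</text>', '<text>',
        'This', 'is', 'my', 'code.', 'And', 'this', 'is', 'a',
        'number', '42', '</text>', '</body>']
print([normalize(t) for t in data])
# → ['<body>', '<text>', 'hello', 'world', '</text>', '<text>', 'this', 'is',
#    'my', 'code', 'and', 'this', 'is', 'a', 'number', '[NUMBER]', '</text>', '</body>']
```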

Then, to match the labels:

labels = [{'label':'salutation', 'start':12, 'end':25}, {'label':'verb', 'start':42, 'end':45}, {'label':'size', 'start':75, 'end':78}]
# Words covered by each label span
tokens = [{**i, 'word':content[i['start']:i['end']-1].split()} for i in labels]
# For each token, an iterator over its candidate [start, end] positions in the source
indices = {i:iter([[c, c+len(i)+1] for c in range(len(content)) if re.findall(r'^\W'+i, content[c-1:])]) for i in data}
new_data = [[i, next(indices[i], None)] for i in data]
# A token gets a label if its position falls inside that label's span, else 'NONE'
result = [(lambda x:'NONE' if not x else x[0])([c['label'] for c in tokens if b and c['start'] <= b[0] and b[-1] <= c['end']]) for a, b in new_data]

Output:

['NONE', 'NONE', 'salutation', 'salutation', 'NONE', 'NONE', 'NONE', 'verb', 'NONE', 'NONE', 'NONE', 'NONE', 'NONE', 'NONE', 'NONE', 'size', 'NONE', 'NONE']
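The index-matching expressions above are fairly dense. An alternative sketch that scans the source once, using str.find to recover each token's character offset, may be easier to follow (the helper name is mine; note that find can match inside a longer word, so a word-boundary regex would be more robust on real data):

```python
content = '<body><text>Hello world!</text><text>This is my code. And this is a number 42</text></body>'
labels = [{'label': 'salutation', 'start': 12, 'end': 25},
          {'label': 'verb', 'start': 42, 'end': 45},
          {'label': 'size', 'start': 75, 'end': 78}]
data = ['<body>', '<text>', 'Hello', 'world!', '</text>', '<text>',
        'This', 'is', 'my', 'code.', 'And', 'this', 'is', 'a',
        'number', '42', '</text>', '</body>']

def label_tokens(content, tokens, labels):
    result, cursor = [], 0
    for token in tokens:
        start = content.find(token, cursor)   # first occurrence at or after cursor
        cursor = start + len(token)           # advance past this token
        # A token gets a label if it lies entirely inside that label's span
        match = next((lab['label'] for lab in labels
                      if lab['start'] <= start and start + len(token) <= lab['end']),
                     'NONE')
        result.append(match)
    return result

print(label_tokens(content, data, labels))
# → ['NONE', 'NONE', 'salutation', 'salutation', 'NONE', 'NONE', 'NONE', 'verb',
#    'NONE', 'NONE', 'NONE', 'NONE', 'NONE', 'NONE', 'NONE', 'size', 'NONE', 'NONE']
```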

For the time being I have made this work using HTMLParser:

from html.parser import HTMLParser
from tensorflow.keras.preprocessing.text import text_to_word_sequence

class HTML_tokenizer_labeller(HTMLParser):
  def __init__(self, annotations, *args, **kwargs):
    super().__init__(*args, **kwargs)
    self.tokens = []
    self.labels = []
    self.annotations = annotations

  def handle_starttag(self, tag, attrs):
    self.tokens.append(f'<{tag}>')
    self.labels.append('OTHER')

  def handle_endtag(self, tag):
    self.tokens.append(f'</{tag}>')
    self.labels.append('OTHER')

  def handle_data(self, data):
    tokens = text_to_word_sequence(data)

    # getpos() returns (line, column); the column only equals the absolute
    # character index when the input is a single line
    pos = self.getpos()[1]
    for annotation in self.annotations:
      if annotation['start'] <= pos <= annotation['end']:
        label = annotation['label']
        break
    else:
      label = 'OTHER'

    self.tokens += tokens
    self.labels += [label] * len(tokens)
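A self-contained variant of this approach, with a plain split-based tokenizer substituted for text_to_word_sequence so it runs without TensorFlow (the class name and the per-word offset tracking are mine; it emits 'NONE' to match the format asked for in the question, and again the getpos() column is only an absolute index for single-line input):

```python
from html.parser import HTMLParser

class HTMLTokenizerLabeller(HTMLParser):
    def __init__(self, annotations, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.tokens, self.labels = [], []
        self.annotations = annotations

    def handle_starttag(self, tag, attrs):
        self.tokens.append(f'<{tag}>')
        self.labels.append('NONE')

    def handle_endtag(self, tag):
        self.tokens.append(f'</{tag}>')
        self.labels.append('NONE')

    def handle_data(self, data):
        # getpos() returns (line, column); on single-line input the column
        # is the absolute character offset of this text chunk
        offset = self.getpos()[1]
        cursor = 0
        for word in data.split():
            cursor = data.index(word, cursor)  # offset of this word inside the chunk
            start = offset + cursor
            label = next((a['label'] for a in self.annotations
                          if a['start'] <= start < a['end']), 'NONE')
            self.tokens.append(word)
            self.labels.append(label)
            cursor += len(word)

content = '<body><text>Hello world!</text><text>This is my code. And this is a number 42</text></body>'
annotations = [{'label': 'salutation', 'start': 12, 'end': 25},
               {'label': 'verb', 'start': 42, 'end': 45},
               {'label': 'size', 'start': 75, 'end': 78}]
parser = HTMLTokenizerLabeller(annotations)
parser.feed(content)
print(parser.tokens)
print(parser.labels)
```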
