Handle malformed HTML (no closing tags)

Question

I use BeautifulSoup with the lxml parser to parse HTML, but I encountered a file that doesn't have ANY closing tags inside a <table>:

<table id='reportTable' class='report-table' style='width:auto' cellspacing='0'><tr>
<th>Номер<br>поезда<th>Дата<br>отправления<th>Маршрут<th>Причина<th>Комментарий<th>Станция ...

The <table> tag itself is properly closed, though.

Answer 1

I have come across this problem myself, and I run the whole document through HTML Tidy 5 using tidylib. That said, I agree with C. Feenstra that the lxml parser can tolerate malformed HTML. If you have HTML that you really can't parse with the lxml parser, try this:

from tidylib import tidy_document

# Sample input: header cells with no closing </th> or </tr> tags
badHtml = "<table id='reportTable' class='report-table' style='width:auto' cellspacing='0'><tr><th>Номер<br>поезда<th>Дата<br>отправления<th>Маршрут<th>Причина<th>Комментарий<th>Станция ..."
# HTML Tidy options: emit UTF-8 XHTML with no BOM, XML declaration, or generator meta tag
options = {"output-bom": 0, "quiet": False, "word-2000": True,
           "output-encoding": 'utf8', "output-xhtml": 1, "add-xml-decl": 0,
           "tidy-mark": 0, "drop-proprietary-attributes": True,
           "show-warnings": False, }
# tidy_document returns the repaired markup and a string of Tidy's diagnostics
tidiedHtml, errors = tidy_document(badHtml, options)

Then pass tidiedHtml to BeautifulSoup.
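If installing HTML Tidy is not an option, the rule that lenient parsers apply here (a new <th> implicitly closes the previous one) can be sketched with the standard library alone. This is not what the answer above uses; it is a minimal illustration that only collects header cell text. The names HeaderCellCollector and header_cells are hypothetical, and the sample is a shortened copy of the markup from the question.

```python
from html.parser import HTMLParser


class HeaderCellCollector(HTMLParser):
    """Collect the text of each <th>, treating a new <th> as closing the previous one."""

    def __init__(self):
        super().__init__()
        self.cells = []      # text of every finished header cell
        self._parts = None   # text fragments of the open cell, or None if no cell is open

    def _close_cell(self):
        # Finish the currently open cell, if there is one.
        if self._parts is not None:
            self.cells.append(" ".join(self._parts))
            self._parts = None

    def handle_starttag(self, tag, attrs):
        # A new cell or row implies the end of the previous cell.
        if tag in ("th", "td", "tr"):
            self._close_cell()
        if tag == "th":
            self._parts = []

    def handle_endtag(self, tag):
        # Explicit end tags, when present, close the cell as well.
        if tag in ("th", "tr", "table"):
            self._close_cell()

    def handle_data(self, data):
        text = data.strip()
        if self._parts is not None and text:
            self._parts.append(text)

    def close(self):
        # Flush buffered text, then close a cell left open by truncated input.
        super().close()
        self._close_cell()


def header_cells(html):
    parser = HeaderCellCollector()
    parser.feed(html)
    parser.close()
    return parser.cells


# Shortened version of the table from the question: no </th> or </tr> anywhere.
sample = ("<table id='reportTable'><tr>"
          "<th>Номер<br>поезда<th>Дата<br>отправления<th>Маршрут</table>")
print(header_cells(sample))
```

For anything beyond pulling out cell text, repairing the markup with Tidy or relying on the lxml parser, as described above, is the more robust route.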

solution1
0 ACCEPTED 2017-11-23 23:02:58