
How can I extract a specific element from HTML code using Python

I'm not very confident with HTML, and I'm having trouble parsing this portion of HTML code (the output of print soup.prettify()) with Python.

 $("#global-flash").html(""); $('#reviews-tab-navigation').trigger('repaint'); $('#edit-review-tab').html(' <div class='\\"row-fluid\\"'> \\n <div class='\\"span3\\"'> \\n <div class='\\"label' full-height="" id='\\"review-search-result-panel\\"' use-bootstrap-tables\\"=""> \\n <span class='\\"panel-headline\\"'> Rezensionsdaten&lt;\\/span&gt;\\n <hr/> \\n\\n <table class='\\"table' id='\\"review-search-result-list\\"' table-hover="" table-striped\\"=""> \\n <thead> \\n <tr> \\n <th> \\n <span class='\\"review-count\\"'> 5&lt;\\/span&gt;\\n\\n Rezensionen gefunden\\n &lt;\\/th&gt;\\n &lt;\\/tr&gt;\\n &lt;\\/thead&gt;\\n\\n <tbody> \\n <tr> \\n <td class='\\"selectable-review-entry\\"' data-mastertstyle-id='\\"\\"' data-review-id='\\"10613555\\"'> \\n <span btn-link="" btn-small="" class='\\"btn' review-list-link\\"=""> \\n 5\\n <img 2015\\"="" alt='\\"Bewertung' src='\\"http://bp-webtools1.otto.boreus.de/tools/images/app/reviews/bewertung_stern_2015.png\\"' stern=""/> \\n\\n <span aderisce="" anzi="" bene="" colore="" come="" difettucci="" e="" foto,="" i="" in="" morbidissima,="" non="" pelle,="" piacevole="" rotolini.\\"="" segnare="" senza="" stringe="" sulla="" title='\\"Bel'> Bel colore come in foto, morbidissima, piacevole sulla pelle, non stringe anzi aderisce bene senza segnare i difettucci ei rotolini.&lt;\\/span&gt;\\n &lt;\\/span&gt;\\n &lt;\\/td&gt;\\n &lt;\\/tr&gt;\\n <tr> \\n <td class='\\"selectable-review-entry\\"' data-mastertstyle-id='\\"\\"' data-review-id='\\"10610141\\"'> \\n <span btn-link="" btn-small="" class='\\"btn' review-list-link\\"=""> \\n 5\\n <img 2015\\"="" alt='\\"Bewertung' src='\\"http://bp-webtools1.otto.boreus.de/tools/images/app/reviews/bewertung_stern_2015.png\\"' stern=""/> \\n\\n <span title='\\"bella\\"'> bella&lt;\\/span&gt;\\n &lt;\\/span&gt;\\n &lt;\\/td&gt;\\n &lt;\\/tr&gt;\\n <tr> \\n <td class='\\"selectable-review-entry\\"' data-mastertstyle-id='\\"\\"' data-review-id='\\"10575319\\"'> \\n <span btn-link="" 
btn-small="" class='\\"btn' review-list-link\\"=""> \\n 4\\n <img 2015\\"="" alt='\\"Bewertung' src='\\"http://bp-webtools1.otto.boreus.de/tools/images/app/reviews/bewertung_stern_2015.png\\"' stern=""/> \\n\\n <span buona="" morbido.\\"="" qualità-prezzo,="" rapporto="" tessuto="" title='\\"Buon' vestibilità,=""> Buon rapporto qualità-prezzo, buona vestibilità, tessuto morbido.&lt;\\/span&gt;\\n &lt;\\/span&gt;\\n &lt;\\/td&gt;\\n &lt;\\/tr&gt;\\n <tr> \\n <td class='\\"selectable-review-entry\\"' data-mastertstyle-id='\\"\\"' data-review-id='\\"10554514\\"'> \\n <span btn-link="" btn-small="" class='\\"btn' review-list-link\\"=""> \\n 5\\n <img 2015\\"="" alt='\\"Bewertung' src='\\"http://bp-webtools1.otto.boreus.de/tools/images/app/reviews/bewertung_stern_2015.png\\"' stern=""/> \\n\\n <span buon="" capo!="" giusto="" ottima="" peso\\"="" qualità,="" title='\\"Davvero' un=""> Davvero un buon capo! Ottima qualità, giusto peso&lt;\\/span&gt;\\n &lt;\\/span&gt;\\n &lt;\\/td&gt;\\n &lt;\\/tr&gt;\\n <tr> \\n <td class='\\"selectable-review-entry\\"' data-mastertstyle-id='\\"\\"' data-review-id='\\"9469234\\"'> \\n <span btn-link="" btn-small="" class='\\"btn' review-list-link\\"=""> \\n 5\\n <img 2015\\"="" alt='\\"Bewertung' src='\\"http://bp-webtools1.otto.boreus.de/tools/images/app/reviews/bewertung_stern_2015.png\\"' stern=""/> \\n\\n <span ....="" altri="" anche="" bello="" colori="" e="" funzionale.="" in="" regolare.\\"="" taglia="" title='\\"Preso'> Preso anche in altri colori .... bello e funzionale. 
Taglia regolare.&lt;\\/span&gt;\\n &lt;\\/span&gt;\\n &lt;\\/td&gt;\\n &lt;\\/tr&gt;\\n &lt;\\/tbody&gt;\\n &lt;\\/table&gt;\\n &lt;\\/div&gt;\\n &lt;\\/div&gt;\\n\\n <div class='\\"span9\\"'> \\n <div class='\\"row-fluid\\"'> \\n <div class='\\"span3\\"'> \\n <div class='\\"label' full-height\\"="" id='\\"product-data-panel\\"'> \\n <span class='\\"panel-headline\\"'> Informazioni articolo&lt;\\/span&gt;\\n <hr/> \\n <a href='\\"https://www.bonprix.it/search.htm?qu=95341195\\"' target='\\"_blank\\"'> <img src="\\'http://image01.bonprix.de/bonprixbilder//assets/114x160/13050022.jpg\\'"/> &lt;\\/a&gt;\\n <label> N. art.&lt;\\/label&gt;\\n <a class='\\"btn-link\\"' href='\\"https://www.bonprix.it/search.htm?qu=95341195\\"' target='\\"_blank\\"'> 95341195&lt;\\/a&gt;\\n <label> Masterstyle-ID&lt;\\/label&gt;\\n52826321\\n <label> Digistyle-ID&lt;\\/label&gt;\\n12709620\\n <label> Ø Media dei voti&lt;\\/label&gt;\\n4.45 <img 2015\\"="" alt='\\"Bewertung' src='\\"http://bp-webtools1.otto.boreus.de/tools/images/app/reviews/bewertung_stern_2015.png\\"' stern=""> \\n <label> Lunghezza&lt;\\/label&gt;\\nGiusto\\n <label> Larghezza&lt;\\/label&gt;\\nGiusto\\n <label> Disponibilità&lt;\\/label&gt;\\n\\n(37)\\n\\n &lt;\\/div&gt;\\n &lt;\\/div&gt;\\n\\n <div class='\\"span5\\"'> \\n <div class='\\"label' full-height\\"="" id='\\"single-review-panel\\"'> \\n <span class='\\"panel-headline\\"'> Dati cliente&lt;\\/span&gt;\\n <hr/> \\n <table class='\\"customer-info-table\\"'> \\n <tr> \\n <td> \\n <label> Nome&lt;\\/label&gt;\\n nome\\n &lt;\\/td&gt;\\n <td> \\n <label> Cognome&lt;\\/label&gt;\\n cognome\\n &lt;\\/td&gt;\\n &lt;\\/tr&gt;\\n <tr> \\n <td> \\n <label> Codice cliente&lt;\\/label&gt;\\n N/A\\n &lt;\\/td&gt;\\n <td> \\n <label> Indirizzo e-mail&lt;\\/label&gt;\\n ********@gmail.com\\n &lt;\\/td&gt;\\n &lt;\\/tr&gt;\\n&lt;\\/table&gt;\\n\\n <span class='\\"panel-headline\\"'> Commento articolo&lt;\\/span&gt;\\n <hr/> \\n <i class='\\"rating' r5\\"=""> &lt;\\/i&gt; 
<br/> \\n\\n <textarea id='\\"review-text\\"' name='\\"text\\"' readonly='\\"readonly\\"' rows='\\"12\\"'>\\nBel colore come in foto, morbidissima, piacevole sulla pelle, non stringe anzi aderisce bene senza segnare i difettucci ei rotolini.&lt;\\/textarea&gt;\\n\\n<span class='\\"panel-headline\\"'>Commenti sulla vestibilità&lt;\\/span&gt;\\n<hr/>\\n<table class='\\"size-info-table\\"'>\\n <tr>\\n <td>\\n <label>Lunghezza&lt;\\/label&gt;\\n Giusto\\n &lt;\\/td&gt;\\n <td>\\n <label>Larghezza&lt;\\/label&gt;\\n Giusto\\n &lt;\\/td&gt;\\n <td>\\n <label>Taglia&lt;\\/label&gt;\\n 62/64\\n &lt;\\/td&gt;\\n <td>\\n <label>Varianti&lt;\\/label&gt;\\n  \\n &lt;\\/td&gt;\\n <td>\\n <label>Statura&lt;\\/label&gt;\\n 165-169\\n &lt;\\/td&gt;\\n &lt;\\/tr&gt;\\n&lt;\\/table&gt;\\n<p>\\n <table class='\\"table\\"'>\\n <tr>\\n <td>\\n <b>Rezensions-ID:&lt;\\/b&gt;\\n <span id='\\"review-id\\"'>10613555&lt;\\/span&gt;\\n &lt;\\/td&gt;\\n <td>\\n <b>Creata:&lt;\\/b&gt;\\n <span class='\\"utc-date\\"'>\\n 01.10.2017 11:06:26\\n &lt;\\/span&gt;\\n &lt;\\/td&gt;\\n &lt;\\/tr&gt;\\n <tr>\\n <td>\\n <b>Letzte Änderung&lt;\\/b&gt;\\n <span class='\\"utc-date\\"'>\\n 01.10.2017 11:06:26\\n &lt;\\/span&gt;\\n &lt;\\/td&gt;\\n <td>\\n <b>di&lt;\\/b&gt;\\n Kunde\\n &lt;\\/td&gt;\\n &lt;\\/tr&gt;\\n <tr>\\n <td>\\n <b>Data pubblicazione:&lt;\\/b&gt;\\n <span class='\\"utc-date\\"'>\\n 01.10.2017 11:06:26\\n &lt;\\/span&gt;\\n &lt;\\/td&gt;\\n &lt;\\/tr&gt;\\n &lt;\\/table&gt;\\n&lt;\\/p&gt;\\n\\n &lt;\\/div&gt;\\n &lt;\\/div&gt;\\n\\n <div class='\\"span4\\"'>\\n <div class='\\"label' full-height\\"="" id='\\"editing-functions-panel\\"'>\\n <span class='\\"panel-headline\\"'>Modifica&lt;\\/span&gt;\\n<hr/>\\n<div>\\n <label>Scegli un destinatario&lt;\\/label&gt;\\n <a class='\\"btn-link\\"' false;\\"="" href='\\"#\\"' id='\\"reset-recipients-list-link\\"' onclick='\\"reviews.resetRecipientsList(true);' return="">Cancella la lista destinatari&lt;\\/a&gt;\\n <select 
id='\\"email-recipients-select\\"' name='\\"email-recipients-select\\"'><option value='\\"\\"'>&lt;\\/option&gt;\\n<option value='\\"*****@*****.it\\"'>servizio@******.it&lt;\\/option&gt;&lt;\\/select&gt;\\n <textarea id='\\"email-recipients-textarea\\"' name='\\"email-recipients-textarea\\"'>\\n&lt;\\/textarea&gt;\\n <a class='\\"btn\\"' data-confirm-translation-modified-text='\\"Die' false;\\"="" gespeichert.="" href='\\"#\\"' id='\\"send-mail-btn\\"' nicht="" noch="" onclick='\\"reviews.sendMail(true);' return="" rezension="" trotzdem="" versenden?\\"="" wurde="" übersetzung="">Invia recensione&lt;\\/a&gt;\\n <label>Traduci&lt;\\/label&gt;\\n <textarea id='\\"review-uebersetzung\\"' name='\\"text\\"'>\\n&lt;\\/textarea&gt;\\n <label>Feedback al cliente&lt;\\/label&gt;\\n <textarea id='\\"review-feedbackToCustomer\\"' name='\\"text\\"'>\\n&lt;\\/textarea&gt;\\n&lt;\\/div&gt;\\n<div>\\n <label>Tipo di recensione&lt;\\/label&gt;\\n <select id='\\"review-meinungstyp\\"' name='\\"meinungstyp\\"'><option selected='\\"selected\\"' value='\\"R\\"'>Recensione&lt;\\/option&gt;\\n<option value='\\"G\\"'>Risposte&lt;\\/option&gt;\\n<option value='\\"A\\"'>Archivio&lt;\\/option&gt;&lt;\\/select&gt;\\n&lt;\\/div&gt;\\n<div id='\\"aktiv-checkboxes-container\\"'>\\n <div class='\\"control-group' use-bootstrap-groups\\"="">\\n <label class='\\"control-label\\"' for='\\"review_aktiv\\"'>Pubblicata&lt;\\/label&gt;\\n <input id='\\"review_aktiv\\"' name='\\"review_aktiv\\"' type='\\"hidden\\"' value='\\"T\\"'/>\\n <div class='\\"controls\\"'>\\n <div class='\\"btn-group\\"'>\\n <a btn="" btn-success\\"="" class='\\"change-active-state' data-value='\\"T\\"' href='\\"#\\"'>Sì&lt;\\/a&gt;\\n <a \\"="" btn="" class='\\"change-active-state' data-value='\\"F\\"' href='\\"#\\"'>No&lt;\\/a&gt;\\n &lt;\\/div&gt;\\n &lt;\\/div&gt;\\n &lt;\\/div&gt;\\n&lt;\\/div&gt;\\n\\n<div class='\\"row-fluid' form-actions="" possible-multi-line\\"="">\\n <a btn-primary\\"="" class='\\"btn' false;\\"="" 
href='\\"#\\"' id='\\"save-review-btn\\"' onclick='\\"reviews.saveReview(true);' remote='\\"true\\"' return="">Salva recensione&lt;\\/a&gt;\\n <a btn-danger\\"="" class='\\"btn' data-confirm-dialog-title='\\"Cancella' false;\\"="" href='\\"#\\"' id='\\"delete-review-btn\\"' onclick='\\"reviews.deleteSelectedReview(true);' recensioni\\"="" remote='\\"true\\"' return=""><i class="\\'icon-trash" icon-white\\'="">&lt;\\/i&gt; Cancella recensioni&lt;\\/a&gt;\\n&lt;\\/div&gt;\\n\\n &lt;\\/div&gt;\\n &lt;\\/div&gt;\\n &lt;\\/div&gt;\\n &lt;\\/div&gt;\\n&lt;\\/div&gt;\\n').trigger('repaint'); reviews.initEditReviewTab(); $('#reviews-tab-navigation').tabs('option', 'active', 0); $('.search-tab-buttons').html('<div class='\\"search-tab-buttons\\"'>\\n <table>\\n <tr>\\n <td><a btn-primary\\"="" class='\\"btn' false;\\"="" href='\\"#\\"' onclick='\\"reviews.submitSearchReviews();' remote='\\"true\\"' return="">Cerca&lt;\\/a&gt;&lt;\\/td&gt;\\n <td><a btn-default\\"="" class='\\"btn' false;\\"="" href='\\"#\\"' onclick='\\"reviews.setDefaultSearchParams();' remote='\\"true\\"' return="">Ricerca standard&lt;\\/a&gt;&lt;\\/td&gt;\\n <td><a btn-default\\"="" class='\\"btn' false;\\"="" href='\\"#\\"' onclick='\\"reviews.showStatistics(true);' remote='\\"true\\"' return="">Statistiche&lt;\\/a&gt;&lt;\\/td&gt;\\n &lt;\\/tr&gt;\\n &lt;\\/table&gt;\\n&lt;\\/div&gt;'); $('.mini-statistics').replaceWith(' <div class='\\"mini-statistics\\"'>\\n <p>\\n Da controllare: 100 / Pubblicata: 304316 / Non pubblicata: 9207 / Prenotate: [0], mie: [0]\\n &lt;\\/p&gt;\\n &lt;\\/div&gt;\\n'); 
</p></div></a></td></a></td></a></td></tr></table></div></i></a></a></div></a></a></div></div></label></div></div></option></option></option></select></label></div></textarea></label></textarea></label></a></textarea></option></option></select></a></label></div></span></div></div></span></b></td></tr></b></td></span></b></td></tr></span></b></td></span></b></td></tr></table></p></label></td></label></td></label></td></label></td></label></td></tr></table></span></textarea> </i> </span> </label> </td> </label> </td> </tr> </label> </td> </label> </td> </tr> </table> </span> </div> </div> </label> </label> </label> </img> </label> </label> </label> </a> </label> </a> </span> </div> </div> </div> </div> </span> </span> </td> </tr> </span> </span> </td> </tr> </span> </span> </td> </tr> </span> </span> </td> </tr> </span> </span> </td> </tr> </tbody> </span> </th> </tr> </thead> </table> </span> </div> </div> </div> 

Basically, I would like to extract the number after each "data-review-id" (this portion of HTML contains five: 10613555, 10610141, 10575319, 10554514, 9469234), but I don't understand which tags I should select to get the result I want.

I've tried several combinations of soup.find_all, but without any result.

Any help or suggestion would be really appreciated.

Thanks in advance!

The HTML you have is inside some JavaScript and appears to have been escaped. After copying and pasting the exact HTML you have given and assigning it to html, the following could be used:

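from bs4 import BeautifulSoup

# Paste the escaped markup from the question between the triple quotes.
html = """ ---- add HTML here ---"""

# Drop the double quotes and unescape the closing-tag slashes.
html = html.replace('"', '').replace(r'\/', '/')
soup = BeautifulSoup(html, "html.parser")

# Select only the <td> elements that carry a data-review-id attribute.
for td in soup.find_all('td', {'data-review-id': True}):
    # Strip any leftover backslashes or quotes around the value.
    print(td['data-review-id'].strip('\\"'))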

This then displays:

10613555
10610141
10575319
10554514
9469234
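
If only the IDs are needed, a regular expression over the raw text may be enough, since it avoids parsing the escaped markup at all. This is a different technique from the BeautifulSoup approach above, offered as a minimal sketch: the `sample` string is a hypothetical two-row excerpt in the same escaped style, and `extract_review_ids` is an illustrative helper name, not part of any library.

```python
import re

# Hypothetical excerpt written in the same escaped style as the question's markup.
sample = r"""
<td class='\"selectable-review-entry\"' data-mastertstyle-id='\"\"' data-review-id='\"10613555\"'>
<td class='\"selectable-review-entry\"' data-mastertstyle-id='\"\"' data-review-id='\"10610141\"'>
"""

# Match the attribute name, skip any quotes or backslashes, then capture the digits.
REVIEW_ID = re.compile(r"""data-review-id=[\\'"]*(\d+)""")


def extract_review_ids(text):
    """Return every data-review-id value found in text, in document order."""
    return REVIEW_ID.findall(text)


print(extract_review_ids(sample))  # ['10613555', '10610141']
```

The pattern tolerates any mix of quotes and backslashes between the `=` and the number, so it should work whether or not the text has been unescaped first. It assumes the IDs are purely numeric, as they are in the question.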


 