How can I extract the text between comment tags with Beautiful Soup?

I have the following HTML code:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<html><!-- InstanceBegin template="/Templates/BandDetails.dwt" codeOutsideHTMLIsLocked="false" -->
<head>
<!-- InstanceBeginEditable name="doctitle" -->
<title>&lt;BLR&gt;</title>
<!-- InstanceEndEditable -->
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<!-- InstanceBeginEditable name="head" --><!-- InstanceEndEditable -->
</head>

<body>
<div align="center">
  <table width="0" border="0" cellpadding="0" cellspacing="0" id="mainTable">
    <tr>
      <td colspan="2" id="navbar"><!--#include file="menu.htm" --></td>
    </tr>
    <tr>
      <td id="maincontent"><table width="0" border="0" cellpadding="0" cellspacing="0" id="contentInner">
        <tr>
          <td class="bodytext">
            <p></p><!-- InstanceBeginEditable name="bigPicture-378wide" --><img src="images/BLRlarge.jpg" alt="BLR" width="378" height="324" class="PictureFloatRight"><!-- InstanceEndEditable -->          
            <!-- InstanceBeginEditable name="DAYdateMonthYear" -->
            <p>Thursday 11th March 2010 </p>
            <!-- InstanceEndEditable -->

How can I extract just the text contained between the comment tags using Beautiful Soup? For example, I want to return:

<BLR>

Thursday 11th March 2010

Thanks

You might find this program helpful:

from bs4 import BeautifulSoup
from bs4.element import Comment

html_doc = 'x.html'
# Open in binary mode so Beautiful Soup can detect the encoding itself
with open(html_doc, 'rb') as f:
    soup = BeautifulSoup(f, 'html.parser')

# Identify the start comment
def isInstanceBeginEditable(text):
    return (isinstance(text, Comment) and
            text.strip().startswith("InstanceBeginEditable"))

# Identify the end comment
def isInstanceEndEditable(text):
    return (isinstance(text, Comment) and
            text.strip().startswith("InstanceEndEditable"))

# Look for start comments (string= replaces the older text= argument, bs4 4.4.0+)
for instanceBeginEditable in soup.find_all(string=isInstanceBeginEditable):
    # We found a start comment, look at all text and comments:
    for text in instanceBeginEditable.find_all_next(string=True):
        # We found a text or comment, examine it closely
        if isInstanceEndEditable(text):
            # We found the end comment, everybody out of the pool
            break
        if isinstance(text, Comment):
            # We found a comment, ignore
            continue
        if not text.strip():
            # We found a blank text, ignore
            continue
        # Whatever is left must be good
        print(text)
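
Another option, if all of the page's text is acceptable rather than only the text between the comment tags, is get_text(), which leaves the comments themselves out:

import bs4

# html_text is assumed to hold the page source as a string
soup = bs4.BeautifulSoup(html_text, 'html.parser')
print(soup.get_text().replace('\n', ''))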
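
The comment-walking approach from the first answer can also be wrapped in a function that returns the text keyed by region name instead of printing it. This is only a minimal sketch, not the answerer's code: the names editable_regions, BEGIN_EDITABLE, and SAMPLE are made up for illustration, SAMPLE is a shortened copy of the question's HTML, and the built-in html.parser is assumed as the parser.

```python
import re

from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

# Hypothetical helper pattern: captures the name="..." of a start comment
BEGIN_EDITABLE = re.compile(r'InstanceBeginEditable\s+name="([^"]*)"')


def editable_regions(html):
    """Map each editable region's name to the non-blank text inside it."""
    soup = BeautifulSoup(html, "html.parser")
    regions = {}
    for node in soup.descendants:
        if not isinstance(node, Comment):
            continue
        match = BEGIN_EDITABLE.search(node)
        if match is None:
            continue  # some other comment, e.g. InstanceEndEditable
        pieces = []
        # Walk everything after the start comment, in document order
        for following in node.next_elements:
            if isinstance(following, Comment):
                if following.strip().startswith("InstanceEndEditable"):
                    break  # reached the end of this region
                continue  # unrelated comment, skip it
            if isinstance(following, NavigableString) and following.strip():
                pieces.append(following.strip())
        regions[match.group(1)] = " ".join(pieces)
    return regions


# Shortened copy of the HTML from the question (assumption: trimmed by hand)
SAMPLE = """<html><!-- InstanceBegin template="/Templates/BandDetails.dwt" -->
<head>
<!-- InstanceBeginEditable name="doctitle" -->
<title>&lt;BLR&gt;</title>
<!-- InstanceEndEditable -->
<!-- InstanceBeginEditable name="head" --><!-- InstanceEndEditable -->
</head>
<body>
<p></p><!-- InstanceBeginEditable name="bigPicture-378wide" --><img src="images/BLRlarge.jpg" alt="BLR"><!-- InstanceEndEditable -->
<!-- InstanceBeginEditable name="DAYdateMonthYear" -->
<p>Thursday 11th March 2010 </p>
<!-- InstanceEndEditable -->
</body></html>"""

if __name__ == "__main__":
    for name, text in editable_regions(SAMPLE).items():
        if text:  # regions holding only tags or nothing come back empty
            print(name, "->", text)
```

Regions that contain only an image or nothing at all ("head" and "bigPicture-378wide" here) come back as empty strings, so the caller can decide whether to drop them.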
