Using BeautifulSoup to transfer text from one HTML document to another

I am trying to extract the category names and question/answer text from pages on this site and insert them into my own HTML document using Python. I've been able to extract the clue text with soup.find_all("td", class_="clue_text"), and in theory I know how I would extract the other data. What I don't know is how to insert that data into my own HTML document, especially since BeautifulSoup returns a list and my document is formatted differently from the source. For example, I would want the clue text to replace "Category 2 Question 5" in the following HTML:

<table id="4_1" cellpadding="0" cellspacing="0" width="100%" 
class="hiddenDiv" onclick="hidequestion(this.id);" border="0"><tr><td 
valign="middle" align="center">
Category 2 Question 5
</td></tr></table>

How would I go about using BeautifulSoup to output into my document? Is there a better method I could use instead?

You can use the .string property to change the text/string of any tag.

>>> from bs4 import BeautifulSoup
>>> html = '''<table id="4_1" cellpadding="0" cellspacing="0" width="100%"
... class="hiddenDiv" onclick="hidequestion(this.id);" border="0"><tr><td
... valign="middle" align="center">
... Category 2 Question 5
... </td></tr></table>'''
>>> soup = BeautifulSoup(html, 'lxml')
>>> clue = 'this is my clue text'
>>> first_rowcol = soup.find('table').find('td')
>>> first_rowcol
<td align="center" valign="middle">
Category 2 Question 5
</td>
>>> first_rowcol.string = clue
>>> first_rowcol
<td align="center" valign="middle">this is my clue text</td>
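
Since find_all() returns a list of tags, the same .string assignment can be applied in a loop. The following is only a minimal sketch: source_html and template_html are made-up stand-ins for the scraped page and for your own document, and it assumes the clues should fill the hiddenDiv tables in document order.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page being scraped.
source_html = """
<table>
  <tr><td class="clue_text">First clue</td></tr>
  <tr><td class="clue_text">Second clue</td></tr>
</table>
"""

# Hypothetical stand-in for your own document.
template_html = """
<table id="4_1" class="hiddenDiv"><tr><td>Category 2 Question 5</td></tr></table>
<table id="4_2" class="hiddenDiv"><tr><td>Category 2 Question 6</td></tr></table>
"""

source = BeautifulSoup(source_html, "html.parser")
template = BeautifulSoup(template_html, "html.parser")

# find_all() returns a list of tags; reduce each one to its plain text.
clues = [td.get_text(strip=True) for td in source.find_all("td", class_="clue_text")]

# Collect the first cell of every placeholder table in the template.
cells = [table.find("td") for table in template.find_all("table", class_="hiddenDiv")]

# Pair cells with clues in order; zip() stops at the shorter of the two lists.
for cell, clue in zip(cells, clues):
    cell.string = clue

# str(template) is the modified document, ready to be written to a file.
result_html = str(template)
```

If the clues need to land in specific cells rather than in order, looking each table up by its id with template.find("table", id=...) would be one alternative.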

Or, if you want to replace the whole td tag in your document with a td tag you found using BeautifulSoup, you can use the replace_with() method.

>>> first_row = soup.find('table').tr
>>> first_row
<tr><td align="center" valign="middle">
Category 2 Question 5
</td></tr>
>>> clue_tag = BeautifulSoup('<td>this is my clue tag</td>', 'html.parser').td
>>> first_row.td.replace_with(clue_tag)
>>> first_row
<tr><td>this is my clue tag</td></tr>
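
Note that replacing the tag drops the original td's attributes (align, valign). If you want to keep them, one option is to build the replacement with new_tag() and copy the attributes over. This is a rough sketch under that assumption, with a shortened version of the HTML from the question; the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

html = ('<table id="4_1"><tr><td valign="middle" align="center">'
        'Category 2 Question 5</td></tr></table>')
soup = BeautifulSoup(html, "html.parser")

old_td = soup.find("table").find("td")

# Build a fresh <td> that belongs to the same soup and copy the attributes.
new_td = soup.new_tag("td")
new_td.attrs = dict(old_td.attrs)
new_td.string = "this is my clue tag"

# replace_with() swaps the tag in the tree and returns the element it removed.
removed = old_td.replace_with(new_td)
```

The removed tag is still a usable object, so it can be inspected or inserted somewhere else if needed.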

