A tag contains text, and also contains another tag with text. How do I get the text, but not the text inside the extra tag, with BeautifulSoup in Python?

Question

I have the following tag:

<div class="example_class">
 <b>
 <img src="image_source"/>
          extra information
          <a href="reference">
           extra information
          </a>
 </b>
         .
         <br/>
 <br/>
         This is the text I want to get.
         <lt>
 </lt>
         br /
         <gt>
</gt>
<lt>
</lt>
         br /
         <gt>
</gt>
<lt>
</lt>
         br /
         <gt>
</gt>
         This is the rest of the text.
</div>

I want to get the text 'This is the text I want to get. This is the rest of the text.', but I don't know how. When I try the following:

soup_result = soup.find('div',{'class': 'example_class'})
result = soup_result.get_text()

I get:

'\n\n\n         extra information\n         \n          extra information\n         \n\n        .\n        \n\n        This is the text I want to get.\n        \n\n        br /\n        \n\n\n\n        br /\n        \n\n\n\n        br /\n        \n\n        This is the rest of the text.'

How can I make sure that the "extra information" and the newlines with lots of whitespace between them are not in the result?

Answer 1

I assume br / is the tag <br />. You can call .extract() on the unwanted tags to remove them from the result before calling .get_text():

div = soup.find(class_="example_class")

for b in div.find_all("b"):
    b.extract()

text = div.get_text(strip=True, separator="\n")
print(text)

Prints:

.
This is the text I want to get.
This is the rest of the text.
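
The output above still contains the lone "." and is spread over several lines, while the question asked for a single string. A minimal, self-contained sketch of one way to get there follows. It assumes the br / fragments in the question are real <br/> tags (the sample markup below is simplified accordingly), and the filter that drops the "." is specific to this sample rather than a general rule.

```python
from bs4 import BeautifulSoup

# Simplified sample markup (an assumption): the "br /" fragments from the
# question are written here as real <br/> tags.
html = """
<div class="example_class">
 <b>
  <img src="image_source"/>
  extra information
  <a href="reference">extra information</a>
 </b>
 .
 <br/><br/>
 This is the text I want to get.
 <br/><br/><br/>
 This is the rest of the text.
</div>
"""

soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", class_="example_class")

# Remove every <b> subtree so its text no longer contributes to the result.
for b in div.find_all("b"):
    b.extract()

# stripped_strings yields each remaining text node with surrounding
# whitespace removed and skips nodes that are whitespace only.
parts = list(div.stripped_strings)

# Drop the lone "." left over between </b> and the first <br/>
# (a filter chosen for this sample only).
text = " ".join(p for p in parts if p != ".")
print(text)
```

Note that extract() modifies the parsed tree in place, so if the <b> content is needed later, parse the document again or work on a copy.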

Solution 1
0 2022-04-28 12:49:23
