Extract text inside tags from an HTML document

Question

I have an HTML document like this: https://dropmefiles.com/wezmb. I need to extract the text between <span id="1"> and </span>, but I don't know how. I tried this code:

from bs4 import BeautifulSoup

with open("10_01.htm") as fp:
    soup = BeautifulSoup(fp, features="html.parser")
    for a in soup.find_all('span'):
        print(a.string)

But it extracts the text from every span tag. How can I extract only the text between <span id="1"> and </span> in Python?

Answer 1

What you need is the .contents attribute, which holds a tag's direct children as a list (see the Beautiful Soup documentation).

Find the <span id="1">...</span> element and loop over its contents:

# The id attribute is a string in HTML, so pass it as "1".
for x in soup.find(id="1").contents:
    print(x)

Or, to take only the first child:

# ids are unique, so find() returns the one matching span; [0] is its first child.
x = soup.find(id="1").contents[0]
print(x)

This prints an empty line, then 10, then another empty line. That is because the string in the HTML really looks like that: 10 sits on its own line inside the span, so the extracted string is '\n10\n'.
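
To see this without the original file, here is a minimal sketch. The SAMPLE_HTML string is a made-up stand-in for the asker's document (an assumption, since the real file is not reproduced here); only the bs4 calls already used above are involved.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the asker's 10_01.htm; the real markup may differ.
SAMPLE_HTML = '<div><span id="1">\n10\n</span><span id="2">\n20\n</span></div>'

soup = BeautifulSoup(SAMPLE_HTML, features="html.parser")

# find() returns only the first tag whose id attribute equals "1".
span = soup.find(id="1")

# .contents is a list of the tag's direct children; here it is one text node.
children = span.contents
first_child = children[0]

# repr() makes the surrounding newlines visible instead of printing blank lines.
print(repr(first_child))
```

Unlike the find_all('span') loop in the question, this touches only the span with id "1" and leaves the second span alone.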

If you want just x = '10' from x = '\n10\n', you can do x = x[1:-1], since '\n' is a single character. Hope this helped.
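
The slice only works when exactly one character pads each side. As a hedged alternative that is not part of the original answer, str.strip() or Beautiful Soup's get_text(strip=True) removes any amount of surrounding whitespace. The sketch below again uses a made-up SAMPLE_HTML string in place of the asker's file.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the asker's document; the real markup may differ.
SAMPLE_HTML = '<div><span id="1">\n10\n</span></div>'

soup = BeautifulSoup(SAMPLE_HTML, features="html.parser")
span = soup.find(id="1")

raw = span.contents[0]            # the text node, including both newlines

sliced = raw[1:-1]                # the answer's approach: drop one char per side
stripped = raw.strip()            # drops all leading and trailing whitespace
text = span.get_text(strip=True)  # strips the span's text without indexing .contents

print(sliced, stripped, text)
```

All three agree on this input; they differ only if the span has extra spaces or blank lines around the number, where the slice would leave some of them behind.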

Solution 1
1 | Accepted | 2021-05-19 13:41:35
