
Extracting text inside tags from an HTML document

I have an HTML document like this: https://dropmefiles.com/wezmb. I need to extract the text between the tags <span id="1"> and </span>, but I don't know how. I tried this code:

from bs4 import BeautifulSoup

with open("10_01.htm") as fp:
    soup = BeautifulSoup(fp, features="html.parser")
    for a in soup.find_all('span'):
        print(a.string)

But it prints the text from every span tag. So how can I extract only the text between <span id="1"> and </span> in Python?

What you need is the .contents attribute (see the Beautiful Soup documentation).

Find the span <span id="1">...</span> using

for x in soup.find(id="1").contents:
    print(x)

or

x = soup.find(id="1").contents[0]  # there should be only one element with the id 1
print(x)

This will give you:


10

that is, an empty line, followed by 10, followed by another empty line. This is because the string in the HTML really looks like that: as you can see in the HTML, 10 sits on its own line inside the span. The actual string is '\n10\n'.

If you want just x = '10' from x = '\n10\n', you can do x = x[1:-1], since '\n' is a single character. Hope this helps.
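As an alternative to slicing, here is a minimal sketch that strips the surrounding whitespace with str.strip() or get_text(strip=True), which should also cope with extra spaces or Windows line endings. The inline html string is a hypothetical stand-in for 10_01.htm (the real file is not shown here), and the span with id "0" is invented only to show that other spans are ignored.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the contents of 10_01.htm (assumption, not the real file).
html = """
<html><body>
<span id="0">
other
</span>
<span id="1">
10
</span>
</body></html>
"""

soup = BeautifulSoup(html, features="html.parser")

# Match only the span whose id attribute is "1".
span = soup.find("span", id="1")

raw = span.contents[0]            # the text node, newlines included
text = raw.strip()                # drop leading/trailing whitespace
also = span.get_text(strip=True)  # same idea, done by Beautiful Soup

print(repr(str(raw)))
print(text)
print(also)
```

Note that find() returns None when no element matches, so on a real file it may be worth checking the result before reading .contents.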
