
Beautiful Soup Nested Tag Search

I am trying to write a Python program that counts the words on a web page. I use Beautiful Soup 4 to scrape the page, but I have difficulty accessing nested HTML tags (for example, a <p class="hello"> inside a <div>).

Every time I try to find such a tag using the page.findAll() method (page is the Beautiful Soup object containing the whole page), it simply doesn't find any, even though they are there. Is there a simple method or another way to do this?

I'm guessing that what you are trying to do is first look in a specific div tag, then search all the p tags inside it and count them, or do whatever else you want. For example:

import bs4

soup = bs4.BeautifulSoup(content, 'html.parser')

# This will get the div
div_container = soup.find('div', class_='some_class')  

# Then search in that div_container for all p tags with class "hello"
for ptag in div_container.find_all('p', class_='hello'):
    # prints the p tag content
    print(ptag.text)

Hope that helps.
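To tie this back to the original word-count question, the same pattern can be extended with a sum over the matched tags; a minimal sketch, using made-up HTML in place of the scraped page:

```python
import bs4

# Hypothetical HTML standing in for the scraped page
content = """
<div class="some_class">
  <p class="hello">one two three</p>
  <p class="hello">four five</p>
  <p>ignored paragraph</p>
</div>
"""

soup = bs4.BeautifulSoup(content, 'html.parser')
div_container = soup.find('div', class_='some_class')

# Count the words across every <p class="hello"> inside the div
word_count = sum(len(ptag.text.split())
                 for ptag in div_container.find_all('p', class_='hello'))
print(word_count)  # 5
```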

Try this one:

data = []
for nested_soup in soup.find_all('xyz'):
    data = data + nested_soup.find_all('abc')
# data now holds all the matched nested tags

Maybe you can turn it into a lambda and make it cool, but this works. Thanks.
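The loop above can also be flattened into a single list comprehension; a small sketch, with invented <xyz>/<abc> tags standing in for the real markup:

```python
from bs4 import BeautifulSoup

# Made-up markup: <abc> tags nested inside <xyz> tags
html = "<xyz><abc>1</abc></xyz><xyz><abc>2</abc><abc>3</abc></xyz>"
soup = BeautifulSoup(html, 'html.parser')

# Collect every <abc> found under every <xyz> into one flat list
data = [tag for nested_soup in soup.find_all('xyz')
            for tag in nested_soup.find_all('abc')]
print([t.text for t in data])  # ['1', '2', '3']
```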

UPDATE: I noticed that .text does not always return the expected result. At the same time, I realized there is a built-in way to get the text: sure enough, reading the docs shows a method called get_text(). Use it like this:

from bs4 import BeautifulSoup

with open('index.html', 'r') as fd:
    website = fd.read()
soup = BeautifulSoup(website, 'html.parser')
contents = soup.get_text(separator=" ")
print("number of words %d" % len(contents.split(" ")))

INCORRECT, please read the update above. Supposing that you have your HTML file locally in index.html, you can:

from bs4 import BeautifulSoup
import re

BLACKLIST = ["html", "head", "title", "script"]  # tags to be ignored

with open('index.html', 'r') as fd:
    website = fd.read()
soup = BeautifulSoup(website, 'html.parser')
tags = soup.find_all(True)  # find everything
print("there are %d" % len(tags))

count = 0
matcher = re.compile(r"(?:\s|<br>)+")  # non-capturing group, so split() returns only the words
for tag in tags:
    if tag.name.lower() in BLACKLIST:
        continue
    temp = matcher.split(tag.text)  # split on runs of whitespace and <br> tags
    temp = list(filter(None, temp))  # remove empty elements from the list
    count += len(temp)
print("number of words in the document %d" % count)

Please note that the count may not be accurate, perhaps because of formatting errors, false positives (it counts any token as a word, even if it is code), text that is shown dynamically using JavaScript or CSS, or other reasons.

You don't need to write a for loop. You can nest the soups, if you like.

BeautifulSoup(
    str(BeautifulSoup(page_source, 'html.parser').findAll('div')),
    'html.parser'
    ).findAll('p', {'class': 'hello'})
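As an alternative to re-parsing the stringified result, BeautifulSoup's select() method accepts a CSS selector and matches nested tags in one pass; a small sketch with made-up markup:

```python
from bs4 import BeautifulSoup

# Hypothetical page: one hello-paragraph inside a div, one outside
page_source = '<div><p class="hello">hi</p></div><p class="hello">outside</p>'
soup = BeautifulSoup(page_source, 'html.parser')

# CSS selector: <p class="hello"> anywhere inside a <div>
matches = soup.select('div p.hello')
print([m.text for m in matches])  # ['hi']
```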

You can find all <p> tags using regular expressions (the re module). Note that r.text is a string which contains the whole HTML of the site (r.content is raw bytes, which a string pattern cannot search).

For example:

import re
import requests

r = requests.get(url, headers=headers)
p_tags = re.findall(r'<p>.*?</p>', r.text)

This should get you all the <p> tags irrespective of whether they are nested or not. And if you want the <p> tags specifically inside some other tag, you can pass that whole tag as a string in the second argument instead of r.text.
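One caveat with the regex approach, sketched here on a made-up string: by default '.' does not match newlines, so a <p> tag whose content spans multiple lines is silently missed unless re.DOTALL is passed:

```python
import re

html = "<p>first\nparagraph</p><p>second</p>"

# Default: '.' does not match '\n', so the multi-line tag is missed
print(re.findall(r'<p>.*?</p>', html))  # ['<p>second</p>']

# With re.DOTALL, '.' matches newlines too and both tags are found
p_tags = re.findall(r'<p>.*?</p>', html, re.DOTALL)
print(p_tags)  # ['<p>first\nparagraph</p>', '<p>second</p>']
```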

Alternatively, if you just want the text, you can try this:

from readability import Document  # pip install readability-lxml
import requests

r = requests.get(url, headers=headers)
doc = Document(r.content)
simplified_html = doc.summary()

This will get you a more bare-bones form of the HTML from the site; now proceed with the parsing.
