
Strange Error in Python using BeautifulSoup Prettify method

I got the following problem. I wrote a simple "TextBasedBrowser" (if you can even call it a browser at this point :D). The website scraping and parsing with BS4 works great so far, but the output is formatted terribly and pretty much unreadable. As soon as I try to use the prettify() method from BS4 it throws an AttributeError. I searched quite a while on Google but couldn't find anything. This is my code (the prettify() call is commented out):

from bs4 import BeautifulSoup
import requests
import sys
import os

legal_html_tags = ['p', 'a', 'ul', 'ol', 'li', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'title']
saved_pages = []


def search_url(url):
    saved_pages.append(url.rstrip(".com"))
    url = requests.get(f'https://{url}')
    return url.text


def parse_html(html_page):
    final_text = ""
    soup = BeautifulSoup(html_page, 'html.parser')
    # soup = soup.prettify()
    plain_text = soup.find_all(text=True)
    for t in plain_text:
        if t.parent.name in legal_html_tags:
            final_text += '{} '.format(t)
    return final_text


def save_webpage(url, tb_dir):
    with open(f'{tb_dir}/{url.rstrip(".com")}.txt', 'w', encoding="utf-8") as tab:
        tab.write(parse_html(search_url(url)))


def check_url(url):
    if url.endswith(".com") or url.endswith(".org") or url.endswith(".net"):
        return True
    else:
        return False


args = sys.argv
directory = args[1]
try:
    os.mkdir(directory)
except FileExistsError:
    print("Error: File already exists")

while True:
    url_ = input()
    if url_ == "exit":
        break
    elif url_ in saved_pages:
        with open(f'{directory}/{url_}.txt', 'r', encoding="utf-8") as curr_page:
            print(curr_page.read())
    elif not check_url(url_):
        print("Error: Invalid URL")
    else:
        save_webpage(url_, directory)
        print(parse_html(search_url(url_)))

And this is the error:

Traceback (most recent call last):
  File "browser.py", line 56, in <module>
    save_webpage(url_, directory)
  File "browser.py", line 29, in save_webpage
    tab.write(parse_html(search_url(url)))
  File "browser.py", line 20, in parse_html
    plain_text = soup.find_all(text=True)
AttributeError: 'str' object has no attribute 'find_all'

If I include the encoding parameter in the prettify() method, it throws 'bytes' object instead of 'str' object.
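To illustrate why the message changes, here is a minimal sketch (with hypothetical HTML): prettify() returns a plain str, prettify(encoding=...) returns bytes, and neither type has a find_all() method.

from bs4 import BeautifulSoup

# Minimal sketch showing the return types involved.
html = "<p>Hello</p>"
soup = BeautifulSoup(html, 'html.parser')

print(type(soup.prettify()))                  # <class 'str'>
print(type(soup.prettify(encoding='utf-8')))  # <class 'bytes'>

# Re-assigning soup to either of these discards the BeautifulSoup object,
# so a later soup.find_all(...) raises AttributeError.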

prettify turns your parsed HTML object into a string, so you can't call find_all on it. Maybe you just want to return soup.prettify()? This might be what you want:

def parse_html(html_page):
    final_text = ""
    soup = BeautifulSoup(html_page, 'html.parser')
    plain_text = soup.find_all(text=True)
    for t in plain_text:
        if t.parent.name in legal_html_tags:
            # t is a NavigableString with no prettify() of its own,
            # so pretty-print its parent Tag instead.
            final_text += t.parent.prettify() + " "
    return final_text
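If the goal is just a readable dump of the whole page rather than text filtered by tag, the simpler variant hinted at above would be (a sketch keeping the question's function name):

def parse_html(html_page):
    # Return the whole document pretty-printed; note this keeps all of the
    # markup instead of extracting only the text of the allowed tags.
    soup = BeautifulSoup(html_page, 'html.parser')
    return soup.prettify()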

You have re-assigned the soup variable to a string by using the .prettify() method:

soup = soup.prettify()

find_all() is a method for soup objects only.

You should call find_all(text=True) first to extract all the HTML tags with text, and only then perform the string operations.
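Applied to the code in the question, that order might look like this (a sketch reusing the question's legal_html_tags list; the prettified copy is kept in a separate variable so soup is never overwritten):

def parse_html(html_page):
    soup = BeautifulSoup(html_page, 'html.parser')

    # Keep soup as a BeautifulSoup object so find_all() keeps working;
    # store the pretty-printed copy separately if it is needed at all.
    pretty_copy = soup.prettify()           # str, for display/debugging only

    final_text = ""
    for t in soup.find_all(text=True):      # extract the text nodes first
        if t.parent.name in legal_html_tags:
            final_text += '{} '.format(t)   # then do the string operations
    return final_text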
