Beautiful Soup not returning anything I expected

Background: I'm following along with a Udemy tutorial that parses some information from Bing. The script takes user input, uses it as the search parameter for Bing, and returns all the href links it can find on the first page of results.

Code:

from bs4 import BeautifulSoup
import requests as re

search = input("Enter what you wanna search: \n")
params = {"q": search}
r = re.get("https://www.bing.com/search", params=params)

soup = BeautifulSoup(r.text, 'html.parser')

results = soup.find("ol",{"id":"b_results"})
links = results.findAll("li",{"class": "b_algo"})


for item in links:
    item_text = item.find("a").text
    item_href = item.href("a").attrs["href"]

    if item_text and item_href:
        print(item_text)
        print(item_href)

    else:
        print("Couldn't find 'a' or 'href'")

Problem: It returns nothing. The code obviously works for the instructor. I get no errors, and I've checked the id and class names to see whether they have changed on Bing since the video was made, but they are still the same:

"ol",{"id":"b_results"}
"li",{"class": "b_algo"}

Any ideas? I'm a complete noob at web scraping but intermediate at Python.

Thanks in advance!

Your code needs a bit of reworking.

First of all, you need headers; otherwise Bing (correctly) thinks you're a bot and does not return the results HTML.

Then, you need to check that the anchors are not None and, say, have at least http in the href.

For example:

from bs4 import BeautifulSoup
import requests


headers = {
    "user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.67 Safari/537.36",
}
# requests appends the "?" and the encoded params itself, so the base URL has no trailing "?".
page = requests.get("https://www.bing.com/search", headers=headers, params={"q": "python"}).text
soup = BeautifulSoup(page, 'html.parser')

# href=True keeps only anchors that actually carry an href attribute,
# so no None check or KeyError handling is needed.
for anchor in soup.find_all("a", href=True):
    if "http" in anchor["href"]:
        print(anchor.get_text(), anchor["href"])

Output:

Welcome to Python.org https://www.python.org/
Diese Seite übersetzen http://www.microsofttranslator.com/bv.aspx?ref=SERP&br=ro&mkt=de-DE&dl=de&lp=EN_DE&a=https%3a%2f%2fwww.python.org%2f
Python Downloads https://www.python.org/downloads/
Windows https://www.python.org/downloads/windows/
Python for Beginners https://www.python.org/about/gettingstarted/
About https://www.python.org/about/
Documentation https://www.python.org/doc/
Community https://www.python.org/community/
Success Stories https://www.python.org/success-stories/
News https://www.python.org/blogs/
Python (Programmiersprache) – Wikipedia https://de.wikipedia.org/wiki/Python_%28Programmiersprache%29
Wikipedia https://de.wikipedia.org/wiki/Python_%28Programmiersprache%29
CC-BY-SA-Lizenz http://creativecommons.org/licenses/by-sa/3.0/
Python lernen - Python Kurs für Anfänger und Fortgeschrittene https://www.python-lernen.de/
Python 3.9.0 (64bit) für Windows - Download https://python.de.uptodown.com/windows
Python-Tutorial: Tutorial für Anfänger und Fortgeschrittene https://www.python-kurs.eu/kurs.php
Mehr zu python-kurs.eu anzeigen https://www.python-kurs.eu/kurs.php
Python (Programmiersprache) – Wikipedia https://de.wikipedia.org/wiki/Python_%28Programmiersprache%29
Python (Programmiersprache) - Wikipedia https://de.wikipedia.org/wiki/Python_%28Programmiersprache%29
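
A variant that stays closer to the question's selectors (the `ol` with id `b_results` and the `li` items with class `b_algo`) might look like the sketch below. The function name `extract_results` and the sample markup are made up for illustration, and the real Bing markup may differ or change, so treat this as a pattern for guarding against missing elements rather than a guaranteed scraper.

```python
from bs4 import BeautifulSoup


def extract_results(html):
    """Return (title, url) pairs for each li.b_algo inside ol#b_results."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("ol", {"id": "b_results"})
    if container is None:
        # The page has no results list at all (for example, a bot-check page).
        return []
    pairs = []
    for item in container.find_all("li", {"class": "b_algo"}):
        # find() returns None when the item has no anchor with an href.
        anchor = item.find("a", href=True)
        if anchor is None:
            continue
        href = anchor["href"]
        if href.startswith("http"):
            pairs.append((anchor.get_text(strip=True), href))
    return pairs


# Hypothetical sample markup, only loosely modelled on a results page.
sample = """
<ol id="b_results">
  <li class="b_algo"><h2><a href="https://www.python.org/">Welcome to Python.org</a></h2></li>
  <li class="b_algo"><h2>No link here</h2></li>
  <li class="b_ad"><a href="https://example.com/ad">Ad</a></li>
</ol>
"""
print(extract_results(sample))
```

With a live request, the function would be called on the response text fetched with the browser-like headers shown earlier. An empty list then means either the results container or the result items were not in the HTML that came back.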

By the way, what course is this? Scraping search engines is not easy.

Your script is working fine. If you look carefully at the response from requests (e.g. save r.text to a file), you'll see that it is full of JavaScript.

Doing this, you'll see that the body is full of <script> tags:

<!DOCTYPE html>
<html>
<body>
<script>(...)</script>
<script>(...)</script>
<script>(...)</script>
</body>
</html>
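
The check described above can be sketched roughly as follows, using only the standard library. The names `TagCounter` and `inspect_response` are invented for this example. In the question's script you would pass `r.text` as the `html` argument, and the sample string here just mimics the script-only page shown above.

```python
from html.parser import HTMLParser
from pathlib import Path


class TagCounter(HTMLParser):
    """Counts <script> tags and notes whether <ol id="b_results"> is present."""

    def __init__(self):
        super().__init__()
        self.script_tags = 0
        self.has_results_list = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.script_tags += 1
        if tag == "ol" and ("id", "b_results") in attrs:
            self.has_results_list = True


def inspect_response(html, dump_path=None):
    """Optionally save the HTML for manual inspection, then summarize it."""
    if dump_path is not None:
        Path(dump_path).write_text(html, encoding="utf-8")
    counter = TagCounter()
    counter.feed(html)
    return counter.script_tags, counter.has_results_list


# Hypothetical stand-in for r.text: a page whose body holds only scripts.
sample = "<!DOCTYPE html><html><body><script>(...)</script><script>(...)</script></body></html>"
print(inspect_response(sample))
```

If the second value comes back False, the selectors in the question have nothing to match, which would explain the missing results even though the id and class names look right in the browser.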

I suggest trying another website, or using Selenium. Did Udemy really ask you to scrape bing.com?
