简体   繁体   中英

Beautiful Soup not returning anything I expected

Background: Following along with a Udemy tutorial which is parsing some information from Bing. It takes in a user input and uses that as a parameter to search Bing with, returning all the href links it can find on the first page

Code:

from bs4 import BeautifulSoup
import requests as re

search = input("Enter what you wanna search: \n")
params = {"q": search}
r = re.get("https://www.bing.com/search", params=params)

soup = BeautifulSoup(r.text, 'html.parser')

results = soup.find("ol",{"id":"b_results"})
links = results.findAll("li",{"class": "b_algo"})


for item in links:
    item_text = item.find("a").text
    item_href = item.href("a").attrs["href"]

    if item_text and item_href:
        print(item_text)
        print(item_href)

    else:
        print("Couldn't find 'a' or 'href'")

Problem: It returns nothing. The code obviously works for him. I get no errors as I've checked the id and class names to see if they've been changed on bing itself since the video was made but they are still the same

"ol",{"id":"b_results"}
"li",{"class": "b_algo"}

Any ideas? I'm a complete noob to web scraping but intermediate to Python.

Thanks in advance!

Your code needs a bit of reworking.

First of all, you need headers otherwise Bing (correctly) thinks you're a bot and it's not returning the HTML .

Then, you need to check if the anchors are not None and, say, have at least http in the href .

For example:

from bs4 import BeautifulSoup
import requests


headers = {
    "user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.67 Safari/537.36",
}
page = requests.get("https://www.bing.com/search?", headers=headers, params={"q": "python"}).text
soup = BeautifulSoup(page, 'html.parser')

anchors = soup.find_all("a")
for anchor in anchors:
    if anchor is not None:
        try:
            if "http" in anchor["href"]:
                print(anchor.getText(), anchor["href"])
        except KeyError:
            continue

Output:

Welcome to Python.org https://www.python.org/
Diese Seite übersetzen http://www.microsofttranslator.com/bv.aspx?ref=SERP&br=ro&mkt=de-DE&dl=de&lp=EN_DE&a=https%3a%2f%2fwww.python.org%2f
Python Downloads https://www.python.org/downloads/
Windows https://www.python.org/downloads/windows/
Python for Beginners https://www.python.org/about/gettingstarted/
About https://www.python.org/about/
Documentation https://www.python.org/doc/
Community https://www.python.org/community/
Success Stories https://www.python.org/success-stories/
News https://www.python.org/blogs/
Python (Programmiersprache) – Wikipedia https://de.wikipedia.org/wiki/Python_%28Programmiersprache%29
Wikipedia https://de.wikipedia.org/wiki/Python_%28Programmiersprache%29
CC-BY-SA-Lizenz http://creativecommons.org/licenses/by-sa/3.0/
Python lernen - Python Kurs für Anfänger und Fortgeschrittene https://www.python-lernen.de/
Python 3.9.0 (64bit) für Windows - Download https://python.de.uptodown.com/windows
Python-Tutorial: Tutorial für Anfänger und Fortgeschrittene https://www.python-kurs.eu/kurs.php
Mehr zu python-kurs.eu anzeigen https://www.python-kurs.eu/kurs.php
Python (Programmiersprache) – Wikipedia https://de.wikipedia.org/wiki/Python_%28Programmiersprache%29
Python (Programmiersprache) - Wikipedia https://de.wikipedia.org/wiki/Python_%28Programmiersprache%29

By the way, what course is this, because scraping search engines is not easy?

your script is working fine. If you look carefully to the requests answer (eg save r.text into a file), you'll see the answer is full of javascript.

Following this method, you'll see that the body is full of <script> balises:

<!DOCTYPE html>
<body>
<script>(...)</script>
<script>(...)</script>
<script>(...)</script>
</body>
</html>

I suggest to try another website, or use Selenium. Did Udemy really ask to try to scrape bing.com?

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM