Beautiful Soup doesn't give data for a site

I am trying to scrape this site for information: https://farm.ewg.org/addrsearch.php?stab2=NY&fullname=B&b=1&page=0

I tried code that has worked for other sites, but it leaves me with an empty text file instead of filling it with data. Here is my code:

import urllib
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
import json
import time
outfile = open('/Users/Luca/Desktop/test/farm_data.text','w')
my_list = list()

site = "https://farm.ewg.org/addrsearch.php?stab2=NY&fullname=A&b=1&page=0"
my_list.append(site)
site = "https://farm.ewg.org/addrsearch.php?stab2=NY&fullname=B&b=1&page=0"
my_list.append(site)
site = "https://farm.ewg.org/addrsearch.php?stab2=NY&fullname=C&b=1&page=0"
my_list.append(site)


for item in my_list:
    time.sleep( 5 )
    html = urlopen(item)
    bsObj = BeautifulSoup(html.read(), "html.parser")
    nameList = bsObj.prettify().split('.')
    count = 0
    for name in nameList:
        print(name[2:])
        outfile.write(name[2:] + ',' + item + '\n')

I am trying to split the page into smaller parts and go from there. The same code worked on other sites, for example: https://www.mtggoldfish.com/price/Aether+Revolt/Heart+of+Kiran#online

Any ideas why it works for some sites and not others? Thanks so much.

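The website in question probably rejects requests that do not look like they come from a browser. urlopen identifies itself with a default Python User-Agent, which is likely why you get: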

HTTPError: HTTP Error 403: Forbidden
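If the 403 is indeed what the question's script runs into, urlopen raises before anything is written, which would explain the empty file. A minimal sketch for confirming that, using only the standard library; describe_http_error and try_fetch are hypothetical helper names, not part of the original code:

```python
from urllib.error import HTTPError
from urllib.request import urlopen


def describe_http_error(err):
    """Turn an HTTPError into a short status string such as '403 Forbidden'."""
    return f"{err.code} {err.reason}"


def try_fetch(url):
    """Return the page bytes, or None after printing the HTTP status on failure."""
    try:
        with urlopen(url) as response:
            return response.read()
    except HTTPError as err:
        # The server answered, but with an error status (for example 403).
        print("Request failed:", describe_http_error(err))
        return None
```

Calling try_fetch on one of the URLs from the question should print the status instead of stopping the script with a traceback.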

You can work around this by spoofing your user agent, that is, by sending a browser-like User-Agent header with the request. Here's an example of how to do it using the fantastic requests module:

import requests
from bs4 import BeautifulSoup

url = "https://farm.ewg.org/addrsearch.php?stab2=NY&fullname=A&b=1&page=0"
# Send a browser-like User-Agent so the server does not reject the request
html = requests.get(url, headers={'User-Agent' : 'Mozilla/5.0'}).text
bsObj = BeautifulSoup(html, "html.parser")
print(bsObj)

Output:

<!DOCTYPE doctype html>    
<html class="no-js" lang="en" prefix="og: http://ogp.me/ns#" xmlns="http://www.w3.org/1999/xhtml" xmlns:fb="http://ogp.me/ns/fb#">
<head>
<meta charset="utf-8"/>
.
.
.

You can massage this code into your loop now.
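One way the loop might look with the User-Agent header applied is sketched below. It makes some assumptions: build_urls, page_text and scrape are hypothetical helpers, and the sketch writes each line of visible page text rather than splitting prettify() output on '.', so adjust the parsing to whatever fields you actually need.

```python
import time

import requests
from bs4 import BeautifulSoup

BASE_URL = "https://farm.ewg.org/addrsearch.php?stab2=NY&fullname={letter}&b=1&page=0"
HEADERS = {"User-Agent": "Mozilla/5.0"}  # same header as in the snippet above


def build_urls(letters):
    """Return one search URL per initial letter (hypothetical helper)."""
    return [BASE_URL.format(letter=letter) for letter in letters]


def page_text(html):
    """Parse HTML and return its visible text, one stripped string per line."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.get_text("\n", strip=True)


def scrape(urls, out_path, delay=5):
    """Fetch each URL with a browser-like User-Agent and write its text to out_path."""
    with open(out_path, "w") as outfile:
        for url in urls:
            time.sleep(delay)  # pause between requests, as in the question
            response = requests.get(url, headers=HEADERS, timeout=30)
            response.raise_for_status()  # fail loudly on 403 instead of writing nothing
            for line in page_text(response.text).splitlines():
                outfile.write(line + "," + url + "\n")
```

For the three pages in the question, scrape(build_urls("ABC"), "/Users/Luca/Desktop/test/farm_data.text") would fetch them in turn. The with block closes the file even if a request fails, and raise_for_status surfaces an error status immediately rather than leaving an empty file behind.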
