
HTTP Error 400: Bad Request (urllib)

I'm writing a script to get information regarding buildings in NYC. I know that my code works and returns what I'd like it to. I was previously doing manual entry and it worked. Now I'm trying to have it read addresses from a text file and access the website with that information, and I'm getting this error:

urllib.error.HTTPError: HTTP Error 400: Bad Request

I believe it has something to do with the website not liking lots of access from something that isn't a browser. I've heard something about User Agents but don't know how to use them. Here is my code:

from bs4 import BeautifulSoup
import urllib.request

f = open("FILE PATH GOES HERE")

def getBuilding(link):
    r = urllib.request.urlopen(link).read()
    soup = BeautifulSoup(r, "html.parser")
    print(soup.find("b", text="KEYWORDS IM SEARCHING FOR GO HERE:").find_next("td").text)


def main():
    for line in f:
        num, name = line.split(" ", 1)
        newName = name.replace(" ", "+")
        link = "LINK GOES HERE (constructed from num and newName variables)"
        getBuilding(link)      
    f.close()

if __name__ == "__main__":
    main()

A 400 error means that the server cannot understand your request (e.g., malformed syntax). That said, it's up to the developers which status code they want to return and, unfortunately, not everyone strictly follows the intended meanings.

Check out this page for more details on HTTP Status Codes.
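
For example, here is a minimal sketch (with a hypothetical helper name, assuming Python 3's urllib as in your question) of catching the error so you can see the status code and the body the server sends back, which often hints at what it did not like:

import urllib.request
import urllib.error

def fetchPage(link):
    try:
        return urllib.request.urlopen(link).read()
    except urllib.error.HTTPError as e:
        # e.code holds the HTTP status (400 here); the error response is
        # file-like, so its body can be read for a hint from the server.
        print("Request failed with status", e.code)
        print(e.read()[:500])
        raise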

As for how to set a User Agent: a User Agent is set in the request header and, essentially, identifies the client making the request. Here is a list of recognized User Agents. On Python 2 you will need to use urllib2 rather than urllib, but urllib2 is also a built-in package. I will show you how to update the getBuilding function to set the header using that module, but I would recommend checking out the requests library. I find it very straightforward, and it is widely adopted and supported.

Python 2:

from urllib2 import Request, urlopen
from bs4 import BeautifulSoup

def getBuilding(link):
    # Build the request and add a browser-like User-Agent header
    q = Request(link)
    q.add_header('User-Agent', 'Mozilla/5.0')
    r = urlopen(q).read()
    soup = BeautifulSoup(r, "html.parser")
    print(soup.find("b", text="KEYWORDS IM SEARCHING FOR GO HERE:").find_next("td").text)

Python 3:

from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

def getBuilding(link):
    # Build the request and add a browser-like User-Agent header
    q = Request(link)
    q.add_header('User-Agent', 'Mozilla/5.0')
    r = urlopen(q).read()
    soup = BeautifulSoup(r, "html.parser")
    print(soup.find("b", text="KEYWORDS IM SEARCHING FOR GO HERE:").find_next("td").text)

Note: The only difference between Python v2 and v3 is the import statement.
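
If you do try the requests library (a third-party package, typically installed with pip install requests), a rough equivalent of the function above might look like the sketch below; the header value and the parsing line are carried over unchanged from the urllib version:

import requests
from bs4 import BeautifulSoup

def getBuilding(link):
    # Headers are passed as a plain dict; raise_for_status() turns
    # 4xx/5xx responses into a requests.HTTPError exception.
    r = requests.get(link, headers={"User-Agent": "Mozilla/5.0"})
    r.raise_for_status()
    soup = BeautifulSoup(r.text, "html.parser")
    print(soup.find("b", text="KEYWORDS IM SEARCHING FOR GO HERE:").find_next("td").text)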
