
requests.get returns 400 bad url when given a variable containing a url, but not when given a string with the same url

I have a program that reads some URLs from a text file, gets the page source with requests.get, and then uses beautifulsoup4 to find some information.

f = open('inputfile.txt')
session = requests.Session()
session.headers.update({'User-Agent': 'Mozilla/5.0'})
for line in f:
    x = 0
    z = len(line)
    r = session.get(line[x:z])
    soup = bs4.BeautifulSoup(r.text, "html.parser")

This returns an HTTP 400 Bad Request - Invalid URL. However, when I do the same thing except type out the URL as a string, everything works (although I only get one URL).

f = open('inputfile.txt')
session = requests.Session()
session.headers.update({'User-Agent': 'Mozilla/5.0'})
for line in f:
    r = session.get('http://www.ExactSameUrlAsEarlier.com')
    soup = bs4.BeautifulSoup(r.text, "html.parser")

How would I fix/modify this to allow me to cycle through the multiple URLs I have in the file? Just for clarification, this is what the inputfile.txt looks like:

http://www.url1.com/something1
http://www.url2.com/something2

etc.

Thanks in advance.

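Iterating over the file handle already gives you one line at a time, so the loop itself is fine. The problem is that each line still ends with a newline character (\n), and line[0:len(line)] passes it to session.get as part of the URL, which the server rejects with a 400. Strip the whitespace before making the request:

for line in f:
    url = line.strip()
    r = session.get(url)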

There are other ways to strip whitespace from the line; have a look at this post: Getting rid of \n when using .readlines()
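
For reference, here is a minimal sketch of the whole loop with the whitespace stripped, assuming the file layout shown in the question. The helper names clean_urls and fetch_all are made up for this illustration and are not part of requests or bs4. Skipping blank lines is an extra precaution, since a trailing empty line in the file would otherwise be passed to session.get as an empty URL.

```python
def clean_urls(lines):
    """Yield each non-empty line with surrounding whitespace removed."""
    for line in lines:
        url = line.strip()  # drops the trailing '\n' (and '\r' from Windows files)
        if url:             # skip blank lines, e.g. at the end of the file
            yield url


def fetch_all(path):
    """Fetch every URL listed in the file at `path` and return the parsed pages."""
    # Imported here so clean_urls can be tried without these packages installed.
    import bs4
    import requests

    session = requests.Session()
    session.headers.update({'User-Agent': 'Mozilla/5.0'})
    soups = []
    with open(path) as f:  # the with block closes the file when the loop ends
        for url in clean_urls(f):
            r = session.get(url)
            soups.append(bs4.BeautifulSoup(r.text, "html.parser"))
    return soups


# What the loop in the question actually sees: the newline is part of each string.
raw = ["http://www.url1.com/something1\n", "http://www.url2.com/something2\n", "\n"]
print(repr(raw[0]))
print(list(clean_urls(raw)))
```

Printing repr(line) is a quick way to spot this kind of problem, because it makes invisible characters such as \n show up in the output. With the file from the question, calling fetch_all('inputfile.txt') would request each URL in turn.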


 