I have a program that reads some URLs from a text file, gets the page source with requests.get, and then uses beautifulsoup4 to find some information.
import requests
import bs4

f = open('inputfile.txt')
session = requests.Session()
session.headers.update({'User-Agent': 'Mozilla/5.0'})
for line in f:
    x = 0
    z = len(line)
    r = session.get(line[x:z])
    soup = bs4.BeautifulSoup(r.text, "html.parser")
This returns HTTP 400 Bad Request - Invalid URL. However, when I do the same thing but type the URL out as a string literal, everything works (although then I only fetch that one URL):
f = open('inputfile.txt')
session = requests.Session()
session.headers.update({'User-Agent': 'Mozilla/5.0'})
for line in f:
    r = session.get('http://www.ExactSameUrlAsEarlier.com')
    soup = bs4.BeautifulSoup(r.text, "html.parser")
How would I fix/modify this so I can cycle through the multiple URLs in the file? For clarification, this is what inputfile.txt looks like:
http://www.url1.com/something1
http://www.url2.com/something2
etc.
Thanks in advance.
The loop itself is fine — iterating over the file handle already yields one line at a time. The problem is that each line keeps its trailing newline character, so you are effectively requesting 'http://www.url1.com/something1\n', which is an invalid URL and triggers the 400. Strip the whitespace before making the request:

for line in f:
    url = line.strip()
    r = session.get(url)

There are other ways of stripping whitespace from a line; have a look at this post: Getting rid of \n when using .readlines()
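To see why the original version fails, here is a minimal, self-contained sketch — using io.StringIO as a stand-in for inputfile.txt, so no actual file or network access is needed — showing that each line read from a file ends with '\n' until you strip it:

```python
import io

# Stand-in for open('inputfile.txt'): each line ends with '\n',
# exactly as when reading lines from a real text file.
fake_file = io.StringIO(
    "http://www.url1.com/something1\n"
    "http://www.url2.com/something2\n"
)

raw = fake_file.readlines()
print(repr(raw[0]))  # 'http://www.url1.com/something1\n' -- newline included

urls = [line.strip() for line in raw]
print(urls)  # newline-free URLs, safe to pass to session.get
```

Passing the stripped urls to session.get gives requests a valid URL and avoids the 400.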