I was able to gather data from a web page using this
import requests
import lxml.html
import re
url = "http://animesora.com/flying-witch-episode-7-english-subtitle/"
r = requests.get(url)
page = r.content
dom = lxml.html.fromstring(page)
for link in dom.xpath('//div[@class="downloadarea"]//a/@href'):
down = re.findall('https://.*',link)
print (down)
when I try this to gather more data on the results of the above code I was presented with this error:
Traceback (most recent call last):
File "/home/sven/PycharmProjects/untitled1/.idea/test4.py", line 21, in <module>
r2 = requests.get(down)
File "/usr/local/lib/python2.7/dist-packages/requests/api.py", line 70, in get
return request('get', url, params=params, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/requests/api.py", line 56, in request
return session.request(method=method, url=url, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 475, in request
resp = self.send(prep, **send_kwargs)
File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 590, in send
adapter = self.get_adapter(url=request.url)
File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 672, in get_adapter
raise InvalidSchema("No connection adapters were found for '%s'" % url)
requests.exceptions.InvalidSchema: No connection adapters were found for '['https://link.safelinkconverter.com/review.php?id=aHR0cDovLygqKC5fKC9zTGZYZ0s=&c=1&user=51757']'
This is the code I was using:
for link2 in down:
r2 = requests.get(down)
page2 = r.url
dom2 = lxml.html.fromstring(page2)
for link2 in dom2('//div[@class="button green"]//onclick'):
down2 = re.findall('.*',down2)
print (down2)
You are passing in the whole list :
for link2 in down:
r2 = requests.get(down)
Note how you passed in down
, not link2
. down
is a list, not a single URL string.
Pass in link2
:
for link2 in down:
r2 = requests.get(link2)
I'm not sure why you are using regular expressions. In the loop
for link in dom.xpath('//div[@class="downloadarea"]//a/@href'):
each link
is already a fully qualified URL:
>>> for link in dom.xpath('//div[@class="downloadarea"]//a/@href'):
... print link
...
https://link.safelinkconverter.com/review.php?id=aHR0cDovLygqKC5fKC9FZEk2Qg==&c=1&user=51757
https://link.safelinkconverter.com/review.php?id=aHR0cDovLygqKC5fKC95Tmg2Qg==&c=1&user=51757
https://link.safelinkconverter.com/review.php?id=aHR0cDovLygqKC5fKC93dFBmVFg=&c=1&user=51757
https://link.safelinkconverter.com/review.php?id=aHR0cDovLygqKC5fKC9zTGZYZ0s=&c=1&user=51757
You don't need to do any further processing on that.
Your remaining code has more errors; you confused r2.url
with r2.content
, and forgot the .xpath
part in your dom2.xpath(...)
query.
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.