I created a Python 3 script that lets me run a search on a search engine (DuckDuckGo), get the HTML source code, and write it to a text file.
import pycurl
from io import BytesIO

buffer = BytesIO()
c = pycurl.Curl()
c.setopt(c.URL, 'https://duckduckgo.com/?q=test')
c.setopt(c.WRITEDATA, buffer)
c.setopt(c.FOLLOWLOCATION, True)
c.perform()
c.close()

body = buffer.getvalue()
# Write the raw bytes; str(body) would write the b'...' repr instead of the HTML.
with open("output.htm", "wb") as text_file:
    text_file.write(body)
print(body.decode('iso-8859-1'))
That part of the code works properly. However, when I open the output.htm file containing the HTML source code of the search engine, I don't see any search results (I only get an input box with my search topic written inside). I would like to get the same HTML source code that I would get by running curl https://duckduckgo.com/?q=test in my terminal.
DuckDuckGo's HTML pages use JavaScript to load the search results into their markup, so curl or PycURL will not get the same HTML content you'd see in a browser: curl and PycURL merely fetch internet resources and do not execute any JavaScript.
Use https://duckduckgo.com/api instead of scraping to retrieve search results from their servers.
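As a rough illustration of the API route suggested above, here is a minimal sketch. It assumes the Instant Answer endpoint at https://api.duckduckgo.com/ with the `format=json` and `no_html` parameters, and a JSON response whose `RelatedTopics` entries carry `Text` and `FirstURL` fields. The function names are made up for this example, and the standard library's urllib is used in place of PycURL to keep it dependency-free. Note that this endpoint serves instant answers and related topics, which is not necessarily the full list of results shown in the browser.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint of the DuckDuckGo Instant Answer API.
API_ENDPOINT = "https://api.duckduckgo.com/"


def build_query_url(query):
    """Build an API URL that asks for a JSON response (hypothetical helper)."""
    params = {"q": query, "format": "json", "no_html": 1}
    return API_ENDPOINT + "?" + urlencode(params)


def extract_related(payload):
    """Collect (text, url) pairs from the RelatedTopics list of a response."""
    results = []
    for topic in payload.get("RelatedTopics", []):
        # Assumption: category groups nest their items under a "Topics" key.
        for item in topic.get("Topics", [topic]):
            if "Text" in item and "FirstURL" in item:
                results.append((item["Text"], item["FirstURL"]))
    return results


def fetch_instant_answer(query):
    """Perform the HTTP request and decode the JSON body (needs network access)."""
    with urlopen(build_query_url(query), timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling extract_related(fetch_instant_answer("test")) should then give a list of text and URL pairs without any JavaScript processing, provided the response has the shape assumed here.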