I would like to collect information from the results given by a search engine. But I can only write text instead of unicode in the query part.
import urllib2
a = "바둑"
a = a.decode("utf-8")
type(a)
#Out[35]: unicode
url = "http://search.naver.com/search.naver?where=nexearch&query=%s" %(a)
url2 = urllib2.urlopen(url)
give this error
#UnicodeEncodeError: 'ascii' codec can't encode characters in position 39-40: ordinal not in range(128)
Encode the Unicode data to UTF-8, then URL-encode:
from urllib import urlencode
import urllib2
params = {'where': 'nexearch', 'query': a.encode('utf8')}
params = urlencode(params)
url = "http://search.naver.com/search.naver?" + params
response = urllib2.urlopen(url)
Demo:
>>> from urllib import urlencode
>>> a = u"바둑"
>>> params = {'where': 'nexearch', 'query': a.encode('utf8')}
>>> params = urlencode(params)
>>> params
'query=%EB%B0%94%EB%91%91&where=nexearch'
>>> url = "http://search.naver.com/search.naver?" + params
>>> url
'http://search.naver.com/search.naver?query=%EB%B0%94%EB%91%91&where=nexearch'
Using urllib.urlencode()
to build the parameters is easier, but you can also just escape the query
value withurllib.quote_plus()
:
from urllib import quote_plus
encoded_a = quote_plus(a.encode('utf8'))
url = "http://search.naver.com/search.naver?where=nexearch&query=%s" % encoded_a
Or much easier using requests
library which will encode everything nicely:
import requests
# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
"query": "jackie chan", # search query
"where": "nexearch" # web results
}
# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.79 Safari/537.36",
}
html = requests.get("https://search.naver.com/search.naver",
params=params,
headers=headers,
timeout=30).text
Alternatively, you can scrape Naver organic results with Naver Web Organic Results API from SerpApi. It's a paid API with a free plan.
The difference is that you don't have to figure out bypass blocks from Naver or other search engines, build a parser from scratch, or maintain it.
Code to integrate:
import json
from serpapi import NaverSearch
params = {
"api_key": "serpapi api key",
"engine": "naver",
"query": "jackie chan"
}
search = NaverSearch(params)
results = search.get_dict()
for result in results["web_results"]:
print(json.dumps(result, indent=2, ensure_ascii=False))
Part of the output:
{
"position": 1,
"title": "The Official Jackie Chan Website",
"link": "https://jackiechan.com/",
"displayed_link": "jackiechan.com",
"snippet": "Shocking News, Jimmy Wang Yu... I received some shocking news today on Ching Ming Festival, Jimmy Wang Yu has passed away ... See the latest photo albums of Jackie... Winter Olympics Torch Relay . Feb 03 The second day of the Beijing Winter Olympics torch relay continued along the Great Wall, with Jackie Chan .... Come visit the Jackie Chan Design Store to see the latest products available... Lunar New Year of the Tiger... The lucky giveaway has ended. Thank you for participating! Build A Schoo"
}
{
"position": 2,
"title": "成龍 Jackie Chan - 홈 | Facebook",
"link": "https://www.facebook.com/jackie",
"displayed_link": "www.facebook.com›jackie",
"snippet": "成龍 Jackie Chan. 좋아하는 사람 66,620,835명 · 이야기하고 있는 사람들 70,726명. This is the official Facebook page of international superstar Jackie Chan. Welcome! Jackie's..."
}, ... other results
Disclaimer, I work for SerpApi.
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.