简体   繁体   中英

python urllib2 and unicode

I would like to collect information from the results given by a search engine. But I can only write text instead of unicode in the query part.

import urllib2
a = "바둑"
a = a.decode("utf-8")
type(a)
#Out[35]: unicode

url = "http://search.naver.com/search.naver?where=nexearch&query=%s" %(a)
url2 = urllib2.urlopen(url)

give this error

#UnicodeEncodeError: 'ascii' codec can't encode characters in position 39-40: ordinal not in range(128)

Encode the Unicode data to UTF-8, then URL-encode:

from urllib import urlencode
import urllib2

params = {'where': 'nexearch', 'query': a.encode('utf8')}
params = urlencode(params)

url = "http://search.naver.com/search.naver?" + params
response = urllib2.urlopen(url)

Demo:

>>> from urllib import urlencode
>>> a = u"바둑"
>>> params = {'where': 'nexearch', 'query': a.encode('utf8')}
>>> params = urlencode(params)
>>> params
'query=%EB%B0%94%EB%91%91&where=nexearch'
>>> url = "http://search.naver.com/search.naver?" + params
>>> url
'http://search.naver.com/search.naver?query=%EB%B0%94%EB%91%91&where=nexearch'

Using urllib.urlencode() to build the parameters is easier, but you can also just escape the query value withurllib.quote_plus() :

from urllib import quote_plus
encoded_a = quote_plus(a.encode('utf8'))
url = "http://search.naver.com/search.naver?where=nexearch&query=%s" % encoded_a

Or much easier using requests library which will encode everything nicely:

import requests

# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
    "query": "jackie chan",  # search query
    "where": "nexearch"      # web results
}

# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.79 Safari/537.36",
}

html = requests.get("https://search.naver.com/search.naver", 
                    params=params, 
                    headers=headers, 
                    timeout=30).text

Alternatively, you can scrape Naver organic results with Naver Web Organic Results API from SerpApi. It's a paid API with a free plan.

The difference is that you don't have to figure out bypass blocks from Naver or other search engines, build a parser from scratch, or maintain it.

Code to integrate:

import json
from serpapi import NaverSearch


params = {
    "api_key": "serpapi api key",
    "engine": "naver",
    "query": "jackie chan"
}

search = NaverSearch(params)
results = search.get_dict()
    
for result in results["web_results"]:
    print(json.dumps(result, indent=2, ensure_ascii=False))

Part of the output:

{
  "position": 1,
  "title": "The Official Jackie Chan Website",
  "link": "https://jackiechan.com/",
  "displayed_link": "jackiechan.com",
  "snippet": "Shocking News, Jimmy Wang Yu... I received some shocking news today on Ching Ming Festival, Jimmy Wang Yu has passed away ... See the latest photo albums of Jackie... Winter Olympics Torch Relay . Feb 03 The second day of the Beijing Winter Olympics torch relay continued along the Great Wall, with Jackie Chan .... Come visit the Jackie Chan Design Store to see the latest products available... Lunar New Year of the Tiger... The lucky giveaway has ended. Thank you for participating! Build A Schoo"
}
{
  "position": 2,
  "title": "成龍 Jackie Chan - 홈 | Facebook",
  "link": "https://www.facebook.com/jackie",
  "displayed_link": "www.facebook.com›jackie",
  "snippet": "成龍 Jackie Chan. 좋아하는 사람 66,620,835명 · 이야기하고 있는 사람들 70,726명. This is the official Facebook page of international superstar Jackie Chan. Welcome! Jackie's..."
}, ... other results

Disclaimer, I work for SerpApi.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM