Python urllib2 and Unicode

Question

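I want to collect information from the results returned by a search engine. However, I can only put plain text in the query part of the URL, not unicode.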

import urllib2
a = "바둑"
a = a.decode("utf-8")
type(a)
#Out[35]: unicode

url = "http://search.naver.com/search.naver?where=nexearch&query=%s" %(a)
url2 = urllib2.urlopen(url)

This raises the following error:

#UnicodeEncodeError: 'ascii' codec can't encode characters in position 39-40: ordinal not in range(128)

Answer 1

Encode the Unicode data to UTF-8, then URL-encode it:

from urllib import urlencode
import urllib2

params = {'where': 'nexearch', 'query': a.encode('utf8')}
params = urlencode(params)

url = "http://search.naver.com/search.naver?" + params
response = urllib2.urlopen(url)

Demo:

>>> from urllib import urlencode
>>> a = u"바둑"
>>> params = {'where': 'nexearch', 'query': a.encode('utf8')}
>>> params = urlencode(params)
>>> params
'query=%EB%B0%94%EB%91%91&where=nexearch'
>>> url = "http://search.naver.com/search.naver?" + params
>>> url
'http://search.naver.com/search.naver?query=%EB%B0%94%EB%91%91&where=nexearch'
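
For readers on Python 3 (an assumption; the answer above targets Python 2), the same approach might look like the sketch below. In Python 3, urlencode lives in urllib.parse and percent-encodes str values as UTF-8 by default, so the explicit .encode('utf8') call is not needed. The helper name build_search_url is hypothetical and only used for illustration.

```python
from urllib.parse import urlencode, parse_qs, urlsplit

BASE_URL = "http://search.naver.com/search.naver"  # endpoint from the question


def build_search_url(query):
    """Return the search URL with the query percent-encoded as UTF-8."""
    # In Python 3, urlencode() encodes str values as UTF-8 by default.
    params = urlencode({"where": "nexearch", "query": query})
    return BASE_URL + "?" + params


url = build_search_url("바둑")
print(url)

# Round-trip check: decoding the query string gives back the original text.
decoded = parse_qs(urlsplit(url).query)
print(decoded["query"][0])
```

The resulting URL contains only ASCII characters, which is what avoids the UnicodeEncodeError from the question.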

Building the parameters with urllib.urlencode() is easier, but you can also escape the query value with urllib.quote_plus():

from urllib import quote_plus
encoded_a = quote_plus(a.encode('utf8'))
url = "http://search.naver.com/search.naver?where=nexearch&query=%s" % encoded_a
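
As a rough illustration of why quote_plus() suits query values, the sketch below compares it with quote(). It assumes Python 3, where both functions live in urllib.parse and accept str directly. The sample text is made up for the example.

```python
from urllib.parse import quote, quote_plus

text = "jackie chan & 바둑"  # sample value, not from the original question

# quote_plus() targets query-string values: spaces become "+",
# and reserved characters such as "&" are percent-encoded.
plus_encoded = quote_plus(text)

# quote() encodes spaces as "%20" instead of "+".
percent_encoded = quote(text)

print(plus_encoded)
print(percent_encoded)
```

Either form keeps the "&" inside the value from being read as a parameter separator.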

Answer 2

Alternatively, it is easier to use the requests library, which encodes everything properly for you:

import requests

# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
    "query": "jackie chan",  # search query
    "where": "nexearch"      # web results
}

# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.79 Safari/537.36",
}

html = requests.get("https://search.naver.com/search.naver", 
                    params=params, 
                    headers=headers, 
                    timeout=30).text
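
To see how requests would encode a non-ASCII query without sending anything over the network, one option is to prepare the request and inspect its URL. This is a minimal sketch that assumes the requests package is installed; the Korean query is borrowed from the question.

```python
import requests

params = {"query": "바둑", "where": "nexearch"}

# Build the request without sending it, to inspect the URL requests would use.
prepared = requests.Request(
    "GET", "https://search.naver.com/search.naver", params=params
).prepare()

print(prepared.url)
```

The printed URL should show the query value percent-encoded as UTF-8, the same result as the manual urlencode approach in Answer 1.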

Alternatively, you can scrape Naver organic results with the Naver Web Organic Results API from SerpApi. It is a paid API with a free plan.

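The difference is that you don't have to figure out how to bypass blocks from Naver or other search engines, build a parser from scratch, or maintain it.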

Code to integrate:

import json
from serpapi import NaverSearch


params = {
    "api_key": "serpapi api key",
    "engine": "naver",
    "query": "jackie chan"
}

search = NaverSearch(params)
results = search.get_dict()
    
for result in results["web_results"]:
    print(json.dumps(result, indent=2, ensure_ascii=False))

Part of the output:

{
  "position": 1,
  "title": "The Official Jackie Chan Website",
  "link": "https://jackiechan.com/",
  "displayed_link": "jackiechan.com",
  "snippet": "Shocking News, Jimmy Wang Yu... I received some shocking news today on Ching Ming Festival, Jimmy Wang Yu has passed away ... See the latest photo albums of Jackie... Winter Olympics Torch Relay . Feb 03 The second day of the Beijing Winter Olympics torch relay continued along the Great Wall, with Jackie Chan .... Come visit the Jackie Chan Design Store to see the latest products available... Lunar New Year of the Tiger... The lucky giveaway has ended. Thank you for participating! Build A Schoo"
}
{
  "position": 2,
  "title": "成龍 Jackie Chan - 홈 | Facebook",
  "link": "https://www.facebook.com/jackie",
  "displayed_link": "www.facebook.com›jackie",
  "snippet": "成龍 Jackie Chan. 좋아하는 사람 66,620,835명 · 이야기하고 있는 사람들 70,726명. This is the official Facebook page of international superstar Jackie Chan. Welcome! Jackie's..."
}, ... other results

Disclaimer: I work for SerpApi.


Answer 1: score 4, accepted, 2014-11-05 16:58:15

Answer 2: score 0, 2022-04-07 15:51:18
