Python urllib2 and Unicode

Question

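I want to collect information from the results a search engine returns. However, the query part of the URL only works when I pass a plain byte string (str), not a unicode object.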

import urllib2
a = "바둑"
a = a.decode("utf-8")
type(a)
#Out[35]: unicode

url = "http://search.naver.com/search.naver?where=nexearch&query=%s" %(a)
url2 = urllib2.urlopen(url)

This raises the following error:

#UnicodeEncodeError: 'ascii' codec can't encode characters in position 39-40: ordinal not in range(128)

Answer 1

Encode the Unicode data to UTF-8 first, then URL-encode it:

from urllib import urlencode
import urllib2

params = {'where': 'nexearch', 'query': a.encode('utf8')}
params = urlencode(params)

url = "http://search.naver.com/search.naver?" + params
response = urllib2.urlopen(url)
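For context, the error position in the question (39-40) is consistent with the HTTP request line being encoded as ASCII, although the original answer does not state this. The sketch below is written for Python 3 rather than Python 2, and `request_line` is an assumed reconstruction for illustration, not captured traffic. It shows the same failure and how percent-encoding the UTF-8 bytes avoids it.

```python
from urllib.parse import quote_plus

# Assumed reconstruction of the request line built from the unencoded URL.
request_line = "GET /search.naver?where=nexearch&query=바둑"

try:
    request_line.encode("ascii")
    error = None
except UnicodeEncodeError as exc:
    # The two Korean characters cannot be represented in ASCII.
    error = exc

# Percent-encoding the UTF-8 bytes first leaves only ASCII characters,
# so the ASCII encode succeeds.
safe_line = "GET /search.naver?where=nexearch&query=" + quote_plus("바둑")
safe_bytes = safe_line.encode("ascii")
```

Here `error.start` and `error.end` mark the span of the two offending characters.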

Demo:

>>> from urllib import urlencode
>>> a = u"바둑"
>>> params = {'where': 'nexearch', 'query': a.encode('utf8')}
>>> params = urlencode(params)
>>> params
'query=%EB%B0%94%EB%91%91&where=nexearch'
>>> url = "http://search.naver.com/search.naver?" + params
>>> url
'http://search.naver.com/search.naver?query=%EB%B0%94%EB%91%91&where=nexearch'
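On Python 3, `urllib.urlencode` lives in `urllib.parse`, and it encodes `str` values as UTF-8 by default, so the explicit `.encode('utf8')` step is not needed. The following is a minimal sketch of that equivalent, not part of the original answer. Dictionaries keep insertion order on Python 3.7+, so the parameter order differs from the Python 2 demo above.

```python
from urllib.parse import urlencode

a = "바둑"  # a Python 3 str is already Unicode

# urlencode percent-encodes the UTF-8 bytes of each str value.
params = urlencode({"where": "nexearch", "query": a})
url = "http://search.naver.com/search.naver?" + params
print(url)
```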

Building the parameters with urllib.urlencode() is easier, but you can also escape just the query value with urllib.quote_plus():

from urllib import quote_plus
encoded_a = quote_plus(a.encode('utf8'))
url = "http://search.naver.com/search.naver?where=nexearch&query=%s" % encoded_a
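To illustrate what `quote_plus` does differently from plain `quote`, here is a small Python 3 sketch (using `urllib.parse`, the Python 3 home of these functions). The sample query string is chosen only for illustration.

```python
from urllib.parse import quote, quote_plus, unquote_plus

query = "jackie chan"

# quote_plus targets query strings: a space becomes '+'.
plus_form = quote_plus(query)

# quote uses the generic percent form: a space becomes '%20'.
percent_form = quote(query)

# Decoding reverses the escaping, including for non-ASCII text.
round_trip = unquote_plus(quote_plus("바둑"))
```

Either form is generally accepted in a query string; `quote_plus` matches what `urlencode` produces by default.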

Answer 2

Or, more simply, use the requests library, which encodes everything properly for you:

import requests

# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
    "query": "jackie chan",  # search query
    "where": "nexearch"      # web results
}

# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.79 Safari/537.36",
}

html = requests.get("https://search.naver.com/search.naver", 
                    params=params, 
                    headers=headers, 
                    timeout=30).text
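To check how requests encodes a non-ASCII query without sending anything over the network, you can prepare the request and inspect its URL. This is a sketch for Python 3 that assumes the requests package is installed; the Korean query is taken from the question rather than from this answer.

```python
import requests

params = {"query": "바둑", "where": "nexearch"}

# Build the request without sending it, then look at the final URL.
prepared = requests.Request(
    "GET",
    "https://search.naver.com/search.naver",
    params=params,
).prepare()

print(prepared.url)
```

The printed URL should contain the query percent-encoded as UTF-8, the same bytes that Answer 1 produces by hand.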

Alternatively, you can scrape Naver organic results with the Naver Web Organic Results API from SerpApi. It is a paid API with a free plan.

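The difference is that you don't have to figure out how to get around blocks from Naver or other search engines, build a parser from scratch, or maintain it.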

Code to integrate:

import json
from serpapi import NaverSearch


params = {
    "api_key": "serpapi api key",
    "engine": "naver",
    "query": "jackie chan"
}

search = NaverSearch(params)
results = search.get_dict()
    
for result in results["web_results"]:
    print(json.dumps(result, indent=2, ensure_ascii=False))

Part of the output:

{
  "position": 1,
  "title": "The Official Jackie Chan Website",
  "link": "https://jackiechan.com/",
  "displayed_link": "jackiechan.com",
  "snippet": "Shocking News, Jimmy Wang Yu... I received some shocking news today on Ching Ming Festival, Jimmy Wang Yu has passed away ... See the latest photo albums of Jackie... Winter Olympics Torch Relay . Feb 03 The second day of the Beijing Winter Olympics torch relay continued along the Great Wall, with Jackie Chan .... Come visit the Jackie Chan Design Store to see the latest products available... Lunar New Year of the Tiger... The lucky giveaway has ended. Thank you for participating! Build A Schoo"
}
{
  "position": 2,
  "title": "成龍 Jackie Chan - 홈 | Facebook",
  "link": "https://www.facebook.com/jackie",
  "displayed_link": "www.facebook.com›jackie",
  "snippet": "成龍 Jackie Chan. 좋아하는 사람 66,620,835명 · 이야기하고 있는 사람들 70,726명. This is the official Facebook page of international superstar Jackie Chan. Welcome! Jackie's..."
}, ... other results

Disclaimer: I work for SerpApi.

2 answers

Answer 1: 4, accepted, 2014-11-05 16:58:15

Answer 2: 0, 2022-04-07 15:51:18
