python urllib2 and unicode

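I want to collect information from the results returned by a search engine, but I can only put plain ASCII text in the query part, not unicode.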

import urllib2
a = "바둑"
a = a.decode("utf-8")
type(a)
#Out[35]: unicode

url = "http://search.naver.com/search.naver?where=nexearch&query=%s" %(a)
url2 = urllib2.urlopen(url)

This gives the following error:

#UnicodeEncodeError: 'ascii' codec can't encode characters in position 39-40: ordinal not in range(128)
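The error comes from an implicit conversion of the URL to ASCII before the request is sent, and the Korean characters have no ASCII representation. A small sketch that reproduces only that encoding step, assuming Python 3 (no network request is made, and the explicit `encode("ascii")` call stands in for what the HTTP layer does internally):

```python
# Assumption: Python 3, where every str is already Unicode text.
url = "http://search.naver.com/search.naver?where=nexearch&query=%s" % "바둑"

try:
    url.encode("ascii")  # stand-in for the implicit ASCII conversion
    error = None
except UnicodeEncodeError as exc:
    error = exc

# error.start is the index of the first character with no ASCII representation
print(error.encoding, repr(url[error.start]))
```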

Encode the Unicode data to UTF-8, then URL-encode it:

from urllib import urlencode
import urllib2

params = {'where': 'nexearch', 'query': a.encode('utf8')}
params = urlencode(params)

url = "http://search.naver.com/search.naver?" + params
response = urllib2.urlopen(url)
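On Python 3 the same approach should carry over, although the functions have moved: `urlencode` lives in `urllib.parse` and `urlopen` in `urllib.request`. A minimal sketch, assuming Python 3; `build_search_url` is a hypothetical helper name, not part of the answer above. Because `urlencode` encodes `str` values as UTF-8 by default there, the explicit `.encode('utf8')` step is not needed:

```python
from urllib.parse import urlencode


def build_search_url(query):
    """Return the Naver search URL for a (possibly non-ASCII) query string."""
    # On Python 3, urlencode percent-encodes str values as UTF-8 by default.
    params = urlencode({"where": "nexearch", "query": query})
    return "http://search.naver.com/search.naver?" + params


url = build_search_url("바둑")
print(url)
# Fetching would then go through urllib.request.urlopen(url);
# it is left out here so the sketch runs offline.
```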

Demo:

>>> from urllib import urlencode
>>> a = u"바둑"
>>> params = {'where': 'nexearch', 'query': a.encode('utf8')}
>>> params = urlencode(params)
>>> params
'query=%EB%B0%94%EB%91%91&where=nexearch'
>>> url = "http://search.naver.com/search.naver?" + params
>>> url
'http://search.naver.com/search.naver?query=%EB%B0%94%EB%91%91&where=nexearch'

Building the parameters with urllib.urlencode() is easier, but you can also escape just the query value with urllib.quote_plus():

from urllib import quote_plus
encoded_a = quote_plus(a.encode('utf8'))
url = "http://search.naver.com/search.naver?where=nexearch&query=%s" % encoded_a
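For reference, a hedged Python 3 version of the same idea, where `quote_plus` is imported from `urllib.parse` and accepts `str` directly. The query text with a space in it is an illustrative assumption, chosen to show that `quote_plus` turns spaces into `+`; `unquote_plus` is used only to check the round trip:

```python
from urllib.parse import quote_plus, unquote_plus

query = "바둑 rules"  # illustrative query, not from the original question

# Non-ASCII characters become UTF-8 percent-escapes; the space becomes '+'.
encoded = quote_plus(query)
url = "http://search.naver.com/search.naver?where=nexearch&query=%s" % encoded

# Decoding the escaped value gives back the original text.
decoded = unquote_plus(encoded)
print(encoded)
```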

Or, easier still, use the requests library, which will encode everything properly for you:

import requests

# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
    "query": "jackie chan",  # search query
    "where": "nexearch"      # web results
}

# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.79 Safari/537.36",
}

html = requests.get("https://search.naver.com/search.naver", 
                    params=params, 
                    headers=headers, 
                    timeout=30).text

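Alternatively, you can scrape Naver organic results with the Naver Web Organic Results API from SerpApi. It is a paid API with a free plan.

The difference is that you don't have to figure out how to bypass blocks from Naver or other search engines, build a parser from scratch, or maintain it.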

Code to integrate:

import json
from serpapi import NaverSearch


params = {
    "api_key": "serpapi api key",
    "engine": "naver",
    "query": "jackie chan"
}

search = NaverSearch(params)
results = search.get_dict()
    
for result in results["web_results"]:
    print(json.dumps(result, indent=2, ensure_ascii=False))

Part of the output:

{
  "position": 1,
  "title": "The Official Jackie Chan Website",
  "link": "https://jackiechan.com/",
  "displayed_link": "jackiechan.com",
  "snippet": "Shocking News, Jimmy Wang Yu... I received some shocking news today on Ching Ming Festival, Jimmy Wang Yu has passed away ... See the latest photo albums of Jackie... Winter Olympics Torch Relay . Feb 03 The second day of the Beijing Winter Olympics torch relay continued along the Great Wall, with Jackie Chan .... Come visit the Jackie Chan Design Store to see the latest products available... Lunar New Year of the Tiger... The lucky giveaway has ended. Thank you for participating! Build A Schoo"
}
{
  "position": 2,
  "title": "成龍 Jackie Chan - 홈 | Facebook",
  "link": "https://www.facebook.com/jackie",
  "displayed_link": "www.facebook.com›jackie",
  "snippet": "成龍 Jackie Chan. 좋아하는 사람 66,620,835명 · 이야기하고 있는 사람들 70,726명. This is the official Facebook page of international superstar Jackie Chan. Welcome! Jackie's..."
}, ... other results
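If only a few fields are needed, the result dictionaries can be reduced with plain Python. A small sketch, assuming each entry carries the `position`, `title` and `link` keys seen in the output above; the `summarize` helper is hypothetical, and the sample list is trimmed from that output rather than fetched live:

```python
# Two entries trimmed from the output shown above; real responses may differ.
web_results = [
    {"position": 1, "title": "The Official Jackie Chan Website",
     "link": "https://jackiechan.com/"},
    {"position": 2, "title": "成龍 Jackie Chan - 홈 | Facebook",
     "link": "https://www.facebook.com/jackie"},
]


def summarize(results):
    """Return (position, title, link) tuples, skipping entries without a link."""
    return [
        (r.get("position"), r.get("title"), r["link"])
        for r in results
        if r.get("link")
    ]


for position, title, link in summarize(web_results):
    print(position, title, link)
```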

Disclaimer: I work for SerpApi.

