
Can't extract data from beautiful_soup object

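I am scraping a website (https://www.zhihu.com/people/xie-ke-41/followers) and want to get the information for all of its followers. As you can see, some of the follower information is loaded via AJAX. Using Chrome's developer tools, I found the URL that returns the follower information.

My code: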

import requests
from bs4 import BeautifulSoup


zhihu_rl = 'https://www.zhihu.com/node/ProfileFollowersListV2'

data = {
'method': 'next',
'params': '{"offset":20,"order_by":"created","hash_id":"86858a7a4aa77d290364625efcaacb70"}'}

headers = {
'Host': 'www.zhihu.com',
'Origin': 'https://www.zhihu.com',
'Referer': 'https://www.zhihu.com/people/xie-ke-41/followers',
'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36',
'X-Requested-With': 'XMLHttpRequest',
'X-Xsrftoken': 'foo',
'Cookie':'xxxxxxxxxxxx'}

rep = requests.post(url=zhihu_rl, data=data, headers=headers)

bsobj = BeautifulSoup(rep.text,  'html.parser')

print(bsobj.find_all('div', {'class': "zm-profile-card zm-profile-section-item zg-clear no-hovercard"}))

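It returns an empty list. I can see the information in the developer tools (the information I see in the developer tools), so why can't bs4 extract it? PS: I can get all the divs, but it fails as soon as I restrict by attribute.

The problem is that the response is escaped JSON. If you print bsobj, you can see output like this: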

{"r":0,
 "msg": ["<div class=\"zm-profile-card zm-profile-section-item zg-clear no-hovercard\">\n<div class=\"zg-right\">\n<button data-follow=\"m:button\" data-id=\"6327483c9e474097e7dbb2493a7f277c\" class=\"zg-btn zg-btn-follow zm-rich-follow-btn small nth-0\">\u5173\u6ce8\u4ed6<\/button>\n<\/div>\n<a title=\"\u738b\u5728\u9014\"\ndata-hovercard=\"p$t$wang-zai-tu-81\"\nclass=\"zm-item-link-avatar\"\nhref=\"\/people\/wang-zai-tu-81\">\n<img src=\"https:\/\/pic1.zhimg.com\/da8e974dc_m.jpg\" class=\"zm-item-img-avatar\">\n<\/a>\n<div class=\"zm-list-content-medium\">\n<h2 class=\"zm-list-content-title\"><span class=\"author-link-line\">\n<a data-hovercard=\"p$t$wang-zai-tu-81\" href=\"https:\/\/www.zhihu.com\/people\/wang-zai-tu-81\" class=\"zg-link author-link\" title=\"\u738b\u5728\u9014\"\n>\u738b\u5728\u9014<\/a><\/span><\/h2>\n\n<div class=\"summary-wrapper summary-wrapper--medium\">\n\n<span class=\"bio\"><\/span>\n<\/div>\n<div class=\"details zg-gray\">\n<a target=\"_blank\" href=\"\/people\/wang-zai-tu-81\/followers\" class=\"zg-link-gray-normal\">1 \u5173\u6ce8\u8005<\/a>\n\/\n<a target=\"_blank\" href=\"\/people\/wang-zai-tu-81\/asks\" class=\"zg-link-gray-normal\">0 \u63d0\u95ee<\/a>\n\/\n<a target=\"_blank\" href=\"\/people\/wang-zai-tu-81\/answers\" class=\"zg-link-gray-normal\">1 \u56de\u7b54<\/a>\n\/\n<a target=\"_blank\" href=\"\/people\/wang-zai-tu-81\" class=\"zg-link-gray-normal\">0 \u8d5e\u540c<\/a>\n<\/div>\n\n<\/div>\n<\/div>","<div class=\"zm-profile-card zm-profile-section-item zg-clear no-hovercard\">\n<div class=\"zg-right\">\n<button data-follow=\"m:button\" data-id=\"a3596eaecae6f05f0ddf95dfcc6b5517\" class=\"zg-btn zg-btn-follow zm-rich-follow-btn small nth-0\">\u5173\u6ce8<\/button>\n<\/div>\n<a title=\"\u7075\u9b42\"\ndata-hovercard=\"p$t$ling-hun-30-21\"\nclass=\"zm-item-link-avatar\"\nhref=\"\/people\/ling-hun-30-21\">\n<img src=\"https:\/\/pic1.zhimg.com\/da8e974dc_m.jpg\" class=\"zm-item-img-avatar\">\n<\/a>\n<div 
class=\"zm-list-content-medium\">\n<h2 class=\"zm-list-content-title\"><span class=\"author-link-line\">\n<a data-hovercard=\"p$t$ling-hun-30-21\" href=\"https:\/\/www.zhihu.com\/people\/ling-hun-30-21\" class=\"zg-link author-link\" title=\"\u7075\u9b42\"\n>\u7075\u9b42<\/a><\/span><\/h2>\n\n<div class=\"summary-wrapper summary-wrapper--medium\">\n\n<span class=\"bio\"><\/span>\n<\/div>\n<div class=\"details zg-gray\">\n<a target=\"_blank\" href=\"\/people\/ling-hun-30-21\/followers\" class=\"zg-link-gray-normal\">0 \u5173\u6ce8\u8005<\/a>\n\/\n<a target=\"_blank\" href=\"\/people\/ling-hun-30-21\/asks\" class=\"zg-link-gray-normal\">0 \u63d0\u95ee<\/a>\n\/\n<a target=\"_blank\" href=\"\/people\/ling-hun-30-21\/answers\" class=\"zg-link-gray-normal\">0 \u56de\u7b54<\/a>\n\/\n<a target=\"_blank\" href=\"\/people\/ling-hun-30-21\" class=\"zg-link-gray-normal\">0 \u8d5e\u540c<\/a>\n<\/div>\n\n<\/div>\n<\/div>","<div class=\"zm-profile-card zm-profile-section-item zg-clear no-hovercard\">\n<div class=\"zg-right\">\n<button data-follow=\"m:button\" data-id=\"74fad3af2b93f7da69c37eda64c31037\" class=\"zg-btn zg-btn-follow zm-rich-follow-btn small nth-0\">\u5173\u6ce8<\/button>\n<\/div>\n<a title=\"\u5f90\u6668\"\ndata-hovercard=\"p$t$xu-chen-77-49\"\nclass=\"zm-item-link-avatar\"\nhref=\"\/people\/xu-chen-77-49\">\n<img src=\"https:\/\/pic1.zhimg.com\/da8e974dc_m.jpg\" class=\"zm-item-img-avatar\">\n<\/a>\n<div class=\"zm-list-content-medium\">\n<h2 class=\"zm-list-content-title\"><span class=\"author-link-line\">\n<a data-hovercard=\"p$t$xu-chen-77-49\" href=\"https:\/\/www.zhihu.com\/people\/xu-chen-77-49\" class=\"zg-link author-link\" title=\"\u5f90\u6668\"\n>\u5f90\u6668<\/a><\/span><\/h2>\n\n<div class=\"summary-wrapper summary-wrapper--medium\">\n\n<span class=\"bio\">\u4f1a\u8ba1\u5e08<\/span>\n<\/div>\n<div class=\"details zg-gray\">\n<a target=\"_blank\" href=\"\/people\/xu-chen-77-49\/followers\" class=\"zg-link-gray-normal\">0 \u5173\u6ce8\u8005<\/a>\n\/\n<a 
target=\"_blank\" href=\"\/people\/xu-chen-77-49\/asks\" class=\"zg-link-gray-normal\">0 \u63d0\u95ee<\/a>\n\/\n<a target=\"_blank\" href=\"\/people\/xu-chen-77-49\/answers\" class=\"zg-link-gray-normal\">0 \u56de\u7b54<\/a>\n\/\n<a target=\"_blank\" href=\"\/people\/xu-chen-77-49\" class=\"zg-link-gray-normal\">0 \u8d5e\u540c<\/a>\n<\/div>\n\n<\/div>\n<\/div>","<div class=\"zm-profile-card zm-profile-section-item zg-clear no-hovercard\">\n<div class=\"zg-right\">\n<button data-follow=\"m:button\" data-id=\"032b36abfbe05a30913c794a4b099629\" class=\"zg-btn zg-btn-follow zm-rich-follow-btn small nth-0\">\u5173\u6ce8\u5979<\/button>\n<\/div>\n<a title=\"Shuai Zhang\"\ndata-hovercard=\"p$t$shuai-zhang-49\"\nclass=\"zm-item-link-avatar\"\nhref=\"\/people\/shuai-zhang-49\">\n<img src=\"https:\/\/pic2.zhimg.com\/v2-8aa42ff00873460e29444d62ff51acfd_m.jpg\" class=\"zm-item-img-avatar\">\n<\/a>\n<div class=\"zm-list-content-medium\">\n<h2 class=\"zm-list-content-title\"><span class=\"author-link-line\">\n<a data-hovercard=\"p$t$shuai-zhang-49\" href=\"https:\/\/www.zhihu.com\/people\/shuai-zhang-49\" class=\"zg-link author-link\" title=\"Shuai Zhang\"\n>Shuai Zhang<\/a><\/span><\/h2>\n\n<div class=\"summary-wrapper summary-wrapper--medium\">\n\n<span class=\"bio\"><\/span>\n<\/div>\n<div class=\"details zg-gray\">\n<a target=\"_blank\" href=\"\/people\/shuai-zhang-49\/followers\" class=\"zg-link-gray-normal\">79 \u5173\u6ce8\u8005<\/a>\n\/\n<a target=\"_blank\" href=\"\/people\/shuai-zhang-49\/asks\" class=\"zg-link-gray-normal\">1 \u63d0\u95ee<\/a>\n\/\n<a target=\"_blank\" href=\"\/people\/shuai-zhang-49\/answers\" class=\"zg-link-gray-normal\">119 \u56de\u7b54<\/a>\n\/\n<a target=\"_blank\" href=\"\/people\/shuai-zhang-49\" class=\"zg-link-gray-normal\">174 \u8d5e\u540c<\/a>\n<\/div>\n\n<\/div>\n<\/div>","<div class=\"zm-profile-card zm-profile-section-item zg-clear no-hovercard\">\n<div class=\"zg-right\">\n<button data-follow=\"m:button\" 
data-id=\"6388162f5357ca1bd872dc0b6efe4802\" class=\"zg-btn zg-btn-follow zm-rich-follow-btn small nth-0\">\u5173\u6ce8\u4ed6<\/button>\n<\/div>\n<a title=\"\u5468\u5468\"\ndata-hovercard=\"p$t$zhou-zhou-69-22\"\nclass=\"zm-item-link-avatar\"\nhref=\"\/people\/zhou-zhou-69-22\">\n<img src=\"https:\/\/pic1.zhimg.com\/da8e974dc_m.jpg\" class=\"zm-item-img-avatar\">\n<\/a>\n<div class=\"zm-list-content-medium\">\n<h2 class=\"zm-list-content-title\"><span class=\"author-link-line\">\n<a data-hovercard=\"p$t$zhou-zhou-69-22\" href=\"https:\/\/www.zhihu.com\/people\/zhou-zhou-69-22\" class=\"zg-link author-link\" title=\"\u5468\u5468\"\n>\u5468\u5468<\/a><\/span><\/h2>\n\n<div class=\"summary-wrapper summary-wrapper--medium\">\n\n<span class=\"bio\"><\/span>\n<\/div>\n<div class=\"details zg-gray\">\n<a target=\"_blank\" href=\"\/people\/zhou-zhou-69-22\/followers\" class=\"zg-link-gray-normal\">4 \u5173\u6ce8\u8005<\/a>\n\/\n<a target=\"_blank\" href=\"\/people\/zhou-zhou-69-22\/asks\" class=\"zg-link-gray-normal\">0 \u63d0\u95ee<\/a>\n\/\n<a target=\"_blank\" href=\"\/people\/zhou-zhou-69-22\/answers\" class=\"zg-link-gray-normal\">7 \u56de\u7b54<\/a>\n\/\n<a target=\"_blank\" href=\"\/people\/zhou-zhou-69-22\" class=\"zg-link-gray-normal\">1 \u8d5e\u540c<\/a>\n<\/div>\n\n<\/div>\n<\/div>","<div class=\"zm-profile-card zm-profile-section-item zg-clear no-hovercard\">\n<div class=\"zg-right\">\n<button data-follow=\"m:button\" data-id=\"3a1a9da0e0bb4abe2554fa2a6032f27f\" class=\"zg-btn zg-btn-follow zm-rich-follow-btn small nth-0\">\u5173\u6ce8\u5979<\/button>\n<\/div>\n<a title=\"\u7f8e\u7f8e\u836f\u5242\u5e08\"\ndata-hovercard=\"p$t$sui-nuo-81\"\nclass=\"zm-item-link-avatar\"\nhref=\"\/people\/sui-nuo-81\">\n<img src=\"https:\/\/pic2.zhimg.com\/ae23b8e89725a24de650dee53e9a60a5_m.jpg\" class=\"zm-item-img-avatar\">\n<\/a>\n<div class=\"zm-list-content-medium\">\n<h2 class=\"zm-list-content-title\"><span class=\"author-link-line\">\n<a data-hovercard=\"p$t$sui-nuo-81\" 
href=\"https:\/\/www.zhihu.com\/people\/sui-nuo-81\" class=\"zg-link 

Unfortunately, it is also not valid JSON, so we can't call rep.json() and get nicely unescaped HTML. You have to unescape it manually with string_escape (a Python 2 codec):

In [14]: rep = requests.post(url=zhihu_rl, data=data, headers=headers)

In [15]: bsobj = BeautifulSoup(rep.text.decode("string_escape"),  'lxml')

In [16]: ancs = (bsobj.find_all('div', {'class': 'zm-profile-card zm-profile-section-item zg-clear no-hovercard'}))

In [17]: len(ancs)
Out[17]: 20
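
The session above relies on Python 2, where the string_escape codec exists. On Python 3 there is no such codec, so a different route is needed. The following is only a minimal sketch, assuming the response body does parse as JSON with the fragments in a "msg" list (the sample output above has that shape, but the answer found the real body was not valid JSON, so treat this as an assumption). The sample_body string and the unescape_fragments helper are hypothetical stand-ins, not part of the original code.

```python
import json

# Hypothetical, shortened stand-in for rep.text (same shape as the output above).
sample_body = (
    r'{"r":0,"msg":["<div class=\"zm-profile-card zm-profile-section-item\">'
    r'\n<a title=\"\u738b\u5728\u9014\" href=\"\/people\/wang-zai-tu-81\"><\/a>'
    r'\n<\/div>"]}'
)


def unescape_fragments(body):
    """Return the HTML fragments from the "msg" list, unescaped by the JSON parser."""
    # json.loads raises json.JSONDecodeError if the body is not valid JSON.
    payload = json.loads(body)
    return payload.get("msg", [])


fragments = unescape_fragments(sample_body)
print(fragments[0])
```

Each returned fragment is plain HTML with the \" and \/ escapes resolved, so it could then be passed to BeautifulSoup in the same way as in the session above.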

Also, the class is zm-profile-section-item, not zm-profile-section- item.

Also, never post your login cookie in future; I could have had full access to your account within minutes.

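You used a good combination of headers; otherwise the server might not recognise your request and might assume you don't have JavaScript enabled. When restricting by attribute, use . for a class and # for an id. Other CSS selectors work fine too. You would also need Selenium for JavaScript execution (AJAX calls), because BeautifulSoup lacks this capability. Finally, make sure the site has no anti-scraping protection. In that case, you would need a JavaScript runtime such as Js2Py.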
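
As an illustration of the . and # selector syntax, here is a small sketch using BeautifulSoup's select method on already-unescaped markup. The HTML snippet is made up for the example, and the id "followers" in particular is an assumption that does not come from the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical, already-unescaped markup; the id "followers" is invented for this example.
html = """
<div id="followers">
  <div class="zm-profile-card zm-profile-section-item zg-clear no-hovercard">
    <a class="zg-link author-link" href="https://www.zhihu.com/people/wang-zai-tu-81">A</a>
  </div>
  <div class="zm-profile-card zm-profile-section-item zg-clear no-hovercard">
    <a class="zg-link author-link" href="https://www.zhihu.com/people/ling-hun-30-21">B</a>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# "." selects by class; chained classes must all be present, in any order.
cards = soup.select("div.zm-profile-card.zm-profile-section-item")

# "#" selects by id; a space means "descendant of".
links = [a["href"] for a in soup.select("#followers a.author-link")]

print(len(cards))
print(links)
```

Chaining only the classes that matter tends to be less brittle than matching the full class string, since it does not depend on the order or exact spacing of the class attribute.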

