Scraping "N" pages with BeautifulSoup and Requests (how to get the real page number)

Question

I want to get all the titles from this website:

http://www.shyan.gov.cn/zwhd/web/webindex.action

Right now my code successfully scrapes only one page, but I want to scrape multiple pages on the site above.

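For example, with the URL above, when I click the link to "page 2", the URL does not change at all. I looked at the page source and saw JavaScript for moving to the next page, such as javascript:gotopage(2) or javascript:void(0). Here is my code (it gets page 1):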

from bs4 import BeautifulSoup
import requests

url = 'http://www.shyan.gov.cn/zwhd/web/webindex.action'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'lxml')
# Each title is a link directly inside a <td class="tit3"> cell
titles = soup.select('td.tit3 > a')
for title in titles:
    print(title.get_text())

How can I change my code to scrape the titles from all the available pages? Thank you very much!

Answer 1

Try using the following URL format:

http://www.shiyan.gov.cn/zwhd/web/webindex.action?keyWord=&searchType=3&page.currentpage=2&page.pagesize=15&page.pagecount=2357&docStatus=&sendOrg=
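
As a minimal sketch of how that URL format could be generated for any page, the query string can be built from a dictionary. The function name `build_page_url` and its default values are illustrative only; the defaults simply copy the values shown in the URL above, and the host is the one used in that URL.

```python
from urllib.parse import urlencode

BASE_URL = "http://www.shiyan.gov.cn/zwhd/web/webindex.action"


def build_page_url(page, page_size=15, page_count=2357):
    """Return the listing URL for a 1-based page number (hypothetical helper)."""
    # Keys are kept in the same order as in the example URL.
    query = {
        "keyWord": "",
        "searchType": 3,
        "page.currentpage": page,
        "page.pagesize": page_size,
        "page.pagecount": page_count,
        "docStatus": "",
        "sendOrg": "",
    }
    return BASE_URL + "?" + urlencode(query)


# Print the URLs for the first three pages.
for page in range(1, 4):
    print(build_page_url(page))
```

Only `page.currentpage` changes between pages; whether the server ignores or validates the other parameters is not something the thread confirms.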

The site uses JavaScript to pass hidden page information to the server when requesting the next page. If you view the source, you will find:

<form action="/zwhd/web/webindex.action" id="searchForm" name="searchForm" method="post">
 <div class="item">
     <div class="titlel">
      <span>留言查詢</span>
     <label class="dow"></label>
     </div>
     <input type="text" name="keyWord" id="keyword" value="" class="text"/>
     <div class="key">
        <ul>
            <li><span><input type="radio" checked="checked" value="3" name="searchType"/></span><p>編號</p></li>
            <li><span><input type="radio" value="2" name="searchType"/></span><p>關鍵字</p></li>
        </ul>    
     </div>
     <input type="button" class="btn1" onclick="search();" value="查詢"/>
  </div>
  <input type="hidden" id="pageIndex" name="page.currentpage" value="2"/>
  <input type="hidden" id="pageSize" name="page.pagesize" value="15"/>
  <input type="hidden" id="pageCount" name="page.pagecount" value="2357"/>
  <input type="hidden" id="docStatus" name="docStatus" value=""/>
  <input type="hidden" id="sendorg" name="sendOrg" value=""/>
  </form>
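
The hidden fields in this form suggest one way to find the real page count and walk through the pages. The following is a rough sketch, not a tested scraper for this site: the helper names (`hidden_fields`, `page_titles`, `scrape_all_titles`) are made up for illustration, and it assumes that `page.pagecount` holds the total number of pages and that the server accepts these fields as GET parameters, as the URL format above implies. It uses the built-in `html.parser` backend instead of `lxml` to avoid an extra dependency.

```python
import time

import requests
from bs4 import BeautifulSoup

BASE_URL = "http://www.shiyan.gov.cn/zwhd/web/webindex.action"


def hidden_fields(html):
    """Map name -> value for every hidden input inside form#searchForm."""
    soup = BeautifulSoup(html, "html.parser")
    form = soup.find("form", id="searchForm")
    if form is None:
        return {}
    return {
        tag["name"]: tag.get("value", "")
        for tag in form.find_all("input", attrs={"type": "hidden"})
        if tag.has_attr("name")
    }


def page_titles(html):
    """Return the text of every title link on one listing page."""
    soup = BeautifulSoup(html, "html.parser")
    return [a.get_text(strip=True) for a in soup.select("td.tit3 > a")]


def scrape_all_titles(max_pages=None, delay=1.0):
    """Fetch page 1, read the page count from the form, then fetch the rest."""
    first = requests.get(BASE_URL, timeout=10)
    first.raise_for_status()
    fields = hidden_fields(first.content)
    # Assumption: page.pagecount is the number of pages, not the number of records.
    total = int(fields.get("page.pagecount") or 1)
    if max_pages is not None:
        total = min(total, max_pages)

    titles = page_titles(first.content)
    for page in range(2, total + 1):
        # Reuse the hidden fields and override only the current page number.
        params = {"keyWord": "", "searchType": 3, **fields, "page.currentpage": page}
        response = requests.get(BASE_URL, params=params, timeout=10)
        response.raise_for_status()
        titles.extend(page_titles(response.content))
        time.sleep(delay)  # pause between requests to go easy on the server
    return titles
```

A cautious first run would limit the number of pages, for example `scrape_all_titles(max_pages=3)`, and check the output before requesting everything. Since the form itself declares `method="post"`, sending the same fields with `requests.post(BASE_URL, data=params)` may be a closer match to what the browser does.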

Answer 1
1 Accepted 2016-04-18 17:01:15
