Web scraping multiple pages when the URL changes by adding 'offset=[# here]'

Question

from bs4 import BeautifulSoup
import pandas as pd
import requests

r = requests.get('https://reelgood.com/source/netflix')
soup = BeautifulSoup(r.text, 'html.parser')

title = soup.find_all('tr',attrs={'class':'cM'})

records = []
for t in title:
    movie = t.find(attrs={'class':'cI'}).text
    year = t.find(attrs={'class':'cJ'}).findNext('td').text
    rating = t.find(attrs={'class':'cJ'}).findNext('td').findNext('td').text
    score = t.find(attrs={'class':'cJ'}).findNext('td').findNext('td').findNext('td').text
    rottenTomatoe = t.find(attrs={'class':'cJ'}).findNext('td').findNext('td').findNext('td').findNext('td').text
    episodes = t.find(attrs={'class':'c0'}).text[:3]
    records.append([movie, year, rating, score, rottenTomatoe, episodes])

df = pd.DataFrame(records, columns=['movie', 'year', 'rating', 'score', 'rottenTomatoe', 'episodes'])

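The code above gives me 49 records, which is the first page. I want to scrape all 43 pages. Each time you go to the next page to get the next 50 videos, the URL changes: going from the first page to the second adds "?offset=150", and the offset then increases by 100 for every page after that. Here is what the URL looks like for the last page (you can see offset=4250): https://reelgood.com/source/netflix?offset=4250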
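The offset pattern described above can be written down directly. What follows is only a minimal sketch, assuming the pattern holds exactly as stated (no offset on the first page, 150 on the second, then steps of 100 up to 4250). `page_urls` is a hypothetical helper name, not part of any library.

```python
BASE_URL = "https://reelgood.com/source/netflix"


def page_urls(base_url=BASE_URL, first_offset=150, last_offset=4250, step=100):
    """Return the URL of every page, following the offset pattern described above."""
    # The first page has no offset parameter at all.
    urls = [base_url]
    # Every later page adds ?offset=N, with N running 150, 250, ..., 4250.
    for offset in range(first_offset, last_offset + 1, step):
        urls.append(f"{base_url}?offset={offset}")
    return urls


urls = page_urls()
print(len(urls))   # 43, matching the 43 pages mentioned above
print(urls[1])     # https://reelgood.com/source/netflix?offset=150
print(urls[-1])    # https://reelgood.com/source/netflix?offset=4250
```

Each URL in the list could then be fetched with `requests.get` and parsed the same way as the first page. Hard-coding the offsets is brittle if the site changes its paging, so treat the defaults as assumptions taken from this description.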

Any help on how to get the result set for all pages would be greatly appreciated. Thank you.

Answer 1

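I think the simplest way is to look for class='eH', which is where the link for loading more content lives.

It is the only element on the page with that class value. Once you reach offset=4250, the link disappears.

So the loop would look something like this: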

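from urllib.parse import urljoin

from bs4 import BeautifulSoup
import pandas as pd
import requests

records = []
keep_looping = True
url = "https://reelgood.com/source/netflix"
while keep_looping:
    r = requests.get(url)
    soup = BeautifulSoup(r.text, "html.parser")
    # Grab the content here and store it, using the same parsing as in the question.
    title = soup.find_all('tr', attrs={'class': 'cM'})
    for t in title:
        movie = t.find(attrs={'class': 'cI'}).text
        year = t.find(attrs={'class': 'cJ'}).findNext('td').text
        rating = t.find(attrs={'class': 'cJ'}).findNext('td').findNext('td').text
        score = t.find(attrs={'class': 'cJ'}).findNext('td').findNext('td').findNext('td').text
        rottenTomatoe = t.find(attrs={'class': 'cJ'}).findNext('td').findNext('td').findNext('td').findNext('td').text
        episodes = t.find(attrs={'class': 'c0'}).text[:3]
        records.append([movie, year, rating, score, rottenTomatoe, episodes])
    # Find the next link to visit. If the tag does not exist, url_tag will be None,
    # and we tell the while loop to stop by setting the keep_looping flag to False.
    url_tag = soup.find('a', class_='eH')
    if not url_tag:
        keep_looping = False
    else:
        # The href is not an absolute URL but a path such as "/source/netflix?offset=150",
        # so resolve it against the URL of the current page.
        url = urljoin(url, url_tag.get('href'))

df = pd.DataFrame(records, columns=['movie', 'year', 'rating', 'score', 'rottenTomatoe', 'episodes'])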

Answer 2

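I work at Reelgood. Please note that the class names on https://reelgood.com change every time we release an update to the web app.

We would be more than happy to help with whatever you are trying to accomplish here, so feel free to email me at luigi@reelgood.com.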
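Since the class names change with each release, selectors such as 'cM' or 'eH' can stop matching without warning. One way to depend on them less is to read table cells by position instead of by class. The following is a rough sketch under assumptions: the sample markup, its class names, and the column order are made up for illustration, and `rows_by_position` is a hypothetical helper, so the real page may need different handling.

```python
from bs4 import BeautifulSoup

# Made-up sample markup for illustration only; the real page layout may differ.
SAMPLE_HTML = """
<table>
  <tr class="x1"><td class="x2">Example Show</td><td class="x3">2018</td><td class="x4">TV-MA</td></tr>
  <tr class="y1"><td class="y2">Another Title</td><td class="y3">2016</td><td class="y4">PG-13</td></tr>
</table>
"""


def rows_by_position(html):
    """Extract each table row as a list of cell texts, ignoring class names."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:  # skip rows that contain no <td> cells, such as header rows
            rows.append(cells)
    return rows


for row in rows_by_position(SAMPLE_HTML):
    print(row)
```

Positional parsing has its own failure mode: it breaks if columns are added or reordered, so it is worth checking the number of cells per row before trusting the result.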

Solution 1 (score 1, accepted): 2018-06-15 21:32:20

Solution 2 (score 0): 2018-06-15 21:25:53
