How to scrape all rows from a dynamic HTML table using Python

Question

Here's the link I'm scraping: http://5000best.com/websites/Games/

I've tried almost everything I can think of. I'm a beginner in web scraping.

My code:

import requests
from urllib.request import urlopen
from urllib.error import HTTPError
from urllib.error import URLError
from bs4 import BeautifulSoup
import pandas as pd
import csv


try:
    html = urlopen("http://5000best.com/websites/Games/")

except HTTPError as e:
    print(e)

except URLError as u:
    print(u)

else:
    soup = BeautifulSoup(html,"html.parser")
    table = soup.findAll('div',{"id":"content"})[0]
    tr = table.findAll(['tr'])[0:]
    csvFile = open('games.csv','wt', newline='',encoding='utf-8')
    writer = csv.writer(csvFile)
    try:   
        for cell in tr:
            th = cell.find_all('th')
            th_data = [col.text.strip('\n') for col in th]
            td = cell.find_all('td')
            row = [i.text.replace('\n','') for i in td]
            writer.writerow(th_data+row)      

    finally:   
        csvFile.close()

This code only scrapes the first page of the table, but I want all the pages. I inspected the web page and didn't see any URL change while toggling the page numbers, so it seems to be completely dynamic.

Answer 1

You can read it directly with the pandas.read_html() function, which returns the tables on a page as a list of DataFrames and does most of the work for you.

import pandas as pd


def main(url):
    for item in range(1, 4):
        df = pd.read_html(url.format(item))[1]
        print(df)


main("http://5000best.com/websites/Games/{}/")

Edit: saving each page to its own CSV file:

import pandas as pd


def main(url):
    for item in range(1, 4):
        df = pd.read_html(url.format(item))[1]
        print(f"Saving Page {item}")
        df.to_csv(f"page{item}.csv", index=False)


main("http://5000best.com/websites/Games/{}/")

Code updated to collect everything into a single DataFrame:

import pandas as pd


def main(url):
    goal = []
    for item in range(1, 4):
        df = pd.read_html(url.format(item))[1]
        goal.append(df)
    final = pd.concat(goal)
    print(final)


main("http://5000best.com/websites/Games/{}/")
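If the goal is one CSV file, as in the question, the single-DataFrame version can be extended slightly. The following is only a sketch: the function name scrape_pages, the table_index parameter and the injectable reader argument are illustrative choices, not part of the original answer, and it assumes (as the code above does) that the table of interest is the second one returned for each page.

```python
import pandas as pd


def scrape_pages(url_template, pages, table_index=1, reader=pd.read_html):
    """Read one table per page and return them stacked in a single DataFrame.

    `reader` defaults to pandas.read_html; it is a parameter only so the
    function can be exercised without network access (an assumption of
    this sketch, not something the original answer does).
    """
    frames = []
    for page in pages:
        tables = reader(url_template.format(page))
        frames.append(tables[table_index])
    # ignore_index=True renumbers the rows 0..n-1 instead of repeating
    # each page's own 0-based index in the combined result.
    return pd.concat(frames, ignore_index=True)


# Example call (performs real HTTP requests, so it is left commented out):
# final = scrape_pages("http://5000best.com/websites/Games/{}/", range(1, 4))
# final.to_csv("games.csv", index=False)
```

Passing ignore_index=True to pd.concat avoids duplicate index labels across pages, which matters if the combined frame is later indexed by row label.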

Answer 2

Looking at the network inspector for that page reveals that it makes a request to a separate URL each time you change pages. You may want to just scrape those URLs instead.

Answer 3

Let me try to help you understand.

Have you used the developer tools in your browser? Open them (press F12, or right-click > Inspect Element) and select the Network tab. Now, while keeping the tab open, click the link to the next page. A request shows up in the Network tab.

This is what you are looking for. Every request a web page makes dynamically can be viewed here.

Hope this helps you learn something. Cheers!

3 answers

Answer 1
1 Accepted 2020-05-11 07:52:28

Answer 2
0 2020-05-11 07:30:35

Answer 3
0 2020-05-11 07:33:32
