How to scrape all rows from a dynamic table in HTML using Python

Here's the link I'm scraping: http://5000best.com/websites/Games/

I've tried almost everything I can. I'm a beginner in web scraping.

My code:

import requests
from urllib.request import urlopen
from urllib.error import HTTPError
from urllib.error import URLError
from bs4 import BeautifulSoup
import pandas as pd
import csv


try:
    html = urlopen("http://5000best.com/websites/Games/")

except HTTPError as e:
    print(e)

except URLError as u:
    print(u)

else:
    soup = BeautifulSoup(html,"html.parser")
    table = soup.findAll('div',{"id":"content"})[0]
    tr = table.findAll(['tr'])[0:]
    csvFile = open('games.csv','wt', newline='',encoding='utf-8')
    writer = csv.writer(csvFile)
    try:   
        for cell in tr:
            th = cell.find_all('th')
            th_data = [col.text.strip('\n') for col in th]
            td = cell.find_all('td')
            row = [i.text.replace('\n','') for i in td]
            writer.writerow(th_data+row)      

    finally:   
        csvFile.close()

This code only scrapes the first page of the table; I want all the pages. I inspected the web page but didn't see the URL change while toggling the page numbers, so it seems to be completely dynamic.

You can read it directly with the pandas.read_html() function, which returns the tables on the page as DataFrames and does the parsing for you.

import pandas as pd


def main(url):
    for item in range(1, 4):
        df = pd.read_html(url.format(item))[1]
        print(df)


main("http://5000best.com/websites/Games/{}/")

Sample of output:

[Image: sample of output]

Edit, saving each page to CSV:

import pandas as pd


def main(url):
    for item in range(1, 4):
        df = pd.read_html(url.format(item))[1]
        print(f"Saving Page {item}")
        df.to_csv(f"page{item}.csv", index=False)


main("http://5000best.com/websites/Games/{}/")

Code updated to build a single DataFrame:

import pandas as pd


def main(url):
    goal = []
    for item in range(1, 4):
        df = pd.read_html(url.format(item))[1]
        goal.append(df)
    final = pd.concat(goal)
    print(final)


main("http://5000best.com/websites/Games/{}/")
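The loops above hard-code range(1, 4). If the number of pages is not known in advance, one option is to keep requesting pages until one comes back empty. The following is only a sketch: collect_pages and its fetch_page argument are hypothetical names, and it assumes the site returns an empty table once you go past the last page, which has not been verified here.

```python
from typing import Callable, List, Sequence


def collect_pages(fetch_page: Callable[[int], Sequence],
                  max_pages: int = 100) -> List[Sequence]:
    """Call fetch_page(1), fetch_page(2), ... until a page comes back empty.

    fetch_page is a hypothetical callable supplied by the caller.
    max_pages is a safety cap so the loop always terminates.
    """
    pages = []
    for number in range(1, max_pages + 1):
        rows = fetch_page(number)
        if len(rows) == 0:  # assumed signal that we ran past the last page
            break
        pages.append(rows)
    return pages
```

With pandas, fetch_page could be something like lambda n: pd.read_html(url.format(n))[1], and the returned list could be passed to pd.concat as in the code above. If the site answers an out-of-range page with an error instead of an empty table, the callable would need to catch that and return an empty result.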

Looking at the network inspector for that page reveals that it makes requests to

when you change pages. You may want to just scrape those instead.
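Scraping the per-page URLs can also be done with only the standard library. The sketch below assumes the pages follow the http://5000best.com/websites/Games/{}/ pattern used in the pandas answer; TableRowParser, parse_rows and scrape_to_csv are illustrative names, not part of any library. It swaps BeautifulSoup for html.parser and collects the rows of every table on the page, so unwanted rows may need filtering afterwards.

```python
import csv
from html.parser import HTMLParser
from urllib.request import urlopen


class TableRowParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def _end_cell(self):
        if self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
        self._cell = None

    def _end_row(self):
        self._end_cell()
        if self._row:
            self.rows.append(self._row)
        self._row = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._end_row()  # tolerate a missing </tr>
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._end_cell()  # tolerate a missing </td>
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._end_cell()
        elif tag in ("tr", "table"):
            self._end_row()


def parse_rows(html_text):
    """Return the table rows found in html_text as lists of strings."""
    parser = TableRowParser()
    parser.feed(html_text)
    parser.close()
    return parser.rows


def scrape_to_csv(url_template, pages, out_path):
    """Fetch each page number in pages and append its rows to one CSV file."""
    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        for number in pages:
            with urlopen(url_template.format(number)) as response:
                html_text = response.read().decode("utf-8", errors="replace")
            writer.writerows(parse_rows(html_text))
```

Calling scrape_to_csv("http://5000best.com/websites/Games/{}/", range(1, 4), "games.csv") would write the first three pages into one file, assuming the URL pattern holds.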

Let me try to help you understand.

Have you used the developer tools in your browser? Open them (press F12, or right-click > Inspect Element) and select the Network tab. Now, while keeping the tab open, click the next-page link. A request shows up in the Network tab.

This is what you are looking for. Everything dynamic on a web page can be viewed here.

Hope this helps you learn something. Cheers!
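Once the request is visible in the Network tab, its URL and headers can be copied and replayed from Python. Here is a rough sketch with the standard library; build_request and fetch_text are hypothetical helper names, and the User-Agent value is a placeholder, so substitute whatever your browser actually sends.

```python
from urllib.request import Request, urlopen


def build_request(url, extra_headers=None):
    """Build a GET request that mimics one seen in the Network tab."""
    # Placeholder header; replace with the values copied from the browser.
    headers = {"User-Agent": "Mozilla/5.0 (compatible; example-scraper)"}
    headers.update(extra_headers or {})
    return Request(url, headers=headers)


def fetch_text(request, timeout=10):
    """Send the request and return the response body as text."""
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

For example, fetch_text(build_request("http://5000best.com/websites/Games/2/")) would fetch the second page, assuming the request URL follows the pattern used in the pandas answer. The returned HTML can then be parsed the same way as the first page.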


 