Trouble Scraping a Table with Python BeautifulSoup

I'm trying to scrape the table data from this website: https://www.playnj.com/atlantic-city/revenue/

Yet when I try to print the table, it returns None. Can someone assist me with this?

Here is my code:

from bs4 import BeautifulSoup
import requests
import pandas as pd
base_url = 'https://www.playnj.com/atlantic-city/revenue/'
resp = requests.get(base_url)
soup = BeautifulSoup(resp.text, "html.parser")
october_table = soup.find('table', {'id': 'tablepress-342-no-2'})
print(october_table)

This returns None and I am unsure why. Ideally (and perhaps I am wrong here), if my objective is to get all the data from all the tables, it would be more efficient to select on the class wrapper shared by all the tables, so I would use the following two lines instead (but maybe not).

all_tables = soup.findAll('table', {'class': 'dataTables_wrapper no-footer'})
print(all_tables)

However, this also comes back empty (findAll returns an empty list rather than None). Any help here would be immensely appreciated.

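from io import StringIO

import pandas as pd
import requests

# The site appears to reject requests without a browser-like User-Agent.
headers = {"User-Agent": "Mozilla/5.0"}

resp = requests.get("https://www.playnj.com/atlantic-city/revenue/", headers=headers)

# read_html returns a list of DataFrames, one per <table>; take the first.
# Wrapping the text in StringIO avoids the deprecation warning that newer
# pandas versions emit for literal HTML strings.
df = pd.read_html(StringIO(resp.text))[0]

df.to_csv("out.csv", index=False)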

Output:

          Casino Table & Other       Poker Slot Machines Total Gaming Win
0        Bally's    $3,441,617    $183,255    $9,780,559      $13,405,431
1        Borgata   $16,744,564  $1,631,575   $40,669,801      $59,045,940
2        Caesars   $13,785,260         $ -   $14,530,482      $28,315,742
3  Golden Nugget    $5,237,258     $92,647   $11,728,116      $17,058,021
4      Hard Rock    $7,155,391         $ -   $16,338,090      $23,493,481
5       Harrah's    $5,555,330    $222,323   $19,794,846      $25,572,499
6   Ocean Resort    $4,965,900     $82,686   $14,459,903      $19,508,489
7        Resorts    $3,328,916         $ -   $10,566,342      $13,895,258
8      Tropicana    $4,531,234    $159,957   $18,957,670      $23,648,861
9          Total   $64,745,470  $2,372,443  $156,825,809     $223,943,722

CSV File: view-online
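The saved values are strings such as "$3,441,617", and casinos without poker figures show "$ -". If the figures are needed as numbers, a small helper could convert them. This is only a sketch: parse_dollars is a hypothetical name, not part of pandas, and treating "$ -" as a missing value (None) is an assumption about what the dash means.

```python
from typing import Optional


def parse_dollars(text: str) -> Optional[int]:
    """Convert a string like '$3,441,617' to an int; return None for '$ -'."""
    # Drop the currency symbol, thousands separators, and surrounding whitespace.
    cleaned = text.replace("$", "").replace(",", "").strip()
    # Assumption: a lone dash marks "no data", not zero.
    if cleaned in ("", "-"):
        return None
    return int(cleaned)


print(parse_dollars("$3,441,617"))  # 3441617
print(parse_dollars("$ -"))         # None
```

With the DataFrame from the answer above, something like df["Poker"].map(parse_dollars) would apply it to a single column.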

Request with headers:

import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:72.0) Gecko/20100101 Firefox/72.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'ru-RU,ru;q=0.8,en;q=0.6,en-US;q=0.4,tr;q=0.2',
    'DNT': '1',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1',
}

resp = requests.get('https://www.playnj.com/atlantic-city/revenue/', headers=headers)
soup = BeautifulSoup(resp.text, "html.parser")
tables = soup.select('table.tablepress')

It seems this page checks the User-Agent header.

It works even with an incomplete value such as "User-Agent": "Mozilla/5.0".
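For context, requests sends a User-Agent of the form python-requests/<version> unless one is supplied, which is presumably what the site rejects (the exact rule is not confirmed here). A minimal sketch of setting the header once on a session so every request carries it; make_session is a hypothetical helper name:

```python
import requests


def make_session(user_agent: str = "Mozilla/5.0") -> requests.Session:
    """Return a Session that sends the given User-Agent on every request."""
    session = requests.Session()
    # Session-level headers are merged into each request made through it.
    session.headers.update({"User-Agent": user_agent})
    return session


# What requests would send by default, e.g. 'python-requests/2.x.y'.
print(requests.utils.default_user_agent())

session = make_session()
print(session.headers["User-Agent"])  # Mozilla/5.0
```

A call such as session.get('https://www.playnj.com/atlantic-city/revenue/') would then behave like the requests.get call above with explicit headers.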

BTW: this table has a different ID: 'id': 'tablepress-342'
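One way to discover the real ID is to list the id attribute of every table in the parsed page. The sketch below runs on a small inline HTML string that stands in for the page; the markup is made up for illustration, not copied from the site. It may also matter that class names such as dataTables_wrapper are typically added in the browser by the DataTables JavaScript library, so they would not appear in the HTML that requests receives; that is an assumption about this page, not something verified here.

```python
from bs4 import BeautifulSoup

# Made-up markup standing in for the downloaded page.
html = """
<table id="tablepress-342" class="tablepress"><tr><td>a</td></tr></table>
<table id="tablepress-343" class="tablepress"><tr><td>b</td></tr></table>
<table><tr><td>no id</td></tr></table>
"""

soup = BeautifulSoup(html, "html.parser")

# Collect the id of each <table>; tables without one yield None.
table_ids = [table.get("id") for table in soup.find_all("table")]
print(table_ids)  # ['tablepress-342', 'tablepress-343', None]

# find() returns None when nothing matches, which is why a lookup
# with the wrong id prints None.
print(soup.find("table", {"id": "tablepress-342-no-2"}))  # None
```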


import requests
from bs4 import BeautifulSoup

url = 'https://www.playnj.com/atlantic-city/revenue/'
r = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
print(r.status_code)

soup = BeautifulSoup(r.text, "html.parser")

october_table = soup.find('table', {'id': 'tablepress-342'})
#print(october_table)
for row in october_table.find_all('tr'):
    for item in row.find_all('td'):
        print(item.text)
    print('---')

Result

200
---
Bally's
$3,799,907 
$180,229 
$9,107,610 
$13,087,746 
---
Borgata
$14,709,145 
$1,060,246 
$35,731,777 
 $51,501,168 
---
Caesars
$7,097,502 
$ -
$14,689,045 
$21,786,547 
---
Golden Nugget
$3,311,223 
$84,387 
$11,356,285 
$14,751,895 
---
Hard Rock
$7,849,617 
$ -
$16,619,183 
$24,468,800 
---
Harrah's
$4,507,262 
$205,921 
$19,372,672 
$24,085,855 
---
Ocean Resort
$5,116,397 
$65,276 
$13,245,998 
$18,427,671 
---
Resorts
$2,257,149 
$ -
$9,859,813 
$12,116,962 
---
Tropicana
$4,377,139 
$152,876 
$17,501,139 
$22,031,154 
---
Total
$53,025,341 
$1,748,935 
$147,483,522 
$202,257,798 
---
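The first --- in the result presumably comes from the header row, which holds th cells and therefore has no td to print. If the headers are wanted too, the rows could be collected into dictionaries keyed by header text. A sketch on made-up inline HTML; table_to_records is a hypothetical helper, and it assumes the first row is the header:

```python
from bs4 import BeautifulSoup


def table_to_records(table):
    """Turn a bs4 <table> Tag into a list of {header: cell text} dicts."""
    rows = table.find_all("tr")
    # Assumption: the first row carries the column names in <th> cells.
    headers = [th.get_text(strip=True) for th in rows[0].find_all("th")]
    records = []
    for row in rows[1:]:
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        if cells:  # skip rows without data cells
            records.append(dict(zip(headers, cells)))
    return records


# Made-up markup standing in for the real table.
html = """
<table id="tablepress-342">
  <tr><th>Casino</th><th>Poker</th></tr>
  <tr><td>Borgata</td><td> $1,060,246 </td></tr>
  <tr><td>Caesars</td><td>$ -</td></tr>
</table>
"""
table = BeautifulSoup(html, "html.parser").find("table", {"id": "tablepress-342"})
print(table_to_records(table))
# [{'Casino': 'Borgata', 'Poker': '$1,060,246'}, {'Casino': 'Caesars', 'Poker': '$ -'}]
```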
