
Scraping a table with too many rows

I want to use Python to get all the tables on the site 'https://www.tgju.org/archive/price_dollar_rl'.

I wrote:

import requests
import pandas as pd

url = 'https://www.tgju.org/archive/price_dollar_rl'
html = requests.get(url).content

df_list = pd.read_html(html)
df = df_list[-1]

print(df)
df.to_csv('myy data.csv')

But only one of the 95 tables is saved. What should I do to save all of them?

Well, first of all, the URL in the text and the URL in the code are different.

Second, the site uses pagination, so you would need something like Selenium to script clicking the site's "next" button, then grab the HTML and convert it to CSV.

To get all the pages, you can emulate the Ajax request and load the data directly from the API:

import requests
import pandas as pd
from bs4 import BeautifulSoup

# DataTables-style query parameters, mirroring the page's own Ajax request
query = {
    "lang": "fa",
    "order_dir": ["asc", ""],
    "draw": "9",
    "columns[0][data]": "0",
    "columns[0][name]": "",
    "columns[0][searchable]": "true",
    "columns[0][orderable]": "true",
    "columns[0][search][value]": "",
    "columns[0][search][regex]": "false",
    "columns[1][data]": "1",
    "columns[1][name]": "",
    "columns[1][searchable]": "true",
    "columns[1][orderable]": "true",
    "columns[1][search][value]": "",
    "columns[1][search][regex]": "false",
    "columns[2][data]": "2",
    "columns[2][name]": "",
    "columns[2][searchable]": "true",
    "columns[2][orderable]": "true",
    "columns[2][search][value]": "",
    "columns[2][search][regex]": "false",
    "columns[3][data]": "3",
    "columns[3][name]": "",
    "columns[3][searchable]": "true",
    "columns[3][orderable]": "true",
    "columns[3][search][value]": "",
    "columns[3][search][regex]": "false",
    "columns[4][data]": "4",
    "columns[4][name]": "",
    "columns[4][searchable]": "true",
    "columns[4][orderable]": "true",
    "columns[4][search][value]": "",
    "columns[4][search][regex]": "false",
    "columns[5][data]": "5",
    "columns[5][name]": "",
    "columns[5][searchable]": "true",
    "columns[5][orderable]": "true",
    "columns[5][search][value]": "",
    "columns[5][search][regex]": "false",
    "columns[6][data]": "6",
    "columns[6][name]": "",
    "columns[6][searchable]": "true",
    "columns[6][orderable]": "true",
    "columns[6][search][value]": "",
    "columns[6][search][regex]": "false",
    "columns[7][data]": "7",
    "columns[7][name]": "",
    "columns[7][searchable]": "true",
    "columns[7][orderable]": "true",
    "columns[7][search][value]": "",
    "columns[7][search][regex]": "false",
    "start": "0",
    "length": "30",
    "search": "",
    "order_col": "",
    "from": "",
    "to": "",
    "convert_to_ad": "1",
    # "_": "1624699477042"
}

url = "https://api.accessban.com/v1/market/indicator/summary-table-data/price_dollar_rl"

out = []
for start in range(0, 10):  # <-- increase number of pages here
    print("Getting page {}...".format(start))
    query["start"] = start * 30

    data = requests.get(url, params=query).json()
    out.extend(data["data"])

df = pd.DataFrame(out)
# columns 4 and 5 come back as HTML fragments (<span> wrappers); keep only the text
df[4] = df[4].apply(lambda x: BeautifulSoup(x, "html.parser").text)
df[5] = df[5].apply(lambda x: BeautifulSoup(x, "html.parser").text)

print(df)
df.to_csv("data.csv", index=False)

Prints:

           0        1        2        3      4       5           6           7
0    241,690  241,190  242,440  241,890    100   0.04%  2021/06/24   1400/04/3
1    243,310  240,790  243,340  241,790    880   0.36%  2021/06/23   1400/04/2
2    241,190  241,190  243,140  242,670   1680    0.7%  2021/06/22   1400/04/1
3    239,940  239,190  241,440  240,990   1390   0.58%  2021/06/21  1400/03/31
4    234,810  234,690  240,440  239,600   4710   2.01%  2021/06/20  1400/03/30
5    244,490  234,690  244,640  234,890   9400      4%  2021/06/19  1400/03/29
6    242,010  241,950  244,640  244,290   2540   1.05%  2021/06/17  1400/03/27
7    240,470  239,450  242,250  241,750   1260   0.52%  2021/06/16  1400/03/26
8    239,970  239,950  240,050  240,490    770   0.32%  2021/06/15  1400/03/25
9    240,970  238,550  241,050  239,720   1310   0.55%  2021/06/14  1400/03/24
10   238,970  238,940  241,250  241,030   3280   1.38%  2021/06/13  1400/03/23
11   236,830  236,140  238,350  237,750   1480   0.62%  2021/06/12  1400/03/22
12   240,010  239,140  240,450  239,230   2210   0.92%  2021/06/10  1400/03/20

...

and saves data.csv (the original answer included a screenshot from LibreOffice).
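The hard-coded `range(0, 10)` in the loop above means you have to guess the number of pages. Once you know the total row count (DataTables-style APIs usually report one in their JSON response; the field name varies, so check the actual payload), a small helper can compute every `start` offset. This is a sketch, not part of the original answer:

```python
def page_offsets(total_rows, page_length=30):
    """Return every `start` offset needed to cover `total_rows` rows
    when the API serves `page_length` rows per request."""
    return list(range(0, total_rows, page_length))

# 95 rows at 30 per page need requests starting at 0, 30, 60 and 90:
print(page_offsets(95))  # [0, 30, 60, 90]
```

You would then loop `for start in page_offsets(total_rows):` and set `query["start"] = start` directly, instead of multiplying a page index by 30.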

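The answer strips the HTML wrappers in columns 4 and 5 with BeautifulSoup. If you'd rather avoid the bs4 dependency, the standard library's `html.parser` can do the same job; a minimal sketch (my addition, not from the original answer):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect only the text content of an HTML fragment."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def strip_tags(fragment):
    """Return the fragment's text with all tags removed."""
    parser = TextExtractor()
    parser.feed(fragment)
    return "".join(parser.parts)

print(strip_tags('<span class="high">0.04%</span>'))  # 0.04%
```

In the answer's code you would then write `df[4] = df[4].apply(strip_tags)` instead of the BeautifulSoup lambda.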


Disclaimer: the technical posts on this site are licensed under CC BY-SA 4.0. If you repost, please credit this site or the original source. For any questions, contact yoyou2525@163.com.

 
粵ICP備18138465號  © 2020-2024 STACKOOM.COM