Scraping a table with too many rows
I want to scrape all of the tables on the website 'https://www.tgju.org/archive/price_dollar_rl' using Python.
I wrote:
import requests
import pandas as pd
url = 'https://www.tgju.org/archive/price_dollar_rl'
html = requests.get(url).content
df_list = pd.read_html(html)
df = df_list[-1]
print(df)
df.to_csv('myy data.csv')
But only one of the 95 tables gets saved. What should I do to save all of them?
Well, first of all, the URL in the question text and the URL in the code are different.
Second, the site uses pagination, so you would need something like Selenium to script pressing the site's "next" button, grab the HTML of each page, and then convert it all to CSV (a sketch of that approach follows below).
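For illustration, here is a minimal sketch of that Selenium approach, assuming a Chrome driver is available; the CSS selector for the pager's "next" button is a guess based on typical DataTables markup and may need adjusting for this site:

import time
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://www.tgju.org/archive/price_dollar_rl")

frames = []
for _ in range(10):  # number of pages to walk through
    time.sleep(2)  # crude wait for the table to render
    frames.append(pd.read_html(driver.page_source)[-1])
    # "a.paginate_button.next" is an assumed selector, not verified for this site
    driver.find_element(By.CSS_SELECTOR, "a.paginate_button.next").click()
driver.quit()

df = pd.concat(frames, ignore_index=True)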
To get all the pages, you can instead simulate the Ajax request the page makes and load the data directly from the API:
import requests
import pandas as pd
from bs4 import BeautifulSoup

# DataTables-style query parameters, as sent by the page's own Ajax request:
query = {
    "lang": "fa",
    "order_dir": ["asc", ""],
    "draw": "9",
    "start": "0",
    "length": "30",
    "search": "",
    "order_col": "",
    "from": "",
    "to": "",
    "convert_to_ad": "1",
    # "_": "1624699477042"
}

# Each of the table's 8 columns takes the same boilerplate parameters:
for i in range(8):
    query[f"columns[{i}][data]"] = str(i)
    query[f"columns[{i}][name]"] = ""
    query[f"columns[{i}][searchable]"] = "true"
    query[f"columns[{i}][orderable]"] = "true"
    query[f"columns[{i}][search][value]"] = ""
    query[f"columns[{i}][search][regex]"] = "false"

url = "https://api.accessban.com/v1/market/indicator/summary-table-data/price_dollar_rl"

out = []
for start in range(0, 10):  # <-- increase the number of pages here
    print(f"Getting page {start}...")
    query["start"] = start * 30  # each page holds 30 rows
    data = requests.get(url, params=query).json()
    out.extend(data["data"])

df = pd.DataFrame(out)
# Columns 4 and 5 arrive wrapped in HTML tags; strip them down to plain text:
df[4] = df[4].apply(lambda x: BeautifulSoup(x, "html.parser").text)
df[5] = df[5].apply(lambda x: BeautifulSoup(x, "html.parser").text)
print(df)
df.to_csv("data.csv", index=False)
This prints:
0 1 2 3 4 5 6 7
0 241,690 241,190 242,440 241,890 100 0.04% 2021/06/24 1400/04/3
1 243,310 240,790 243,340 241,790 880 0.36% 2021/06/23 1400/04/2
2 241,190 241,190 243,140 242,670 1680 0.7% 2021/06/22 1400/04/1
3 239,940 239,190 241,440 240,990 1390 0.58% 2021/06/21 1400/03/31
4 234,810 234,690 240,440 239,600 4710 2.01% 2021/06/20 1400/03/30
5 244,490 234,690 244,640 234,890 9400 4% 2021/06/19 1400/03/29
6 242,010 241,950 244,640 244,290 2540 1.05% 2021/06/17 1400/03/27
7 240,470 239,450 242,250 241,750 1260 0.52% 2021/06/16 1400/03/26
8 239,970 239,950 240,050 240,490 770 0.32% 2021/06/15 1400/03/25
9 240,970 238,550 241,050 239,720 1310 0.55% 2021/06/14 1400/03/24
10 238,970 238,940 241,250 241,030 3280 1.38% 2021/06/13 1400/03/23
11 236,830 236,140 238,350 237,750 1480 0.62% 2021/06/12 1400/03/22
12 240,010 239,140 240,450 239,230 2210 0.92% 2021/06/10 1400/03/20
...
and saves data.csv (the original answer showed a screenshot of the file opened in LibreOffice).
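As a possible refinement: if this endpoint follows the standard DataTables response format (the "draw" parameter in the query suggests it does), the JSON should also carry a recordsTotal field, so the page count could be derived instead of hard-coded. A minimal sketch, assuming that field exists in the response:

first = requests.get(url, params=query).json()
total_rows = first["recordsTotal"]  # assumption: standard DataTables response field
pages = -(-total_rows // 30)        # ceiling division: 30 rows per page
# then loop with: for start in range(pages): ... as above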