Fetch the first few rows of a table from an HTML page in Python

I am making a GET request to this website through Python.

https://www.nhc.noaa.gov/gis/forecast/archive/?C=M;O=D

However, the following Python code downloads an HTML page containing a huge table:

import requests

url = 'https://www.nhc.noaa.gov/gis/forecast/archive/?C=M;O=D'
r = requests.get(url)  # downloads the entire page before returning
url_list = r.text

This takes a lot of time and space.

Is there a way to download only the first N rows of the table on this page?

Use streaming, and set the chunk size to control how much data comes back on each read. You can iterate over the chunks until you have as many links as you want. It will probably overshoot by a few, depending on the chunk size, but it will get you pretty close.

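import re
import requests

n_rows = 100
url = 'https://www.nhc.noaa.gov/gis/forecast/archive/?C=M;O=D'

links = []
# stream=True defers downloading the body until we iterate over it
with requests.get(url, stream=True) as r:
    for chunk in r.iter_content(chunk_size=500000):
        # chunk is bytes, so decode it before running the regex
        text = chunk.decode('utf-8', errors='ignore')
        # keep hrefs that do not start with '?' or '/' (skips sort and parent links)
        links.extend(re.findall(r'href="([^?\/].*?)"', text))
        if len(links) >= n_rows:
            break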
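
One caveat with the loop above: a link that straddles two chunks will be missed, and the list can end up longer than n_rows. Below is a minimal sketch of one way to handle both, assuming the page is UTF-8 and that every link sits inside a tag closed by `>`. The helper name `first_links` and the pattern `HREF_RE` are hypothetical, not part of requests. The function takes any iterable of byte chunks, so it can be tried without a network connection.

```python
import re

# Hypothetical pattern: href values that do not start with '?' or '/'
HREF_RE = re.compile(r'href="([^?/"][^"]*)"')


def first_links(chunks, n_rows):
    """Collect at most n_rows href values from an iterable of byte chunks."""
    links = []
    tail = ""  # text after the last complete tag, carried into the next chunk
    for chunk in chunks:
        text = tail + chunk.decode("utf-8", errors="ignore")
        # Search only up to the last '>' so a half-received tag is not lost
        cut = text.rfind(">") + 1
        links.extend(HREF_RE.findall(text[:cut]))
        tail = text[cut:]
        if len(links) >= n_rows:
            break  # stop reading; later chunks are never requested
    return links[:n_rows]


# Possible usage with requests (not run here, needs network access):
# with requests.get(url, stream=True) as r:
#     links = first_links(r.iter_content(chunk_size=65536), 100)
```

A smaller chunk size means less data is fetched past the Nth row, at the cost of more iterations.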
