Get the first few rows of a table from an HTML page in Python

Question

I am making a GET request to this website through Python.

https://www.nhc.noaa.gov/gis/forecast/archive/?C=M;O=D

However, the following Python code downloads an HTML page with a huge table:

import requests
url = 'https://www.nhc.noaa.gov/gis/forecast/archive/?C=M;O=D'
r = requests.get(url)
url_list = r.text

This takes a lot of time and space.

Is there a way to download only the first N rows of the table on this page?

Answer 1

Use streaming, and set the chunk size to control how much data comes back at a time. You can iterate over the chunks until you have as many links as you want. It will probably overshoot by a few, depending on the chunk size, but it will get you pretty close.

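import re
import requests

n_rows = 100
url = 'https://www.nhc.noaa.gov/gis/forecast/archive/?C=M;O=D'

links = []
# stream=True defers downloading the body until it is iterated over
with requests.get(url, stream=True) as r:
    for chunk in r.iter_content(chunk_size=500000):
        # chunk is bytes, so decode it before matching;
        # the pattern skips hrefs that start with '?' or '/'
        text = chunk.decode('utf-8', errors='ignore')
        links.extend(re.findall(r'href="([^?\/].*?)"', text))
        # stop reading once enough links have been collected
        if len(links) >= n_rows:
            break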
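
One caveat with matching each chunk on its own: a link that happens to straddle a chunk boundary is missed. The following is a minimal sketch of one way around that, not part of the original answer. It buffers the bytes after the last newline and parses them together with the next chunk, assuming no href attribute contains a newline. The names first_links and fetch_first_links, the default chunk size, and the slightly simplified pattern are illustrative choices.

```python
import re
from typing import Iterable, List

# Same idea as the pattern above: skip hrefs that start with '?' or '/'.
HREF_RE = re.compile(r'href="([^?/][^"]*)"')


def first_links(chunks: Iterable[bytes], n_rows: int) -> List[str]:
    """Collect up to n_rows href targets from an iterable of byte chunks."""
    links: List[str] = []
    buffer = b""  # bytes after the last newline, not parsed yet
    for chunk in chunks:
        buffer += chunk
        # Parse only complete lines; keep the remainder for the next chunk.
        cut = buffer.rfind(b"\n") + 1
        complete, buffer = buffer[:cut], buffer[cut:]
        links.extend(HREF_RE.findall(complete.decode("utf-8", errors="ignore")))
        if len(links) >= n_rows:
            break
    else:
        # The stream ended before n_rows links were found: parse what is left.
        links.extend(HREF_RE.findall(buffer.decode("utf-8", errors="ignore")))
    return links[:n_rows]


def fetch_first_links(url: str, n_rows: int, chunk_size: int = 65536) -> List[str]:
    """Stream url and stop reading once n_rows links have been collected."""
    # Imported here so first_links can be used without requests installed.
    import requests

    with requests.get(url, stream=True) as r:
        r.raise_for_status()
        return first_links(r.iter_content(chunk_size=chunk_size), n_rows)
```

Calling fetch_first_links(url, 100) with the url from the question should return at most 100 entries. A smaller chunk size reduces how far the download overshoots, at the cost of more loop iterations.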

Solution 1
0 2020-12-11 18:07:03
