
Any way to download the data with custom queries from url in python?

I want to download the data from the USDA site with custom queries. So instead of manually selecting queries on the website, I am thinking about how to do this more handily in python. To do so, I used requests and http to access the url and read the content, but it is not intuitive to me how I should pass the queries, then make a selection and download the data as csv. Does anyone know how to do this easily in python? Is there any workaround to download the data from the url with specific queries? Any idea?

This is my current attempt:

Here is the url from which I am going to select the data with custom queries.

import io
import requests
import pandas as pd

url="https://www.marketnews.usda.gov/mnp/ls-report-retail?&repType=summary&portal=ls&category=Retail&species=BEEF&startIndex=1"
s=requests.get(url).content
df=pd.read_csv(io.StringIO(s.decode('utf-8')))

So before reading the requested data into pandas, I need to pass the following queries for the correct data selection:

Category = "Retail"
Report Type = "Item"
Species = "Beef"
Region(s) = "National"
Start Dates = "2020-01-01"
End Date = "2021-02-08"

It is not intuitive to me how I should pass the queries with the request and then download the filtered data as csv. Is there any efficient way of doing this in python? Any thoughts? Thanks

A few details:

  • The simplest format is text rather than HTML. Got the URL from the HTML page's link for the text download.
  • The params= argument to requests is a dict. Built it up and passed it, so there is no need to build the complete URL string by hand.
  • The text is clearly space delimited, with a minimum of a double space between columns.
import io
import requests
import pandas as pd

url="https://www.marketnews.usda.gov/mnp/ls-report-retail"
p = {"repType":"summary","species":"BEEF","portal":"ls","category":"Retail","format":"text"}
r = requests.get(url, params=p)
df = pd.read_csv(io.StringIO(r.text), sep="\s\s+", engine="python")

         Date         Region Feature Rate Outlets Special Rate Activity Index
0  02/05/2021       NATIONAL       69.40%  29,200       20.10%         81,650
1  02/05/2021      NORTHEAST       75.00%   5,500        3.80%         17,520
2  02/05/2021      SOUTHEAST       70.10%   7,400       28.00%         23,980
3  02/05/2021        MIDWEST       75.10%   6,100       19.90%         17,430
4  02/05/2021  SOUTH CENTRAL       57.90%   4,900       26.40%          9,720
5  02/05/2021      NORTHWEST       77.50%   1,300        2.50%          3,150
6  02/05/2021      SOUTHWEST       63.20%   3,800       27.50%          9,360
7  02/05/2021         ALASKA       87.00%     200         .00%            290
8  02/05/2021         HAWAII       46.70%     100         .00%            230
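
To reproduce the exact selection from the question (Item report, National region, 2020-01-01 to 2021-02-08) with this params approach and save the result, the same dict can be extended. This is only a sketch: the region, repDate and endDate parameter names (and the repType value for the Item report) are assumptions taken from the query names visible in the site's URL, as described in the next answer, and the output filename is made up.

import io
import requests
import pandas as pd

url = "https://www.marketnews.usda.gov/mnp/ls-report-retail"
p = {
    "repType": "item",        # Report Type = "Item" from the question; exact value is an assumption
    "portal": "ls",
    "category": "Retail",
    "species": "BEEF",
    "region": "NATIONAL",     # Region(s) = "National"
    "repDate": "01/01/2020",  # start date; requests URL-encodes the "/" as %2F
    "endDate": "02/08/2021",  # end date
    "format": "text",
}
r = requests.get(url, params=p)
df = pd.read_csv(io.StringIO(r.text), sep=r"\s\s+", engine="python")
df.to_csv("usda_retail_beef_items.csv", index=False)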

Just format the query data in the URL - it's actually a REST API.

To add more query data, as @mullinscr said, you can change the values on the left and press submit, then see the query name in the URL (for example, the start date is called repDate).

If you hover over the Download as XML link, you will also discover you can specify the download format using format=<format_name>. Parsing the tabular data in XML using pandas might be easier, so I would append format=xml at the end as well.

category = "Retail"
report_type = "Item"
species = "BEEF"
regions = "NATIONAL"
start_date = "01-01-2019"
end_date = "01-01-2021"

# the website expects dates as MM/DD/YYYY, with the "/" URL-encoded as "%2F"
start_date = start_date.replace("-", "%2F")
end_date = end_date.replace("-", "%2F")

url = f"https://www.marketnews.usda.gov/mnp/ls-report-retail?runReport=true&portal=ls&startIndex=1&category={category}&repType={report_type}&species={species}&region={regions}&repDate={start_date}&endDate={end_date}&compareLy=No&format=xml"

# parse with pandas, etc...
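
To finish the "parse with pandas" step, one option (not from the original answer) is pandas.read_xml, available in pandas 1.3+ with lxml installed. The element layout of the USDA XML response isn't shown here, so the default xpath may need adjusting; a minimal sketch:

import io
import requests
import pandas as pd

r = requests.get(url)   # url is the f-string built above, ending in format=xml
r.raise_for_status()

# read_xml flattens the child elements of the XML root into rows;
# pass xpath="..." if the records are nested deeper in the document
df = pd.read_xml(io.StringIO(r.text))
df.to_csv("beef_retail_items.csv", index=False)  # hypothetical output filename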
