
Any way to download the data with custom queries from url in python?

I want to download the data from the USDA site with custom queries. So instead of manually selecting queries on the website, I am thinking about how to do this more handily in python. To do so, I used requests and http to access the url and read the content, but it is not intuitive to me how I should pass the queries, then make a selection and download the data as csv. Does anyone know how to do this easily in python? Is there any workaround to download the data from the url with specific queries? Any idea?

This is my current attempt:

Here is the url from which I am going to select the data with custom queries.

import io
import requests
import pandas as pd

url="https://www.marketnews.usda.gov/mnp/ls-report-retail?&repType=summary&portal=ls&category=Retail&species=BEEF&startIndex=1"
s=requests.get(url).content
df=pd.read_csv(io.StringIO(s.decode('utf-8')))

So before reading the requested data into pandas, I need to pass the following queries for the correct data selection:

Category = "Retail"
Report Type = "Item"
Species = "Beef"
Region(s) = "National"
Start Dates = "2020-01-01"
End Date = "2021-02-08"

It is not intuitive to me how I should pass the queries with the request and then download the filtered data as csv. Is there any efficient way of doing this in python? Any thoughts? Thanks

A few details:

  • The simplest format is text rather than HTML. Got the URL from the HTML page's link for the text download.
  • The params= argument to requests is a dict. Built it up and passed it, so there is no need to build the complete URL string by hand.
  • The text is clearly space delimited, with a minimum of a double space between columns.
import io
import requests
import pandas as pd

url="https://www.marketnews.usda.gov/mnp/ls-report-retail"
p = {"repType":"summary","species":"BEEF","portal":"ls","category":"Retail","format":"text"}
r = requests.get(url, params=p)
df = pd.read_csv(io.StringIO(r.text), sep="\s\s+", engine="python")

         Date         Region Feature Rate Outlets Special Rate Activity Index
0  02/05/2021       NATIONAL       69.40%  29,200       20.10%         81,650
1  02/05/2021      NORTHEAST       75.00%   5,500        3.80%         17,520
2  02/05/2021      SOUTHEAST       70.10%   7,400       28.00%         23,980
3  02/05/2021        MIDWEST       75.10%   6,100       19.90%         17,430
4  02/05/2021  SOUTH CENTRAL       57.90%   4,900       26.40%          9,720
5  02/05/2021      NORTHWEST       77.50%   1,300        2.50%          3,150
6  02/05/2021      SOUTHWEST       63.20%   3,800       27.50%          9,360
7  02/05/2021         ALASKA       87.00%     200         .00%            290
8  02/05/2021         HAWAII       46.70%     100         .00%            230
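
To reproduce the exact selection from the question (Item report, National region, 2020-01-01 to 2021-02-08) with this params approach and save the result, the same dict can be extended. This is only a sketch: the region, repDate and endDate parameter names (and the repType value for the Item report) are assumptions taken from the query names visible in the site's URL, as described in the next answer, and the output filename is made up.

import io
import requests
import pandas as pd

url = "https://www.marketnews.usda.gov/mnp/ls-report-retail"
p = {
    "repType": "item",        # Report Type = "Item" from the question; exact value is an assumption
    "portal": "ls",
    "category": "Retail",
    "species": "BEEF",
    "region": "NATIONAL",     # Region(s) = "National"
    "repDate": "01/01/2020",  # start date; requests URL-encodes the "/" as %2F
    "endDate": "02/08/2021",  # end date
    "format": "text",
}
r = requests.get(url, params=p)
df = pd.read_csv(io.StringIO(r.text), sep=r"\s\s+", engine="python")
df.to_csv("usda_retail_beef_items.csv", index=False)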

Just format the query data in the URL - it's actually a REST API.

To add more query data, as @mullinscr said, you can change the values on the left and press submit, then see the query name in the URL (for example, the start date is called repDate).

If you hover over the Download as XML link, you will also discover you can specify the download format using format=<format_name>. Parsing the tabular data in XML using pandas might be easier, so I would append format=xml at the end as well.

category = "Retail"
report_type = "Item"
species = "BEEF"
regions = "NATIONAL"
start_date = "01-01-2019"
end_date = "01-01-2021"

# the website expects dates as MM/DD/YYYY, with the "/" URL-encoded as "%2F"
start_date = start_date.replace("-", "%2F")
end_date = end_date.replace("-", "%2F")

url = f"https://www.marketnews.usda.gov/mnp/ls-report-retail?runReport=true&portal=ls&startIndex=1&category={category}&repType={report_type}&species={species}&region={regions}&repDate={start_date}&endDate={end_date}&compareLy=No&format=xml"

# parse with pandas, etc...
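
To finish the "parse with pandas" step, one option (not from the original answer) is pandas.read_xml, available in pandas 1.3+ with lxml installed. The element layout of the USDA XML response isn't shown here, so the default xpath may need adjusting; a minimal sketch:

import io
import requests
import pandas as pd

r = requests.get(url)   # url is the f-string built above, ending in format=xml
r.raise_for_status()

# read_xml flattens the child elements of the XML root into rows;
# pass xpath="..." if the records are nested deeper in the document
df = pd.read_xml(io.StringIO(r.text))
df.to_csv("beef_retail_items.csv", index=False)  # hypothetical output filename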
