简体   繁体   中英

how to correctly fetch dataframe from url in python?

I want to automate data fetching from USDA website, where I am specifically interested in few categories for data selection. To do so, I tried the following:

import io
import requests
import pandas as pd

url = 'https://www.marketnews.usda.gov/mnp/ls-report-retail?&repType=summary&portal=ls&category=Retail&species=BEEF&startIndex=1'

query_list = {"Report Type":"item","species":"BEEF","portal":"ls","category":"Retail", "Regions":"National", "Grades":"ALL", "Cut": "All", "Dates_from":"2019-03-01", "Dates_to":"2021-02-01"}
req = requests.get(url, params=query_list)
df = pd.read_csv(io.StringIO(req.text), sep="\s\s+", engine="python")
df.to_csv("usda_report.csv")

but I couldn't get the expected dataframe that I want, here is the output that after I tried to run above attempt:

ParserError: Expected 1 fields in line 117, saw 2. Error could possibly be due to quotes being ignored when a multi-char delimiter is used.

desired output

I need to pass these queries to do correct data selection: Category = "Retail"; Report Type = "Item"; Species = "Beef"; Region(s) = "National"; Dates_from = "2019-03-01"; Dates_to = "2021-02-15" Category = "Retail"; Report Type = "Item"; Species = "Beef"; Region(s) = "National"; Dates_from = "2019-03-01"; Dates_to = "2021-02-15" Category = "Retail"; Report Type = "Item"; Species = "Beef"; Region(s) = "National"; Dates_from = "2019-03-01"; Dates_to = "2021-02-15" .

ideally, I want to pass those queries and want to get the following dataframe (head of dataframe):

在此处输入图像描述

update

in my desired outputs, I need those columns: Date , Region , Grade , Cut , Retail Items , Outlets or number of stores , Weighted Avg

from the above attempt, I couldn't get the output dataframe like this. How should I fetch data correctly? Can anyone suggest possible of doing this right in pandas? any idea?

You must add the query parameter format=text to get the data in csv format from this web site.

url = 'https://www.marketnews.usda.gov/mnp/ls-report-retail'
query_list = {
    "format":"text", 
    "repType":"item",
    "species":"BEEF",
    "portal":"ls",
    "region":"NATIONAL",
    "cut":"0",
    "repDate":"03/01/2019", 
    "endDate":"02/01/2021",
}
req = requests.get(url, params=query_list)
df = pd.read_csv(io.StringIO(req.text), sep="\s\s+", engine="python")

You might have to modify the query parameters further. You can use the web site with your browser and change the filters you want. Then you can convert the current query parameters in the url to json with this command in the javascript terminal.

JSON.stringify(Object.fromEntries(new URLSearchParams(location.search)), null, 2)

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM