
Python: requests.get gets the wrong html file

I'm trying to scrape data from https://essentials.swissdox.ch, which is only reachable over a VPN. I generated a URL containing my query parameters and tried to fetch the corresponding HTML file. The problem is that, although the link works, Python gives me the HTML of the start page of https://essentials.swissdox.ch instead. I would really appreciate any help!

Example: I want the html file of the following url: https://essentials.swissdox.ch/View/log/index.jsp#&search=true&filter_de=la&sortorder=pubDateTime%20desc&formdata=%5B%7B%22name%22%3A%22SEARCH_mltid%22%2C%22value%22%3A%22%22%7D%2C%7B%22name%22%3A%22SEARCH_sc%22%2C%22value%22%3A%22swissdox%22%7D%2C%7B%22name%22%3A%22SEARCH_query%22%2C%22value%22%3A%22lissabon%22%7D%2C%7B%22name%22%3A%22filter_de%22%2C%22value%22%3A%22de%22%7D%2C%7B%22name%22%3A%22SEARCH_exact%22%2C%22value%22%3A%22true%22%7D%2C%7B%22name%22%3A%22dateDropdown%22%2C%22value%22%3A%22-1%22%7D%2C%7B%22name%22%3A%22SEARCH_pubDate_lower%22%2C%22value%22%3A%222020-02-04%22%7D%2C%7B%22name%22%3A%22SEARCH_pubDate_upper%22%2C%22value%22%3A%222020-02-04%22%7D%2C%7B%22name%22%3A%22SEARCH_tiall%22%2C%22value%22%3A%22%22%7D%2C%7B%22name%22%3A%22SEARCH_source%22%2C%22value%22%3A%22%22%7D%2C%7B%22name%22%3A%22SEARCH_author%22%2C%22value%22%3A%22%22%7D%5D

Instead I get the html file of this page: https://essentials.swissdox.ch/View/log/index.jsp?reset=true

Here is what I have so far:

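import urllib.parse
import urllib.request

import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent  # assumed source of UserAgent

# Set keywords for URL
keyword_queries = ['lissabon']
startdate = "2007-01-01"
enddate = "2007-01-01"

# Encode and hit URL
for keyword in keyword_queries:
    html_keyword = urllib.parse.quote_plus(keyword)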
    URL = "https://essentials.swissdox.ch/View/log/index.jsp#&search=true&sortorder=pubDateTime%20desc&formdata=%5B%7B%22name%22%3A%22SEARCH_mltid%22%2C%22value%22%3A%22%22%7D%2C%7B%22name%22%3A%22SEARCH_sc%22%2C%22value%22%3A%22swissdox%22%7D%2C%7B%22name%22%3A%22SEARCH_query%22%2C%22value%22%3A%22" + html_keyword + "%22%7D%2C%7B%22name%22%3A%22SEARCH_exact%22%2C%22value%22%3A%22true%22%7D%2C%7B%22name%22%3A%22dateDropdown%22%2C%22value%22%3A%22-1%22%7D%2C%7B%22name%22%3A%22SEARCH_pubDate_lower%22%2C%22value%22%3A%22" + startdate + "%22%7D%2C%7B%22name%22%3A%22SEARCH_pubDate_upper%22%2C%22value%22%3A%22" + enddate + "%22%7D%2C%7B%22name%22%3A%22SEARCH_tiall%22%2C%22value%22%3A%22%22%7D%2C%7B%22name%22%3A%22SEARCH_source%22%2C%22value%22%3A%22%22%7D%2C%7B%22name%22%3A%22SEARCH_author%22%2C%22value%22%3A%22%22%7D%5D"
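    weburl = urllib.request.urlopen(URL)

    # Hit the URL with a random User-Agent header
    ua = UserAgent()
    # headers= is needed here: a dict passed as the second positional
    # argument of requests.get is treated as query params, not headers
    page = requests.get(URL, headers={"User-Agent": ua.random})
    soup = BeautifulSoup(page.content, "html.parser")
    # Look for the container that should hold the search results
    results = soup.find('div', class_='documentlist')
    print(page.content)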

It looks like you used '#' instead of '?' in your URL. Usually '?' starts the query string, whose parameters are written as key=value pairs separated by '&'.
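A quick way to check how Python interprets such a URL is to split it with the standard library. The snippet below is only a sketch using a shortened version of the URL from the question; it shows that everything after '#' ends up in the fragment and the query string stays empty.

```python
from urllib.parse import urlsplit

# Shortened version of the URL from the question (same structure, fewer parameters)
url = (
    "https://essentials.swissdox.ch/View/log/index.jsp"
    "#&search=true&sortorder=pubDateTime%20desc"
)

parts = urlsplit(url)

print("path:    ", parts.path)      # the resource that is actually requested
print("query:   ", repr(parts.query))     # empty, because there is no '?'
print("fragment:", parts.fragment)  # everything after '#'
```

Since HTTP clients do not transmit the fragment, only the path (and an empty query) reaches the server.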

Everything after '#' is a fragment identifier, which tells the browser to jump to a specific section of the page. It is not sent to the server, so the request goes to https://essentials.swissdox.ch/View/log/index.jsp, which is the response you are getting. Changing '#' to '?' seems to produce an error about invalid characters in the original URL, so make sure you use valid characters when percent-encoding the query parameters.
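One way to avoid hand-written percent-encoding is to let `json.dumps` and `urllib.parse.urlencode` build the query string. The sketch below is an assumption-laden illustration: `build_search_url` is a hypothetical helper, the field names are copied from the URL in the question, and it is not confirmed that the server accepts these values as real query parameters (the site may read them from the fragment with JavaScript instead).

```python
import json
from urllib.parse import urlencode, urlsplit, parse_qs

BASE_URL = "https://essentials.swissdox.ch/View/log/index.jsp"


def build_search_url(keyword, startdate, enddate):
    """Hypothetical helper: build the search URL with '?' and safe encoding."""
    # Field names taken from the formdata in the question's URL
    formdata = [
        {"name": "SEARCH_mltid", "value": ""},
        {"name": "SEARCH_sc", "value": "swissdox"},
        {"name": "SEARCH_query", "value": keyword},
        {"name": "SEARCH_exact", "value": "true"},
        {"name": "dateDropdown", "value": "-1"},
        {"name": "SEARCH_pubDate_lower", "value": startdate},
        {"name": "SEARCH_pubDate_upper", "value": enddate},
        {"name": "SEARCH_tiall", "value": ""},
        {"name": "SEARCH_source", "value": ""},
        {"name": "SEARCH_author", "value": ""},
    ]
    params = {
        "search": "true",
        "sortorder": "pubDateTime desc",
        # Compact JSON, matching the shape of the original formdata value
        "formdata": json.dumps(formdata, separators=(",", ":")),
    }
    # urlencode percent-encodes every key and value
    return BASE_URL + "?" + urlencode(params)


search_url = build_search_url("lissabon", "2007-01-01", "2007-01-01")
print(search_url)

# Round-trip check: the query string decodes back to the same form data
decoded = json.loads(parse_qs(urlsplit(search_url).query)["formdata"][0])
print(decoded[2])
```

With `requests`, the same `params` dictionary could instead be passed as `requests.get(BASE_URL, params=params)`, which performs the encoding itself.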

Wiki - URL Syntax
