
Python: how to fill out a web form and get the resulting page source

I am trying to write a Python script that will scrape http://www.fakenewsai.com/ and tell me whether or not a news article is fake news. I want the script to enter a given news article into the website's URL input field and hit the submit button. Then, I want to scrape the website to determine whether the article is "fake" or "real" news, as displayed on the site.

I was successful in accomplishing this using selenium and ChromeDriver, but the script was very slow (>2 minutes) and did not run on Heroku (using flask). For reference, here is the code I used:

from selenium import webdriver
import time

def fakeNews(url):
  # Strip the scheme; note the original slices ([8:-1]) also chopped off
  # the last character of the URL, which was a bug.
  if url.startswith("https://"):
    url = url[8:]
  elif url.startswith("http://"):
    url = url[7:]
  browser = webdriver.Chrome("static/chromedriver.exe")
  browser.get("http://www.fakenewsai.com")
  element = browser.find_element_by_id("url")
  element.send_keys(url)
  button = browser.find_element_by_id("submit")
  button.click()
  time.sleep(1)  # crude wait for the result to render
  site = browser.page_source
  result = ""
  if(site[site.index("opacity: 1")-10] == "e"):
    result = "Fake News"
  else:
    result = "Real News"
  browser.quit()
  return result

print(fakeNews('https://www.nytimes.com/2019/11/02/opinion/sunday/instagram-social-media.html'))

I have attempted to replicate this code using other Python libraries, such as mechanicalsoup, pyppeteer, and scrapy. However, as a beginner at Python, I have not found much success. I was hoping someone could point me in the right direction with a solution.

For the stated purpose, in my opinion it would be much simpler to analyze the website, understand its functionality, and then automate the browser's behavior instead of the user's behavior.

Try hitting F12 in your browser while on the website, open the Network tab, paste a URL into the input box, and hit submit: you will see that the page sends an HTTP OPTIONS request and then a POST request to a URL. The server then returns a JSON response as the result.

So, you can use Python's requests module (docs) to automate that very POST request instead of writing very complex code that simulates clicks and scrapes the result.

A very simple example you can build on is:

import json
import requests


def fake_news():
    url = 'https://us-central1-fake-news-ai.cloudfunctions.net/detect/'
    payload = {'url': 'https://www.nytimes.com/'}
    # Content-Length and Host are omitted on purpose: requests computes them
    # itself, and a hard-coded Content-Length would be wrong for other payloads.
    headers = {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate, br', 'Accept-Language': 'en-US,en;q=0.5',
               'Connection': 'keep-alive', 'Content-type': 'application/json; charset=utf-8',
               'DNT': '1', 'Origin': 'http://www.fakenewsai.com',
               'Referer': 'http://www.fakenewsai.com/', 'TE': 'Trailers',
               'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:70.0) Gecko/20100101 Firefox/70.0'}

    response_json = requests.post(url, data=json.dumps(payload), headers=headers).text
    response = json.loads(response_json)
    is_fake = int(response['fake'])

    if is_fake == 0:
        print("Not fake")
    elif is_fake == 1:
        print("Fake")
    else:
        print("Invalid response from server")


if __name__ == "__main__":
    fake_news()
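As a side note, requests can serialize the payload and fill in Content-Type and Content-Length by itself via the json= keyword, which makes most of the hand-copied headers optional. A minimal sketch of this (building a PreparedRequest so nothing is actually sent over the network):

```python
import json
import requests

# Sketch: the json= keyword serializes the dict and sets the Content-Type
# and Content-Length headers automatically. Preparing the request (without
# sending it) shows exactly what would go over the wire.
req = requests.Request(
    'POST',
    'https://us-central1-fake-news-ai.cloudfunctions.net/detect/',
    json={'url': 'https://www.nytimes.com/'},
)
prepared = req.prepare()

print(prepared.headers['Content-Type'])   # application/json
print(json.loads(prepared.body))          # {'url': 'https://www.nytimes.com/'}
```

To actually send it, `requests.post(url, json=payload)` does the same serialization in one call.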

PS: It would be fair to contact the owner of the website to discuss using his or her infrastructure for your project.

The main slowdown occurs when starting the Chrome browser and loading the first URL. Note that you are launching a browser for each request. You can launch the browser once in an initialization step and only do the automation parts per request. This will greatly improve performance.
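If you do stay with selenium, the idea above can be sketched like this (untested against the live site; the optional `driver` argument is not part of the original code and exists only so the class can be exercised without launching a real browser):

```python
import time


class FakeNewsChecker:
    """Reuses a single browser across calls instead of launching one per request."""

    def __init__(self, driver=None):
        if driver is None:
            # Launched once, at start-up, instead of inside every call.
            from selenium import webdriver
            driver = webdriver.Chrome("static/chromedriver.exe")
        self.browser = driver

    def check(self, url):
        self.browser.get("http://www.fakenewsai.com")
        box = self.browser.find_element_by_id("url")
        box.clear()                 # reset the field between requests
        box.send_keys(url)
        self.browser.find_element_by_id("submit").click()
        time.sleep(1)               # same crude wait as the original
        site = self.browser.page_source
        if site[site.index("opacity: 1") - 10] == "e":
            return "Fake News"
        return "Real News"
```

On Heroku with flask, you would create one `FakeNewsChecker` at app start-up and call `checker.check(url)` inside the request handler, so the Chrome launch cost is paid only once.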
