Incomplete HAR list using Python: Browsermobproxy, selenium, phantomJS

I'm fairly new to Python and I learn by doing, so I thought I'd give this project a shot. I'm trying to create a script that finds the Google Analytics request for a certain website, parses the request payload, and does something with it.

Here are the requirements:

  1. Ask the user for 2 URLs (for comparing the payloads from 2 different HARs)
  2. Use selenium to open the two URLs, and use browsermobproxy/phantomJS to get the full HAR for each
  3. Store the HAR as a list
  4. From the list of all HAR entries, find the Google Analytics request, including the payload
  5. If the Google Analytics tag is found, then do things with it: parse the payload, compare the payloads, etc.
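Steps 4 and 5 could look roughly like the sketch below. It is written for Python 3 (in Python 2 the same functions live in the `urlparse` module that the script already imports), and it assumes the hit is sent as a GET with the payload in the query string. The function names `parse_payload` and `diff_payloads` and the sample URLs are made up for illustration.

```python
from urllib.parse import urlsplit, parse_qs


def parse_payload(request_url):
    """Return the query-string parameters of a request URL as a flat dict."""
    query = urlsplit(request_url).query
    parsed = parse_qs(query, keep_blank_values=True)
    # parse_qs maps each key to a list of values; keep the last one per key.
    return {key: values[-1] for key, values in parsed.items()}


def diff_payloads(first, second):
    """Return {key: (value_in_first, value_in_second)} for keys that differ."""
    keys = sorted(set(first) | set(second))
    return {
        key: (first.get(key), second.get(key))
        for key in keys
        if first.get(key) != second.get(key)
    }


if __name__ == "__main__":
    # Hypothetical request URLs standing in for two captured GA hits.
    a = parse_payload("https://www.google-analytics.com/r/collect?v=1&tid=UA-1-1&t=pageview")
    b = parse_payload("https://www.google-analytics.com/r/collect?v=1&tid=UA-2-1&t=pageview")
    print(diff_payloads(a, b))
```

A key that is missing from one payload shows up with `None` on that side, which makes missing parameters as visible as changed ones.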

Issue: Sometimes, for a website that I know has Google Analytics (e.g. nytimes.com), the HAR that I get is incomplete. My program will say "GA Not found", but that's only because the complete HAR was not captured, so when the regex ran to find the matching entry it wasn't there. This issue is intermittent and does not happen all the time. Any ideas?

I'm thinking that due to some dependency or latency, the script moved on and the complete HAR didn't get captured. I tried the "wait for traffic to stop" call, but maybe I didn't do something right.
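One way to rule out the timing theory is to poll the HAR until the request shows up or a deadline passes, instead of reading it once right after `driver.get()`. The sketch below is only an illustration of that idea: `wait_for_request` is a made-up helper, the pattern is the one from the script with the dot escaped, and the timeout values are arbitrary. It is written for Python 3. Separately, as far as I recall the Python client documents both arguments of `wait_for_traffic_to_stop` as milliseconds, so `(15, 30)` would wait 30 ms at most; that is worth checking against the installed version.

```python
import re
import time

# The pattern from the script, with the dot escaped so it only matches a literal ".".
GA_PATTERN = re.compile(r"google-analytics\.com/r\b")


def wait_for_request(get_har, pattern, timeout=30.0, interval=0.5):
    """Poll get_har() until an entry URL matches pattern or timeout seconds pass.

    get_har is any zero-argument callable that returns a HAR dict,
    for example `lambda: proxy.har`.
    Returns the matching URL, or None if the deadline passes first.
    """
    deadline = time.monotonic() + timeout
    while True:
        for entry in get_har()["log"]["entries"]:
            url = entry["request"]["url"]
            if pattern.search(url):
                return url
        if time.monotonic() >= deadline:
            return None
        time.sleep(interval)
```

In the script this would go right after `driver.get(urlList)`, called as `wait_for_request(lambda: proxy.har, GA_PATTERN)`. Because it returns as soon as the request appears, it also avoids the fixed `time.sleep(2.0)` pauses.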

Also, as a bonus, I would appreciate any help you can provide on how to make this script run faster; it's fairly slow. As I mentioned, I'm new to Python, so go easy :)

This is what I've got thus far.

import browsermobproxy as mob
from selenium import webdriver
import re
import sys
import urlparse
import time
from datetime import datetime


def cleanup():
    s.stop()
    driver.quit()

proxy_path = '/Users/bob/Downloads/browsermob-proxy-2.1.4-bin/browsermob-proxy-2.1.4/bin/browsermob-proxy'
s = mob.Server(proxy_path)
s.start()
proxy = s.create_proxy()
proxy_address = "--proxy=127.0.0.1:%s" % proxy.port
service_args = [proxy_address, '--ignore-ssl-errors=yes', '--ssl-protocol=any']  # so that i can do https connections
driver = webdriver.PhantomJS(executable_path='/Users/bob/Downloads/phantomjs-2.1.1-windows/phantomjs-2.1.1-windows/bin/phantomjs', service_args=service_args)
driver.set_window_size(1400, 1050)

urlLists = []
collectTags = []
gaCollect = 0
varList = []

for x in range(0,2): # I want to ask the user for 2 inputs
    url = raw_input("Enter a website to find GA on: ")
    time.sleep(2.0)
    urlLists.append(url)

    if not url:
        print "You need to type something in...here"
        sys.exit()
    #gets the two user url and stores in list

for urlList in urlLists:

    print urlList, 'start 2nd loop' #printing for debug purpose, no need for this

    if not urlList:
        print 'Your Url list is empty'
        sys.exit()

    proxy.new_har()
    driver.get(urlList)
    #proxy.wait_for_traffic_to_stop(15, 30) #<-- tried this but did not do anything

    for ent in proxy.har['log']['entries']:
        gaCollect = (ent['request']['url'])

        print gaCollect

        if re.search(r'google-analytics.com/r\b', gaCollect):

            print 'Found GA'
            collectTags.append(gaCollect)
            time.sleep(2.0)
            break
    else:

        print 'No GA Found - Ending Prog.'
        cleanup()
        sys.exit()

cleanup()

This might be a stale question, but I found an answer that worked for me.

You need to change two things:

1 - Remove sys.exit(). This causes your programme to stop after the first iteration through the ent list, so if what you want is not the first thing, it won't be found.

2 - Call new_har with the captureContent option enabled to get the payload of requests: proxy.new_har(options={'captureHeaders':True, 'captureContent': True})
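With content capture switched on, the payload can be read from each HAR entry rather than re-parsed from the URL. This is a rough sketch in Python 3, assuming the proxy fills in the standard HAR 1.2 fields `request.queryString` and, for POSTed hits, `request.postData.text`; whether the body is actually captured depends on the proxy settings. `request_payload` is a made-up helper name.

```python
from urllib.parse import parse_qsl


def request_payload(entry):
    """Collect the parameters of one HAR entry's request into a dict.

    Query-string parameters come from request.queryString; a form-encoded
    body, when one was captured, comes from request.postData.text.
    """
    request = entry["request"]
    params = {item["name"]: item["value"] for item in request.get("queryString", [])}
    body = request.get("postData", {}).get("text", "")
    # parse_qsl yields (name, value) pairs; an empty body yields no pairs.
    params.update(parse_qsl(body, keep_blank_values=True))
    return params
```

Calling `request_payload(ent)` on the matching entry inside the `for ent in proxy.har['log']['entries']` loop gives a dict that can then be compared between the two sites.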

See if that helps.
