Incomplete HAR list using Python: Browsermobproxy, selenium, phantomJS
Fairly new to Python; I learn by doing, so I thought I'd give this project a shot. I'm trying to create a script that finds the Google Analytics request for a given website, parses the request payload, and does something with it.
Here are the requirements:
Issue: Sometimes, for a website that I know has Google Analytics (e.g. nytimes.com), the HAR I get is incomplete, i.e. my program says "GA Not found", but only because the complete HAR was not captured, so when the regex ran to find the matching entry it wasn't there. This issue is intermittent and does not happen every time. Any ideas?
I'm thinking that due to some dependency or latency, the script moved on before the complete HAR was captured. I tried "wait for traffic to stop", but maybe I didn't use it correctly.
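One possible workaround for the latency problem (not from the original post, shown in Python 3 style) is to poll the proxy's HAR yourself until the number of captured entries stops growing between polls. The `FakeProxy` class below is a hypothetical stand-in used only to demonstrate the helper; in the real script you would pass the browsermobproxy client, whose `.har` property returns the current HAR dict.

```python
import time

def wait_for_har_to_settle(proxy, poll_interval=1.0, max_wait=30.0):
    """Poll proxy.har until the entry count stops growing between two
    consecutive polls, or until max_wait seconds have elapsed.
    Returns the final entry count seen."""
    deadline = time.time() + max_wait
    previous = -1
    while time.time() < deadline:
        current = len(proxy.har['log']['entries'])
        if current == previous:
            return current  # no new requests arrived since the last poll
        previous = current
        time.sleep(poll_interval)
    return previous

# Tiny stand-in for a browsermob proxy client, used only to demo the helper:
class FakeProxy(object):
    def __init__(self):
        self.polls = 0

    @property
    def har(self):
        self.polls += 1
        n = min(self.polls, 3)  # simulated traffic "stops" after 3 entries
        return {'log': {'entries': [{}] * n}}

settled = wait_for_har_to_settle(FakeProxy(), poll_interval=0.01, max_wait=5.0)
```

This is coarser than `wait_for_traffic_to_stop` (it only watches entry counts, not in-flight requests), but it is easy to reason about and tune.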
Also, as a bonus, I'd appreciate any help on how to make this script run faster; it's fairly slow. As I mentioned, I'm new to Python, so go easy :)
This is what I've got thus far.
import browsermobproxy as mob
from selenium import webdriver
import re
import sys
import urlparse
import time
from datetime import datetime
def cleanup():
    s.stop()
    driver.quit()

proxy_path = '/Users/bob/Downloads/browsermob-proxy-2.1.4-bin/browsermob-proxy-2.1.4/bin/browsermob-proxy'
s = mob.Server(proxy_path)
s.start()
proxy = s.create_proxy()
proxy_address = "--proxy=127.0.0.1:%s" % proxy.port
service_args = [proxy_address, '--ignore-ssl-errors=yes', '--ssl-protocol=any']  # so that I can do https connections
driver = webdriver.PhantomJS(executable_path='/Users/bob/Downloads/phantomjs-2.1.1-windows/phantomjs-2.1.1-windows/bin/phantomjs', service_args=service_args)
driver.set_window_size(1400, 1050)

urlLists = []
collectTags = []
gaCollect = 0
varList = []

for x in range(0, 2):  # I want to ask the user for 2 inputs
    url = raw_input("Enter a website to find GA on: ")
    time.sleep(2.0)
    urlLists.append(url)
    if not url:
        print "You need to type something in...here"
        sys.exit()

# gets the two user urls and stores them in a list
for urlList in urlLists:
    print urlList, 'start 2nd loop'  # printing for debug purposes, no need for this
    if not urlList:
        print 'Your Url list is empty'
        sys.exit()
    proxy.new_har()
    driver.get(urlList)
    # proxy.wait_for_traffic_to_stop(15, 30)  # <-- tried this but it did not do anything
    for ent in proxy.har['log']['entries']:
        gaCollect = (ent['request']['url'])
        print gaCollect
        if re.search(r'google-analytics.com/r\b', gaCollect):
            print 'Found GA'
            collectTags.append(gaCollect)
            time.sleep(2.0)
            break
        else:
            print 'No GA Found - Ending Prog.'
            cleanup()
            sys.exit()

cleanup()
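For context on what the script's regex is matching: `google-analytics.com/r\b` targets request URLs on the `/r/collect` beacon path. A small self-contained sketch (the sample URLs are illustrative, not taken from a real HAR):

```python
import re

# The question's pattern, unchanged. Note the unescaped dots match any
# character, and /r\b matches the /r/collect beacon path.
pattern = re.compile(r'google-analytics.com/r\b')

# Illustrative URLs (not captured from a real site):
urls = [
    'https://www.google-analytics.com/r/collect?v=1&t=pageview',
    'https://www.google-analytics.com/analytics.js',
    'https://www.nytimes.com/',
]

matches = [u for u in urls if pattern.search(u)]
```

Two caveats: escaping the dots (`r'google-analytics\.com/r\b'`) would make the pattern stricter, and hits on the plain `/collect` path (without the `/r/` prefix) are not matched by this pattern at all, which could also explain some "GA Not found" results.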
This might be a stale question, but I found an answer that worked for me.
You need to change two things:
1 - Remove sys.exit() -- this causes your programme to stop after the first iteration through the ent list, so if what you want is not the first entry, it won't be found.
2 - Call new_har with the captureContent option enabled to get the payload of requests: proxy.new_har(options={'captureHeaders': True, 'captureContent': True})
See if that helps.
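To make point 1 concrete, here is the control flow the fix produces, sketched in Python 3 against a mock HAR dict (in the real script the dict comes from `proxy.har` after `proxy.new_har(...)` and `driver.get(url)`): scan every entry, and only conclude "not found" after the whole loop, rather than exiting on the first non-matching entry.

```python
import re

# Minimal mock of proxy.har, just to illustrate the loop structure:
har = {'log': {'entries': [
    {'request': {'url': 'https://www.nytimes.com/'}},
    {'request': {'url': 'https://www.google-analytics.com/r/collect?v=1'}},
]}}

found = None
for ent in har['log']['entries']:
    url = ent['request']['url']
    if re.search(r'google-analytics.com/r\b', url):
        found = url
        break  # stop at the first GA hit...
# ...but only decide "not found" once the whole entry list has been scanned
if found:
    print('Found GA:', found)
else:
    print('No GA found in this HAR')
```

With the original code, the first entry (the page itself) fails the regex and the if/else branch calls sys.exit() immediately, so the GA beacon on the second entry is never reached.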