urllib + cookielib packet manipulation in Python

I'm working on a project that accesses a specific site, runs a search, and then filters and returns the value. The program logs in and then runs the search, saving the cookie in a cookie jar to authenticate the connection while it runs the search. However, when I run the program it returns no results, and the packet header looks completely different. What am I doing wrong that makes the search always return no results?

Here is my code:

import cookielib, urllib, urllib2

file= open('results.txt', 'wb')
cj=cookielib.CookieJar()
opener=urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
opener.addheaders=[('Referer', 'http:// site that runs the search/psc/p01ps1/EMPLOYEE/CRM/c/BANNER_TAP.SRCH_ATDO_TAP.GBL')]
opener.addheaders=[('User-Agent', 'Mozilla/5.0 (Windows NT 5.1; rv:22.0) Gecko/20100101 Firefox/22.0')]

posts={'timezoneOffset':'180', 'userid':'user', 'pwd':'password', 'Submit':'Signon'}
data = urllib.urlencode(posts)
opens=opener.open('loginpage.com', data)
print cj
file.write(opens.read())
cjs=str(cj)

posts2 = urllib.urlencode({'ICType':'Panel', 'ICElementNum':0, 'ICStateNum':1, 'ICAction':'SRCH_ATD_TAP_WK_SRCH_PB', 'ICXPos':0, 'ICYPos':0, 'ICFocus':'', 'ICChanged':1, 'ICResubmit':0, 'ICFind':'', 'SRCH_ATD_TAP_WK_MSISDN_TAP':'', 'SRCH_ATD_TAP_WK_CNPJ_TAP':'', 'SRCH_ATD_TAP_WK_STATUS_RA_TAP':'', 'SRCH_ATD_TAP_WK_INTERACTION_ID':'', 'SRCH_ATD_TAP_WK_CASE_ID':48373914, 'SRCH_ATD_TAP_WK_PROTOCOLO_TAP':'', 'SRCH_ATD_TAP_WK_DATA_INI_BAN_TAP':'', 'SRCH_ATD_TAP_WK_HORA_INI_RA_TAP':'', 'SRCH_ATD_TAP_WK_DATA_FIM_BAN_TAP':'', 'SRCH_ATD_TAP_WK_HORA_FIM_BAN_TAP':'', 'SRCH_ATD_TAP_WK_MOTIVO_ID1_TAP':0, 'SRCH_ATD_TAP_WK_MOTIVO_ID2_TAP':0, 'SRCH_ATD_TAP_WK_MOTIVO_ID3_TAP':0, 'SRCH_ATD_TAP_WK_MOTIVO_ID4_TAP':0, 'SRCH_ATD_TAP_WK_MOTIVO_ID5_TAP':0, 'SRCH_ATD_TAP_WK_COMPANY_TYPE_TAP':'','SRCH_ATD_TAP_WK_SUBTIPO_CLI_TAP':''})
url2='searchpage.com'
opens2 = opener.open(url2, posts2)
str=opens2.read()
print cj
file.write(str + cjs)
file.close()

It connects to the login page first to save the cookie, and then it connects to the search page. Again, this is only meant to be used on one site, so the connections and POST data are very specific.

Again, this code doesn't return any results (I checked by searching the str variable, which holds the entire unfiltered page).

Here are the results I get when capturing the requests with Wireshark. The first one is the site running in Firefox, doing the search in a normal browser (including the POST data sent), and the second one is my program running and automating the search for me.

POST /psc/p01ps1/EMPLOYEE/CRM/c/BANNER_TAP.SRCH_ATDO_TAP.GBL HTTP/1.1
Host: siteroot
User-Agent: Mozilla/5.0 (Windows NT 5.1; rv:22.0) Gecko/20100101 Firefox/22.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: pt-BR,pt;q=0.8,en-US;q=0.5,en;q=0.3
Accept-Encoding: gzip, deflate
Referer: site that runs the search/BANNER_TAP.SRCH_ATDO_TAP.GBL #note I wasn't able to create this header.
Cookie: SignOnDefault=my login id; PS_LOGINLIST=http:// siteroot; brux0128-claro-com-br-7090-PORTAL-PSJSESSIONID=dpLmTCpY8vTmj4nMHbpyptPMdvphpRLR!841308261; ExpirePage=http:// siteroot/psp/p01ps1/; PS_TOKEN=AAAAogECAwQAAQAAAAACvAAAAAAAAAAsAARTaGRyAgBOcQgAOAAuADEAMBSfJDUA/BR2T3ekF0/cVhdJ7uJlpgAAAGIABVNkYXRhVnicHYpBCoAgFESfFi2jixRqYrgO2hbWvjN0vw7X5B94bxg+8BjbtBh09v05kJlxpGq1joOd0ksnGxc3KyUS9OSJjHIQPUtlYNLqK52Ya5Li+ABuIwtr; http%3a%2f%2fsiteroot%2fpsp%2fp01ps1%2femployee%2fcrm%2frefresh=list:||||||; PS_360=PS_360_BO_ID_CUST!0!PS_360_CUST_SETID!!PS_360_BO_ID_CONT!0!PS_360_BO_ID_SITE!0!PS_360_CUST_ROLE!0!PS_360_CONT_ROLE!0!PS_360_BO_ID!0!PS_360_VIEW_OPTION!False; PS_TOKENEXPIRE=18_Feb_2014_00:04:41_GMT; HPTabName=DEFAULT
Connection: keep-alive
Content-Type: application/x-www-form-urlencoded
Content-Length: 683

POST DATA: ICType=Panel&ICElementNum=0&ICStateNum=17&ICAction=SRCH_ATD_TAP_WK_SRCH_PB&ICXPos=0&ICYPos=84&ICFocus=&ICChanged=1&ICResubmit=0&ICFind=&SRCH_ATD_TAP_WK_MSISDN_TAP=&SRCH_ATD_TAP_WK_CNPJ_TAP=&SRCH_ATD_TAP_WK_STATUS_RA_TAP=&SRCH_ATD_TAP_WK_INTERACTION_ID=&SRCH_ATD_TAP_WK_CASE_ID=48373914&SRCH_ATD_TAP_WK_PROTOCOLO_TAP=&SRCH_ATD_TAP_WK_DATA_INI_BAN_TAP=&SRCH_ATD_TAP_WK_HORA_INI_RA_TAP=&SRCH_ATD_TAP_WK_DATA_FIM_BAN_TAP=&SRCH_ATD_TAP_WK_HORA_FIM_BAN_TAP=&SRCH_ATD_TAP_WK_MOTIVO_ID1_TAP=0&SRCH_ATD_TAP_WK_MOTIVO_ID2_TAP=0&SRCH_ATD_TAP_WK_MOTIVO_ID3_TAP=0&SRCH_ATD_TAP_WK_MOTIVO_ID4_TAP=0&SRCH_ATD_TAP_WK_MOTIVO_ID5_TAP=0&SRCH_ATD_TAP_WK_COMPANY_TYPE_TAP=&SRCH_ATD_TAP_WK_SUBTIPO_CLI_TAP=



POST /psc/p01ps1/EMPLOYEE/CRM/c/BANNER_TAP.SRCH_ATDO_TAP.GBL HTTP/1.1
Accept-Encoding: identity
Content-Length: 681
Host: siteroot
User-Agent: Mozilla/5.0 (Windows NT 5.1; rv:22.0) Gecko/20100101 Firefox/22.0
Connection: close
Cookie: PS_TOKEN=AAAAogECAwQAAQAAAAACvAAAAAAAAAAsAARTaGRyAgBOcQgAOAAuADEAMBSX+ZILWKx7oU/VKvJbVT8LbueJtwAAAGIABVNkYXRhVnicJYpLCoAwDAWnVVyKF1Hsh2rXgluluvcM3s/DGWNCZh6PALexVY1Bxj4fOzKBkaSW1LCzUVrRwcrJxUKJeHlyRHqxFzomZWCQZlYm5b9Z7gVtawtT; ExpirePage=siteroot; PS_LOGINLIST=siteroot; PS_TOKENEXPIRE=18_Feb_2014_00:08:09_GMT; brux0128-claro-com-br-7090-PORTAL-PSJSESSIONID=QG14TCkJK7PpfRtNH0CSCw9S1m6jtRR9!841308261; SignOnDefault=my login id; http%3a%2f%2fsiteroot%2fpsp%2fp01ps1%2femployee%2fcrm%2frefresh=list:
Content-Type: application/x-www-form-urlencoded

POST DATA: SRCH_ATD_TAP_WK_DATA_INI_BAN_TAP=&SRCH_ATD_TAP_WK_MOTIVO_ID4_TAP=0&ICResubmit=0&ICXPos=0&SRCH_ATD_TAP_WK_DATA_FIM_BAN_TAP=&SRCH_ATD_TAP_WK_PROTOCOLO_TAP=&SRCH_ATD_TAP_WK_SUBTIPO_CLI_TAP=&SRCH_ATD_TAP_WK_MOTIVO_ID3_TAP=0&ICAction=SRCH_ATD_TAP_WK_SRCH_PB&SRCH_ATD_TAP_WK_MOTIVO_ID5_TAP=0&ICElementNum=0&SRCH_ATD_TAP_WK_INTERACTION_ID=&ICType=Panel&SRCH_ATD_TAP_WK_STATUS_RA_TAP=&SRCH_ATD_TAP_WK_COMPANY_TYPE_TAP=&SRCH_ATD_TAP_WK_HORA_FIM_BAN_TAP=&SRCH_ATD_TAP_WK_MOTIVO_ID2_TAP=0&ICFind=&SRCH_ATD_TAP_WK_MOTIVO_ID1_TAP=0&SRCH_ATD_TAP_WK_HORA_INI_RA_TAP=&ICChanged=1&ICStateNum=1&ICYPos=0&ICFocus=&SRCH_ATD_TAP_WK_CASE_ID=48373914&SRCH_ATD_TAP_WK_MSISDN_TAP=&SRCH_ATD_TAP_WK_CNPJ_TAP=
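Because the two bodies list their fields in different orders, comparing them by eye is error-prone. The following is a minimal sketch, not part of the original post, that parses both bodies into dictionaries and reports only the fields whose values differ. The helper name `diff_form_bodies` is hypothetical, and the two strings are shortened stand-ins for the captures above. The imports fall back to the Python 2 module names used in the question.

```python
try:  # Python 3
    from urllib.parse import parse_qsl
except ImportError:  # Python 2, as used in the question
    from urlparse import parse_qsl


def diff_form_bodies(browser_body, script_body):
    """Return {field: (browser_value, script_value)} for fields that differ."""
    # keep_blank_values=True keeps empty fields such as "ICFocus="
    browser = dict(parse_qsl(browser_body, keep_blank_values=True))
    script = dict(parse_qsl(script_body, keep_blank_values=True))
    diffs = {}
    for name in sorted(set(browser) | set(script)):
        if browser.get(name) != script.get(name):
            diffs[name] = (browser.get(name), script.get(name))
    return diffs


# Shortened stand-ins for the two captured bodies (only a few fields shown)
browser_body = "ICType=Panel&ICStateNum=17&ICYPos=84&ICFocus=&SRCH_ATD_TAP_WK_CASE_ID=48373914"
script_body = "SRCH_ATD_TAP_WK_CASE_ID=48373914&ICFocus=&ICYPos=0&ICStateNum=1&ICType=Panel"

differences = diff_form_bodies(browser_body, script_body)
for name in sorted(differences):
    in_browser, in_script = differences[name]
    print("%s: browser=%r script=%r" % (name, in_browser, in_script))
```

In the full captures above, the same two fields differ: the browser sends ICStateNum=17 and ICYPos=84, while the script sends ICStateNum=1 and ICYPos=0, which also accounts for the Content-Length of 683 versus 681. Whether the server rejects the search because of either value is not something the captures can confirm.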

(This is for personal use at the company I work at, to simplify a task that currently has to be done manually around 500 times. The site registers protocols, and we need to search for each protocol to check whether it is closed or not; later the list of protocols will be imported from Excel.)

Note that I don't send the additional headers, but I can add them if that would solve the problem. For some reason my POST data also comes out in a different order (from what I understand about POST data, that shouldn't make a difference), and the cookie information is somewhat reordered as well, but I assume that shouldn't matter either, because cookie values are looked up much like entries in a Python dictionary.
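On the missing headers: in the code as posted, `opener.addheaders` is assigned twice, and the second assignment replaces the first list, which matches the second capture having a User-Agent but no Referer. The sketch below, which is not part of the original post, shows that effect and one way to send several headers by putting them in a single list. The host `siteroot` in the Referer is a placeholder taken from the capture, and the Accept values are copied from the browser request. No network access is needed to run it.

```python
try:  # Python 3
    from http.cookiejar import CookieJar
    from urllib.request import build_opener, HTTPCookieProcessor
except ImportError:  # Python 2, as used in the question
    from cookielib import CookieJar
    from urllib2 import build_opener, HTTPCookieProcessor

# Placeholder host; the real one is not given in the question
REFERER = "http://siteroot/psc/p01ps1/EMPLOYEE/CRM/c/BANNER_TAP.SRCH_ATDO_TAP.GBL"
USER_AGENT = "Mozilla/5.0 (Windows NT 5.1; rv:22.0) Gecko/20100101 Firefox/22.0"

cj = CookieJar()
opener = build_opener(HTTPCookieProcessor(cj))

# Pattern from the question: the second assignment replaces the first list
opener.addheaders = [("Referer", REFERER)]
opener.addheaders = [("User-Agent", USER_AGENT)]
overwritten = [name for name, _ in opener.addheaders]

# A single list keeps every header
opener.addheaders = [
    ("Referer", REFERER),
    ("User-Agent", USER_AGENT),
    ("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"),
    ("Accept-Language", "pt-BR,pt;q=0.8,en-US;q=0.5,en;q=0.3"),
]
kept = [name for name, _ in opener.addheaders]

print(overwritten)  # ['User-Agent']
print(kept)         # ['Referer', 'User-Agent', 'Accept', 'Accept-Language']
```

The browser's `Accept-Encoding: gzip, deflate` is left out on purpose: the urllib2 opener does not decompress responses on its own, so copying that header would return compressed bytes that the script would have to decode itself.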

I've been racking my brain over this little piece of code, rewriting it several times over the past two weeks, and I still can't get it to return the search results. It's also important to note that I won't be able to install a browser engine to execute the JavaScript, but I don't think that's necessary, because the results of the search done in Firefox show up in Wireshark, so the page is downloaded with the results in it. I was able to get mechanize running, but I haven't been able to try it yet. If there is a way to automate Firefox (I don't remember which version at the moment) with Python, that is an option I'm open to. One more thing: because I'm working on this project at work, I can't use any Python plugin that has to be installed. I got mechanize to work because I opened the package and copied the files over, without running setup.py. So, to keep things simple: I have no way to install libraries.

You don't have PS_360 set in your cookie. I'm not sure how essential this is, but the best strategy for working through these issues is to make your requests identical to the browser's, step by step. Probably the first request that sets your cookie was already different, or your browser has cookie data from previous requests that you need to create manually for your request.
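To make the step-by-step comparison concrete, here is a minimal sketch, not part of the original answer, that lists which cookie names the browser sends and the script does not. The helper `cookie_names` is hypothetical, and the two header strings are shortened stand-ins for the captured `Cookie:` lines, with the long values replaced by short dummy ones.

```python
def cookie_names(cookie_header):
    """Return the set of cookie names found in a 'Cookie:' header value."""
    names = set()
    for pair in cookie_header.split(";"):
        # Split on the first "=" only, since values may contain "=" themselves
        name = pair.strip().split("=", 1)[0]
        if name:
            names.add(name)
    return names


# Shortened stand-ins for the captured headers (dummy values, real names)
browser_cookie = (
    "SignOnDefault=id; PS_LOGINLIST=siteroot; PS_TOKEN=abc; "
    "ExpirePage=siteroot; PS_360=PS_360_BO_ID_CUST!0; "
    "PS_TOKENEXPIRE=18_Feb_2014; HPTabName=DEFAULT"
)
script_cookie = (
    "PS_TOKEN=xyz; ExpirePage=siteroot; PS_LOGINLIST=siteroot; "
    "PS_TOKENEXPIRE=18_Feb_2014; SignOnDefault=id"
)

missing = sorted(cookie_names(browser_cookie) - cookie_names(script_cookie))
print(missing)  # ['HPTabName', 'PS_360']
```

Run against the full captures in the question, the names missing from the script's request are PS_360 and HPTabName. One possible explanation, offered only as an assumption, is that the browser picked them up from a page it visited between login and search; requesting that page with the same opener before the search would let the cookie jar collect them.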
