
How to find the final destination (URL) of an ad (programmatically)

This may or may not be trivial, but I'm working on a piece of software that will verify the "end of the line" domain for ads displayed through my web application. I have a list of domains I do not want to serve ads from (let's say Norton.com is one of them), but most ad networks serve ads via shortened, cryptic URLs (adsrv.com) that eventually redirect to Norton.com. So the question is: has anyone built, or does anyone have an idea of how to build, a scraper-like tool that returns the final destination URL of an ad?

Initial discovery: Some ads are in Flash, JavaScript, or plain HTML. Emulating a browser is perfectly viable and would cope with the different ad formats. Not all Flash or JS ads have a noflash or noscript alternative. (A browser may be necessary, but as stated this is perfectly fine, using something like WatiN, WatiR, WatiJ, Selenium, etc.)

I'd prefer open source so that I could rebuild one myself. I really appreciate the help!

EDIT: This script needs to click on the ad, since it might be Flash, JS, or just plain HTML. So curl is less likely to be an option, unless curl can click?

Sample PHP implementation:

$k = curl_init('http://goo.gl');
curl_setopt($k, CURLOPT_FOLLOWLOCATION, true); // follow redirects
curl_setopt($k, CURLOPT_USERAGENT, 
'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.7 ' .
'(KHTML, like Gecko) Chrome/7.0.517.41 Safari/534.7'); // imitate chrome
curl_setopt($k, CURLOPT_NOBODY, true); // HEAD request only (faster)
curl_setopt($k, CURLOPT_RETURNTRANSFER, true); // don't echo results
curl_exec($k);
$final_url = curl_getinfo($k, CURLINFO_EFFECTIVE_URL); // get last URL followed
curl_close($k);
echo $final_url;

Which should return something like https://www.google.com/accounts/ServiceLogin?service=urlshortener&continue=http://goo.gl/?authed%3D1&followup=http://goo.gl/?authed%3D1&passive=true&go=true

Note: You might need to use curl_setopt() to turn off CURLOPT_SSL_VERIFYHOST and CURLOPT_SSL_VERIFYPEER if you want to reliably follow redirects across HTTPS/SSL.

curl --head -L -s -o /dev/null -w %{url_effective} <some-short-url>
  • --head restricts it to HEAD requests only, so that you don't have to actually download the pages

  • -L tells curl to keep following redirects

  • -s gets rid of any progress meters, etc.

  • -o /dev/null tells curl to throw away the headers retrieved (we don't care about them)

  • -w %{url_effective} tells curl to write the last fetched URL to stdout as the result

The result is that the effective URL is written to stdout, and nothing else.

You're talking about following the redirection of the URL until it either times out, gets into a loop, or resolves to a final address.

The Net::HTTP library documentation has a "Following Redirection" example.
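The three outcomes just described (timeout, loop, final address) can be sketched with Net::HTTP as below. This is a minimal illustration and not the example from the Net::HTTP documentation: the names HEAD_FETCHER and resolve_final_url, the 5-second timeouts, and the default of 10 hops are all assumptions. The fetcher is passed in as a lambda so the loop can be tried against a fake chain without touching the network.

```ruby
require 'net/http'
require 'uri'
require 'set'

# Hypothetical default fetcher: sends one HEAD request without following
# redirects and returns the Location header, or nil if it is not a redirect.
HEAD_FETCHER = lambda do |url|
  uri = URI(url)
  response = Net::HTTP.start(uri.host, uri.port,
                             use_ssl: uri.scheme == 'https',
                             open_timeout: 5, read_timeout: 5) do |http|
    http.head(uri.request_uri)
  end
  response.is_a?(Net::HTTPRedirection) ? response['location'] : nil
end

# Follows a redirect chain one hop at a time.
# Returns [url, status], where status is :ok, :loop, :too_many_hops or :timeout.
def resolve_final_url(start_url, max_hops: 10, fetcher: HEAD_FETCHER)
  seen = Set.new
  current = start_url
  max_hops.times do
    # Set#add? returns nil when the URL was already visited, i.e. a loop.
    return [current, :loop] unless seen.add?(current)
    location = fetcher.call(current)
    return [current, :ok] if location.nil?
    # Location may be relative, so resolve it against the current URL.
    current = URI.join(current, location).to_s
  end
  [current, :too_many_hops]
rescue Net::OpenTimeout, Net::ReadTimeout
  [current, :timeout]
end
```

Calling resolve_final_url('http://adsrv.example/a') would use the real HEAD fetcher (adsrv.example is a placeholder host). Like the other answers here, this only sees HTTP-level redirects.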

Also, Ruby's open-uri module follows redirects automatically, so I think you can ask it for the ending URL after you retrieve a page and find out where it landed.

require 'open-uri'

io = URI.open('http://google.com') # plain open() no longer accepts URLs on Ruby 3.0+
body = io.read
io.base_uri.to_s # => "http://www.google.com/"

Notice that the request for google.com was redirected, and base_uri reports the URL it ended up at (Google's / directory on www.google.com).

Both cases will only handle server redirects. For meta-redirects you'll have to look at the page's HTML, see where it redirects you, and go there.

This will get you started:

require 'nokogiri'

doc = Nokogiri::HTML('<meta http-equiv="REFRESH" content="0;url=http://www.the-domain-you-want-to-redirect-to.com">')

# Take everything after "url=", so a target URL that itself contains "=" survives.
redirect_url = (doc.at('meta[http-equiv="REFRESH"]')['content'][/url\s*=\s*(.+)/i, 1] rescue nil)
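The content value of a meta refresh is a delay followed by url=target, and the target may be quoted or relative to the page it came from. A small standard-library sketch of that parsing step follows. It assumes you have already pulled the content attribute out of the page (for instance with Nokogiri as above); meta_refresh_target is a hypothetical helper name.

```ruby
require 'uri'

# Extracts the target from a meta refresh "content" value such as
# "0;url=http://example.com/?a=1" and resolves it against base_url.
# Returns nil when the value carries no URL (for example just "5").
def meta_refresh_target(content, base_url)
  raw = content.to_s[/url\s*=\s*(.+)/i, 1]
  return nil if raw.nil?
  # Drop surrounding whitespace and optional single or double quotes.
  target = raw.strip.gsub(/\A['"]|['"]\z/, '')
  # Relative targets are resolved against the page that contained the tag.
  URI.join(base_url, target).to_s
rescue URI::InvalidURIError
  nil
end
```

The returned URL can then be requested again, since the page it points to may itself redirect.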

cURL can retrieve HTTP headers. Keep stepping through the chain until you no longer get a Location: header; the last Location: header you received is the final URL.

The Mechanize gem is handy for this:

  require 'mechanize'

  # url is the ad's click-through URL
  agent = Mechanize.new {|a| a.user_agent_alias = 'Windows IE 7'}
  page = agent.get(url) # HTTP redirects are followed by default
  final_url = page.uri.to_s

The solution I ended up using was simulating a browser, loading the ad, and clicking. The click was the key ingredient. The solutions offered by others were good for a given URL but would not handle Flash, JavaScript, etc. I appreciate everyone's help.
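However the landing URL is obtained, the last step from the question, comparing it against the list of unwanted domains, might look roughly like this. The list contents and the name blocked_destination? are illustrative assumptions. The check matches the domain itself and any subdomain of it.

```ruby
require 'uri'

# Hypothetical blocklist; the question names Norton.com as an example.
BLOCKED_DOMAINS = %w[norton.com].freeze

# True when the URL's host is a blocked domain or a subdomain of one.
def blocked_destination?(final_url, blocked = BLOCKED_DOMAINS)
  host = URI(final_url).host.to_s.downcase
  blocked.any? { |domain| host == domain || host.end_with?(".#{domain}") }
rescue URI::InvalidURIError
  false # unparseable input is treated as not blocked in this sketch
end
```

Matching on a leading dot keeps a host such as notnorton.com from being mistaken for norton.com.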


 