Extracting a URL of an image from Google search result page

Google has added a nice feature that gives you instant information about famous people. For example, when you search for "Barack Obama", you get a bio and a photo on the results page, so you may not have to visit any of the results to get that info.

Live sample: http://goo.gl/vf1ti3

What I'm trying to do is get the URL of the image on the left side of the instant info box. I want to accomplish that by running System.Text.RegularExpressions.Regex on the HTML code.

I can get the source of the result page with this code:

private void getInfoAboutCelebrities()
{
    try
    {
        string celebrityName = null;

        Dispatcher.Invoke((Action)delegate()
        {
            DisableUI();
            celebrityName = celebrityName_textBox.Text;
        });

        celebrityName = HttpUtility.UrlEncode(celebrityName);
        string queryURL = "http://www.google.com/search?q=" + celebrityName + "+Height&safe=active&oq=" + celebrityName + "+Height&gs_l=heirloom-serp.12...0.0.0.3140.0.0.0.0.0.0.0.0..0.0....0...1ac..24.heirloom-serp..0.0.0.hXJwfydNFhk";

        HttpWebRequest request = (HttpWebRequest)HttpWebRequest.Create(queryURL);
        request.ContentType = "application/x-www-form-urlencoded";
        request.UserAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:19.0) Gecko/20100101 Firefox/19.0";
        request.Method = "GET";
        // make request for web page
        HttpWebResponse response = (HttpWebResponse)request.GetResponse();
        StreamReader htmlSource = new StreamReader(response.GetResponseStream());

        string htmlStringSource = string.Empty;
        htmlStringSource = htmlSource.ReadToEnd();
        response.Close();

        // Extracting height
        var regex = new Regex(@"<span class=""kno-a-v"">(.*?)</span>");
        var match = regex.Match(htmlStringSource);
        var result = match.Groups[1].Value;

        ///////////////////////////////////////////////////////////
        // Extracting the photo (which I couldn't get to work)
        regex = new Regex(@"data:image/jpeg;base64(.*?)\x3d\x3d");
        match = regex.Match(htmlStringSource);
        ///////////////////////////////////////////////////////////

        result = HttpUtility.HtmlDecode(result);

        if (String.IsNullOrWhiteSpace(result))
            MessageBox.Show("Sorry, no such entry.", "Error", MessageBoxButton.OK, MessageBoxImage.Error);
        else
        {
            Dispatcher.Invoke((Action)delegate()
            {
                preloader_Image.Visibility = Visibility.Hidden;
                MessageBox.Show(result);
            });
        }
        Dispatcher.Invoke((Action)EnableUI);
    }
    catch (Exception ex)
    {
        MessageBox.Show(ex.Message, "Error");
    }
}

Can anyone tell me what regular expression I should use? (Because I can't even find the URL myself when viewing the source code!)

It's quite likely that the image URL isn't even in the HTML that you get back. There's a whole lot of JavaScript on that page. The page is intended to be viewed in a browser, which can run the JavaScript, download images, format the page, and so on. There's no guarantee that the information displayed is available in the HTML.

I suspect, however, that the image you're looking for is the embedded image that's base64-encoded near the end of the file. Search for imgthumb13, and you'll find it. You can probably convert that to binary and then decode the image, if you know the image format. (No, I don't.)
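As a rough illustration of that idea, here is a minimal sketch that pulls a base64 data URI out of a string and decodes it to bytes. The names EmbeddedImage, TryExtract and LooksLikeJpeg are made up for this example, and it assumes the page writes the trailing "=" padding either literally or as the JavaScript escape \x3d; the real page source may look different.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original question or answer.
public static class EmbeddedImage
{
    // Matches e.g. data:image/jpeg;base64,/9j/4AAQ...
    // Assumption: padding appears as "=" or as the literal text \x3d.
    // In a verbatim string, \\x3d matches a real backslash followed by "x3d".
    private static readonly Regex DataUri = new Regex(
        @"data:image/(?<type>[a-z]+);base64,(?<data>[A-Za-z0-9+/]+(?:=|\\x3d){0,2})");

    // Returns the decoded image bytes, or null when nothing usable is found.
    public static byte[] TryExtract(string html, out string imageType)
    {
        imageType = null;
        Match match = DataUri.Match(html);
        if (!match.Success)
            return null;

        imageType = match.Groups["type"].Value;
        // Turn escaped padding back into "=" before decoding.
        string base64 = match.Groups["data"].Value.Replace(@"\x3d", "=");
        try
        {
            return Convert.FromBase64String(base64);
        }
        catch (FormatException)
        {
            return null; // payload was truncated or not valid base64
        }
    }

    // JPEG files start with the bytes FF D8 FF.
    public static bool LooksLikeJpeg(byte[] bytes)
    {
        return bytes != null && bytes.Length >= 3
            && bytes[0] == 0xFF && bytes[1] == 0xD8 && bytes[2] == 0xFF;
    }
}
```

Note that in a .NET regex, \x3d is a hex escape for the character "=", so a pattern ending in \x3d\x3d looks for "==" rather than for the literal text \x3d\x3d. If the page really contains the escaped form, that alone could explain a failed match.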

Google's results pages are not at all designed to be read by bots or scrapers. In fact, Google frowns on using a scraper to read their results pages. If they determine that you're using a scraper on their pages, they'll block you. If you want to process Google search results, then you should be using the Google Search API.
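The answer doesn't say which API it has in mind. One option is Google's Custom Search JSON API, so the sketch below assumes that one: it builds an image-search request URL and reads the image links out of a JSON response. ImageSearch, BuildQueryUrl and ParseImageLinks are hypothetical names, the API key and search engine ID are placeholders you would have to supply, and the JSON parsing uses System.Text.Json, which is newer than the HttpWebRequest-era code in the question.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical helper; assumes the Custom Search JSON API.
public static class ImageSearch
{
    // apiKey and engineId are placeholders for your own credentials.
    public static string BuildQueryUrl(string apiKey, string engineId, string query)
    {
        return "https://www.googleapis.com/customsearch/v1"
            + "?key=" + Uri.EscapeDataString(apiKey)
            + "&cx=" + Uri.EscapeDataString(engineId)
            + "&searchType=image"
            + "&q=" + Uri.EscapeDataString(query);
    }

    // Collects the "link" field of every entry in the "items" array.
    // Returns an empty list when the response has no "items" array.
    public static List<string> ParseImageLinks(string json)
    {
        var links = new List<string>();
        using (JsonDocument doc = JsonDocument.Parse(json))
        {
            JsonElement items;
            if (!doc.RootElement.TryGetProperty("items", out items))
                return links;

            foreach (JsonElement item in items.EnumerateArray())
            {
                JsonElement link;
                if (item.TryGetProperty("link", out link))
                    links.Add(link.GetString());
            }
        }
        return links;
    }
}
```

You would download the URL from BuildQueryUrl with the same kind of request code as in the question and pass the response body to ParseImageLinks. Unlike the scraped HTML, the response is documented JSON, so no regular expression is needed.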

Also see "Any form of Google Search API available for C#?".

One other thing. Google is continually changing the format of their search results pages. Even when the pages look the same, the internal structure can be much different. You'll find that the code you write to scrape today's search results pages is likely to break next month. I learned that one the hard way.
