
How to interact with Internet Explorer (C++)

I have a school project I am working on. The outcome seems pointless in itself; I believe it's more about the experience gained along the way. What I am trying to do is submit an initial URL, pull all the URLs on that page, and visit them in order until I tell it to stop, recording every URL in a text file. So far, I am able to open an IE window and launch a webpage of my choosing. Now I need to know how to send IE to a new webpage using the same session, and how to scan and pull data from the websites I visit. Thanks for any help!

Here is my code so far:

#include <string>
#include <iostream>
#include <windows.h>
#include <stdio.h>
#include <tchar.h>

using namespace std;

int main( int argc, TCHAR *argv[] )
{
    std::string uRL, prog;
    int length, count;

    STARTUPINFO si;
    PROCESS_INFORMATION pi;

    ZeroMemory( &si, sizeof(si) );
    si.cb = sizeof(si);
    ZeroMemory( &pi, sizeof(pi) );

    //if( argc != 2 )
    //{
    //    printf("Usage: %s [cmdline]\n", argv[0]);
    //    system("PAUSE");
    //    return 0;
    //}

    std::cout << "Enter URL: ";
    std::cin >> uRL;

    prog = ("C:\\Program Files\\Internet Explorer\\iexplore.exe ") + uRL;

    char *cstr = new char[prog.length() + 1];
    strcpy(cstr, prog.c_str());

    // Start the child process. 
    if( !CreateProcess(NULL,   // No module name (use command line)
        cstr,           // Command line; _T() only works on string literals, so pass the buffer directly (assumes a non-UNICODE build)
        NULL,           // Process handle not inheritable
        NULL,           // Thread handle not inheritable
        FALSE,          // Set handle inheritance to FALSE
        0,              // No creation flags
        NULL,           // Use parent's environment block
        NULL,           // Use parent's starting directory 
        &si,            // Pointer to STARTUPINFO structure
        &pi )           // Pointer to PROCESS_INFORMATION structure
    ) 
    {
        printf( "CreateProcess failed (%d).\n", GetLastError() );
        system("PAUSE");
        return 0;
    }


    system("PAUSE");

    // Wait until child process exits.
    WaitForSingleObject( pi.hProcess, INFINITE );

    // Close process and thread handles. 
    CloseHandle( pi.hProcess );
    CloseHandle( pi.hThread );

    delete [] cstr;

    return 0;
}

If you want to crawl a webpage, launching Internet Explorer as a separate process is not going to work very well. I also don't recommend attempting to parse the HTML page yourself unless you are prepared for a lot of heartache and hassle. Instead, I recommend that you create an instance of an IWebBrowser2 object, use it to navigate to the webpage, grab the appropriate IHTMLDocument2 object, and iterate through the elements picking out the URLs. It's far easier and is a common approach using components that are already installed on Windows. The example below should get you started and on your way to crawling the web like a proper spider.

#include <comutil.h>    // _variant_t
#include <mshtml.h>     // IHTMLDocument and IHTMLElement
#include <exdisp.h>     // IWebBrowser2
#include <atlbase.h>    // CComPtr
#include <string>
#include <iostream>
#include <vector>

// Make sure we link in the support library!
#pragma comment(lib, "comsuppw.lib")


// Load a webpage
HRESULT LoadWebpage(
    const CComBSTR& webpageURL,
    CComPtr<IWebBrowser2>& browser,
    CComPtr<IHTMLDocument2>& document)
{
    HRESULT hr;
    VARIANT empty;

    VariantInit(&empty);

    // Navigate to the specified webpage
    hr = browser->Navigate(webpageURL, &empty, &empty, &empty, &empty);

    //  Wait for the page load to complete (sleep briefly to avoid a busy spin).
    if(SUCCEEDED(hr))
    {
        READYSTATE state;

        while(SUCCEEDED(hr = browser->get_ReadyState(&state)))
        {
            if(state == READYSTATE_COMPLETE) break;
            Sleep(10);
        }
    }

    // The browser now has a document object. Grab it.
    if(SUCCEEDED(hr))
    {
        CComPtr<IDispatch> dispatch;

        hr = browser->get_Document(&dispatch);
        if(SUCCEEDED(hr) && dispatch != NULL)
        {
            hr = dispatch.QueryInterface<IHTMLDocument2>(&document);
        }
        else
        {
            hr = E_FAIL;
        }
    }

    return hr;
}


void CrawlWebsite(const CComBSTR& webpage, std::vector<std::wstring>& urlList)
{
    HRESULT hr;

    // Create a browser object
    CComPtr<IWebBrowser2> browser;
    hr = CoCreateInstance(
        CLSID_InternetExplorer,
        NULL,
        CLSCTX_SERVER,
        IID_IWebBrowser2,
        reinterpret_cast<void**>(&browser));

    // Grab a web page
    CComPtr<IHTMLDocument2> document;
    if(SUCCEEDED(hr))
    {
        // Make sure these two items are scoped so CoUninitialize doesn't gum
        // us up.
        hr = LoadWebpage(webpage, browser, document);
    }

    // Grab all the anchors!
    if(SUCCEEDED(hr))
    {
        CComPtr<IHTMLElementCollection> urls;
        long count = 0;

        hr = document->get_all(&urls);

        if(SUCCEEDED(hr))
        {
            hr = urls->get_length(&count);
        }

        if(SUCCEEDED(hr))
        {
            for(long i = 0; i < count; i++)
            {
                CComPtr<IDispatch>  element;
                CComPtr<IHTMLAnchorElement> anchor;

                // Get an IDispatch interface for the next option.
                _variant_t index = i;
                hr = urls->item( index, index, &element);
                if(SUCCEEDED(hr))
                {
                    hr = element->QueryInterface(
                        IID_IHTMLAnchorElement, 
                        reinterpret_cast<void **>(&anchor));
                }

                if(SUCCEEDED(hr) && anchor != NULL)
                {
                    CComBSTR    url;
                    hr = anchor->get_href(&url);
                    if(SUCCEEDED(hr) && url != NULL)
                    {
                        urlList.push_back(std::wstring(url));
                    }
                }
            }
        }
    }
}

int main()
{
    HRESULT hr;

    hr = CoInitialize(NULL);
    std::vector<std::wstring>   urls;

    CComBSTR webpage(L"http://cppreference.com");


    CrawlWebsite(webpage, urls);
    for(std::vector<std::wstring>::iterator it = urls.begin();
        it != urls.end();
        ++it)
    {
        std::wcout << "URL: " << *it << std::endl;

    }

    CoUninitialize();

    return 0;
}
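The question also mentions recording all the URLs in a text file, which the crawler above doesn't do yet. A minimal, portable sketch (the function name SaveUrls and the file path are just examples, not part of the original code):

```cpp
#include <fstream>
#include <string>
#include <vector>

// Append each collected URL to a text file, one per line.
// Returns false if the file could not be opened.
bool SaveUrls(const std::vector<std::wstring>& urlList, const char* path)
{
    std::wofstream out(path, std::ios::app);
    if(!out.is_open()) return false;

    for(std::vector<std::wstring>::const_iterator it = urlList.begin();
        it != urlList.end(); ++it)
    {
        out << *it << L'\n';
    }
    return true;
}
```

Calling `SaveUrls(urls, "urls.txt")` after `CrawlWebsite` in `main` would persist the list; opening in append mode means repeated crawls accumulate rather than overwrite.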

To scan and pull data from the websites, you'll want to capture the HTML and iterate through it looking for all character sequences matching a certain pattern. Have you ever used regular expressions? Regular expressions would by far be the best fit here; if you understand them (just look up a tutorial on the basics), you can apply the same pattern-recognition concepts manually in this project.

So what you're looking for is something like http(s)://... It's more complex than that, though, because domain names follow a rather intricate pattern. You'll probably want to use a third-party HTML parser or a regular expression library; it's doable without one, although pretty tedious to program.

Here's a link about regular expressions in C++: http://www.johndcook.com/cpp_regex.html
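As a rough illustration of the regex idea above, here is a minimal sketch using C++11's std::regex (the pattern is a deliberate simplification and the function name ExtractUrls is just an example; it will miss relative links and unusual URL characters):

```cpp
#include <regex>
#include <string>
#include <vector>

// Extract absolute http/https URLs from a block of HTML.
// The match stops at whitespace, quotes, or angle brackets, which is
// good enough to pull hrefs out of simple markup.
std::vector<std::string> ExtractUrls(const std::string& html)
{
    std::vector<std::string> urls;
    std::regex pattern("https?://[^\\s\"'<>]+");

    for(std::sregex_iterator it(html.begin(), html.end(), pattern), end;
        it != end; ++it)
    {
        urls.push_back(it->str());
    }
    return urls;
}
```

For example, feeding it `<a href="http://example.com/a">A</a>` yields the single entry `http://example.com/a`; a real crawler would still need to resolve relative links against the page's base URL.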
