How to get all the websites related to a keyword in Windows Forms C#

Here is my process: I have a textbox where the user enters a keyword, for example games. After the user presses Enter, all the websites related to games should be output in the Windows form.

Basically, I tried using the Google Search API with this code:

using Google.Apis.Customsearch.v1;
using Google.Apis.Customsearch.v1.Data;
using Google.Apis.Services;

const string apiKey = "";
const string searchEngineId = "";
const string query = "games";
CustomsearchService customSearchService = new CustomsearchService(new BaseClientService.Initializer() { ApiKey = apiKey });
CseResource.ListRequest listRequest = customSearchService.Cse.List(query);
listRequest.Cx = searchEngineId;
Search search = listRequest.Execute();
// Items is null when the query returns no results
if (search.Items != null)
{
    foreach (var item in search.Items)
    {
        Console.WriteLine("Title : " + item.Title + Environment.NewLine + "Link : " + item.Link + Environment.NewLine + Environment.NewLine);
    }
}

But my problem is that the limits of 100 queries/day and 10 results/query are not acceptable for my use case.

So I decided to use the HttpWebRequest and HttpWebResponse approach. Here is the code, which I found on the internet:

StringBuilder sb = new StringBuilder();

// used on each read operation
byte[] buf = new byte[8192];
string GS = "http://google.com/search?q=sample";
// prepare the web page we will be asking for
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(GS);

// execute the request
HttpWebResponse response = (HttpWebResponse)request.GetResponse();

// we will read data via the response stream
Stream resStream = response.GetResponseStream();
string tempString = null;
int count = 0;
do
{
    // fill the buffer with data
    count = resStream.Read(buf, 0, buf.Length);
    // make sure we read some data
    if (count != 0)
    {
        // translate from bytes to ASCII text
        tempString = Encoding.ASCII.GetString(buf, 0, count);

        // continue building the string
        sb.Append(tempString);
    }
}
while (count > 0);

My problem with this is that it returns the whole HTML. Is it possible to get only the URLs, as with the Google Search API?

That's how it works: you either have to pay for the API or parse the HTML, and the legality of the latter is questionable.

Using an HTML parser with CSS selectors, it is not that much work (the solution is based on this Java tutorial: http://mph-web.de/web-scraping-with-java-top-10-google-search-results/ ). I used Dcsoup ( https://github.com/matarillo/dcsoup , an incomplete Jsoup port) for the example, since I'm used to Jsoup ( https://jsoup.org/apidocs/ ), but there might be other HTML parsers for C# that are better maintained.

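// request the results starting at offset 130 (page 14 at 10 results per page),
// to demonstrate that the limit of results is avoided
int resultStart = 130;
string keyword = "test";
// escape the keyword so that spaces and special characters survive in the query string
string searchUrl = "http://www.google.com/search?q=" + System.Uri.EscapeDataString(keyword) + "&start=" + resultStart;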

System.Net.WebClient webClient = new System.Net.WebClient();
string htmlResult = webClient.DownloadString(searchUrl);

Supremes.Nodes.Document doc = Supremes.Dcsoup.Parse(htmlResult, "http://www.google.com/");

// parse with css selector
foreach (Supremes.Nodes.Element result in doc.Select("h3.r a")) 
{
    string title = result.Text;
    string url = result.Attr("href");

    // do something useful with the search result
    System.Diagnostics.Debug.WriteLine(title + " -> " + url);
}
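
To fetch other result pages, the start offset can be computed from a page number. This is only a sketch: SearchUrlBuilder and Build are hypothetical names, and the page size of 10 is an assumption taken from the example above (page 14 corresponds to start=130), not something Google guarantees.

```csharp
using System;

public static class SearchUrlBuilder
{
    // Assumption: 10 results per page, matching page 14 <-> start=130 in the example above.
    private const int ResultsPerPage = 10;

    // Builds the search URL for a 1-based result page.
    public static string Build(string keyword, int page)
    {
        if (page < 1)
        {
            throw new ArgumentOutOfRangeException(nameof(page), "Pages are numbered from 1.");
        }

        int start = (page - 1) * ResultsPerPage;

        // Escape the keyword so that spaces and special characters survive in the query string.
        return "http://www.google.com/search?q=" + Uri.EscapeDataString(keyword) + "&start=" + start;
    }
}
```

Calling SearchUrlBuilder.Build("test", 14) yields the same URL as the example above, and looping over page numbers gives one URL per result page to download and parse.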

The needed selector h3.r a might change. A more stable alternative might be to parse all elements and retrieve those with an href attribute, or at least to have a built-in check (search for a term with a lot of results, parse the page, and if your selector yields no results, send yourself a notification so that you can repair the selector).
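
As a rough sketch of the "retrieve everything with an href attribute" idea, the following uses only the .NET base class library instead of Dcsoup. LinkExtractor and ExtractLinks are hypothetical names. The handling of "/url?q=" is an assumption about how result links may be wrapped in the returned HTML, so verify it against the page you actually receive. A regular expression is also not a full HTML parser, and a real parser is more robust.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

public static class LinkExtractor
{
    // Matches the quoted href value of an <a ...> tag.
    private static readonly Regex HrefPattern = new Regex(
        @"<a\b[^>]*?\bhref\s*=\s*[""']([^""']+)[""']",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // Returns the distinct absolute http(s) links found in the HTML, in document order.
    public static List<string> ExtractLinks(string html)
    {
        var seen = new HashSet<string>();
        var links = new List<string>();

        foreach (Match match in HrefPattern.Matches(html))
        {
            // Attribute values are HTML-encoded (for example &amp; instead of &).
            string href = WebUtility.HtmlDecode(match.Groups[1].Value);

            // Assumption: result links may be wrapped as "/url?q=<target>&...".
            const string redirectPrefix = "/url?q=";
            if (href.StartsWith(redirectPrefix, StringComparison.Ordinal))
            {
                string rest = href.Substring(redirectPrefix.Length);
                int amp = rest.IndexOf('&');
                href = Uri.UnescapeDataString(amp >= 0 ? rest.Substring(0, amp) : rest);
            }

            // Keep only absolute http/https links; relative links are dropped.
            Uri uri;
            bool isWebLink = Uri.TryCreate(href, UriKind.Absolute, out uri)
                && (uri.Scheme == Uri.UriSchemeHttp || uri.Scheme == Uri.UriSchemeHttps);

            if (isWebLink && seen.Add(href))
            {
                links.Add(href);
            }
        }

        return links;
    }
}
```

The built-in check mentioned above could then be as simple as calling ExtractLinks on the page for a term known to have many results and raising a notification when the returned list is empty.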

See also this answer regarding getting the results for the exact search term: https://stackoverflow.com/a/37268746/1661938
