
How can I fetch the same URL with different query strings with Perl's LWP::UserAgent?

I looked up articles about using LWP, however I am still lost! On this site we find a list of many schools; see the overview page, follow some of the links, and get some result pages:

I want to fetch the sites using LWP::UserAgent, and for the parsing I want to use either HTML::TreeBuilder::XPath or HTML::TokeParser.

At the moment I am musing about choosing the right GET request! I have some issues with LWP::UserAgent. The subsites of the overview can be reached via direct links, but note: each site has content, e.g. the following URLs of the above-mentioned result pages.

As a novice here I cannot post the full URLs, but here you can see how they end:

id=21&extern_eid=709
id=21&extern_eid=789
id=21&extern_eid=1297
id=21&extern_eid=761

There are many different URLs that differ only at the end. The question is: how do I run LWP::UserAgent? I want to fetch and parse all of the ~1000 sites.

Question: does LWP do the job automatically!? Or do I have to set up LWP::UserAgent so that it looks up the different URLs automatically...

Solution: perhaps we have to count up from zero to 10000 with the

extern_eid=709 (counting from zero to 100000) here:

www-db.sn.schule.de/index.php?id=21&extern_eid=709

BTW: here is the relevant documentation for LWP::UserAgent:

REQUEST METHODS: The methods described in this section are used to dispatch requests via the user agent. The following request methods are provided:

$ua->get( $url )
$ua->get( $url, $field_name => $value, ... )

This method will dispatch a GET request on the given $url. Further arguments can be given to initialize the headers of the request. These are given as separate name/value pairs. The return value is a response object. See HTTP::Response for a description of the interface it provides. There will still be a response object returned when LWP can't connect to the server specified in the URL or when other failures in protocol handlers occur.
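As a quick illustration of the quoted method, here is a minimal sketch using one of the result-page URLs from the question (the Accept-Language header is only an example of the optional name/value pairs):

use strict;
use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new( timeout => 10 );

# Dispatch a GET request; extra name/value pairs become request headers.
my $response = $ua->get(
    'http://www-db.sn.schule.de/index.php?id=21&extern_eid=709',
    'Accept-Language' => 'de',
);

# A response object comes back even when the connection fails,
# so check the status before using the content.
if ( $response->is_success ) {
    print $response->decoded_content;
}
else {
    warn "GET failed: ", $response->status_line, "\n";
}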

The question is: how do I use LWP::UserAgent on the above-mentioned site the right way, effectively!?

I look forward to any and all help!

If I understand your question correctly, you are trying to use LWP::UserAgent on the same URL with different query arguments, and you are wondering if LWP::UserAgent provides a way for you to loop through the query arguments?

I don't think LWP::UserAgent has a method for you to do that. However, you can have a loop constructing the URLs and use LWP::UserAgent repeatedly:

for my $id (0 .. 100000)
{
    my $response = $ua->get( $url . "?id=21&extern_eid=" . (709 - $id) );
    # rest of the code
}
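If each page is then handed to HTML::TreeBuilder::XPath, as the question suggests, the loop could look roughly like the sketch below. The base URL and the extern_eid values are taken from the question; the XPath expression is only a placeholder and depends on the actual page markup.

use strict;
use warnings;
use LWP::UserAgent;
use HTML::TreeBuilder::XPath;

my $ua  = LWP::UserAgent->new( timeout => 10 );
my $url = 'http://www-db.sn.schule.de/index.php';

for my $eid ( 709, 789, 1297, 761 ) {          # or a counting loop as above
    my $response = $ua->get("$url?id=21&extern_eid=$eid");
    next unless $response->is_success;

    my $tree = HTML::TreeBuilder::XPath->new_from_content(
        $response->decoded_content
    );

    # '//h1' is a placeholder; replace it with an XPath expression
    # matching the data you actually want from each school page.
    print $_->as_text, "\n" for $tree->findnodes('//h1');

    $tree->delete;    # free the parse tree
}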

Alternatively you can add a request_prepare handler that computes and adds the query arguments before the request is sent out.
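A rough sketch of that handler approach, assuming a simple counter and the URL from the question; request_prepare handlers run just before each request goes out, so the query arguments can be computed there instead of in the loop:

use strict;
use warnings;
use LWP::UserAgent;

my $ua  = LWP::UserAgent->new;
my $eid = 709;    # hypothetical counter; advance it however you need

$ua->add_handler(
    request_prepare => sub {
        my ( $request, $ua, $handler ) = @_;
        # Rewrite the query string of the outgoing request.
        $request->uri->query_form( id => 21, extern_eid => $eid++ );
    }
);

# Each call now goes out with a different extern_eid.
$ua->get('http://www-db.sn.schule.de/index.php') for 1 .. 3;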

You describe following links for the purpose of web scraping. The LWP subclass WWW::Mechanize does this more easily than your current attempt.
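A minimal sketch of that route; the overview URL and the link regex are assumptions based on the query strings shown in the question:

use strict;
use warnings;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new( autocheck => 1 );

# Fetch the overview page, then follow every link whose URL
# carries an extern_eid query argument.
$mech->get('http://www-db.sn.schule.de/index.php?id=21');

for my $link ( $mech->find_all_links( url_regex => qr/extern_eid=\d+/ ) ) {
    my $page = $mech->get( $link->url_abs );
    # Hand $page->decoded_content to HTML::TreeBuilder::XPath here.
    print $link->url_abs, ": ", length( $page->decoded_content ), " bytes\n";
}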
