
How can a bot get the contents of subsequent pages in a category listing in WordPress?

Original post: 2012-08-05 13:49:44 4 1 wordpress/ http/ bots

I'm writing a bot to automatically download pages from my WordPress blog. The bot gets most of the pages without a problem. For example, it can easily get the first page of the article listing for a given tag: http://example.com/myblog/index.php/archives/tag/mytag . However, for some reason it can't get the subsequent pages, like http://example.com/myblog/index.php/archives/tag/mytag/page/2 .

I've tried to figure out what was going on, and here's what I found: while the server answers most requests normally, it answers these requests with a 301 permanent redirect. Peculiarly, the Location header is set to the exact same URL as the request! Basically, the server tells me to redirect my request for the page http://example.com/myblog/index.php/archives/tag/mytag/page/2 to... the very same page :P
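One way to confirm this kind of self-redirect from a script is to issue a single request without following redirects and compare the Location header against the requested URL. The following is only a minimal sketch using Python's standard library; the names `is_self_redirect` and `fetch_once` are made up for illustration, and the bot in question may well be written in a different language.

```python
import http.client
from urllib.parse import urljoin, urlsplit

REDIRECT_CODES = {301, 302, 303, 307, 308}


def is_self_redirect(request_url, status, location):
    """Return True if a redirect response points back at the requested URL."""
    if status not in REDIRECT_CODES or not location:
        return False
    # Location may be relative, so resolve it against the request URL first.
    return urljoin(request_url, location) == request_url


def fetch_once(url, headers=None):
    """Send one GET and return (status, Location header or None).

    http.client never follows redirects on its own, so the raw 301 is visible.
    """
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=10)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=10)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("GET", path, headers=headers or {})
        resp = conn.getresponse()
        return resp.status, resp.getheader("Location")
    finally:
        conn.close()


# Hypothetical usage against the URL from the question:
#   url = "http://example.com/myblog/index.php/archives/tag/mytag/page/2"
#   status, location = fetch_once(url)
#   print(status, location, is_self_redirect(url, status, location))
```

Comparing the two strings exactly is deliberate: a difference as small as a trailing slash or a differently encoded character would mean the redirect is not really pointing at the same URL, which is worth ruling out before assuming a loop.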

When I access the page from the browser, I get it without a problem. I thought maybe the browser sends some headers (including cookies) that my bot doesn't send, so I copied the headers (including the cookies) from my browser's web console, but the behaviour didn't change.

I would appreciate any suggestions regarding what might be causing this strange behaviour, what I can do to better understand what's going on, and of course what I can do to fetch those pages automatically, just like I fetch their brethren.

Thanks!

1 answer

It seems this post hasn't generated much public interest. However, in case somebody ever runs into the same problem and finds this post, here's the solution I used. Important note: I still don't understand the behaviour I witnessed, and would appreciate it if somebody could explain it.

So the solution I've found is basically to use the URL http://example.com/myblog/archives/tag/mytag?paged=2 instead of http://example.com/myblog/index.php/archives/tag/mytag/page/2 . Funnily enough, this URL gets redirected to the original one when opened in a browser! But when the bot requested it, it got the page without redirection or anything. (So I managed to do what I wanted to do, but I've got no idea what happened there, why there was a problem in the first place, and why this solution worked: for one URL the bot gets infinite redirection and the browser just gets the page, while for the other the browser gets redirected [finitely] and the bot gets the page. I am yet to figure this one out...)
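For anyone who wants to script this workaround, here is a rough sketch of building the `?paged=N` form of a listing URL with Python's standard library. The helper name `paged_url` is hypothetical, and the sketch assumes the listing URL is given without the `index.php/` segment, as in the working URL above.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def paged_url(listing_url, page):
    """Return listing_url with its query string set to request the given page.

    Any existing 'paged' parameter is replaced; other parameters are kept.
    """
    parts = urlsplit(listing_url)
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "paged"
    ]
    query.append(("paged", str(page)))
    return urlunsplit(parts._replace(query=urlencode(query)))


# Build the URLs for pages 2 to 4 of the tag listing from the question.
base = "http://example.com/myblog/archives/tag/mytag"
urls = [paged_url(base, n) for n in range(2, 5)]
```

The bot can then request each URL in `urls` the same way it requests any other page. Whether the query-string form avoids the redirect on other WordPress setups is not something this sketch can promise; it only mirrors what worked here.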