How can a bot get the contents of subsequent pages in a category listing in WordPress?

I'm writing a bot to automatically download pages from my WordPress blog. The bot gets most of the pages without a problem. For example, it can easily get the first page of the article listing for a given tag: http://example.com/myblog/index.php/archives/tag/mytag. However, for some reason it can't get the subsequent pages, such as http://example.com/myblog/index.php/archives/tag/mytag/page/2.

I've tried to figure out what is going on, and here's what I found: while the server answers most requests normally, it answers requests for these subsequent pages with a 301 permanent redirect. Peculiarly, the Location header is set to the exact same URL as the request! Basically, the server tells me to redirect my request for the page http://example.com/myblog/index.php/archives/tag/mytag/page/2 to... the very same page :P

When I access the page from a browser, I get it without a problem. I thought the browser might be sending some headers (including cookies) that my bot doesn't send, so I copied the headers (including the cookies) from my browser's web console into the bot's request, but the behaviour didn't change.
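For anyone who wants to reproduce this check, here is a minimal Python sketch of it. The function names `probe` and `is_self_redirect` are made up for this illustration and are not part of the original bot. `http.client` never follows redirects on its own, so the raw status and Location header stay visible, and the `headers` argument can carry headers copied from the browser. Comparing the two URLs as strings, rather than by eye, would also show whether they differ in some small way.

```python
import http.client
from urllib.parse import urlsplit, urljoin

REDIRECT_CODES = (301, 302, 303, 307, 308)

def is_self_redirect(request_url, status, location):
    """Return True when a redirect response points back at the requested URL."""
    if status not in REDIRECT_CODES or not location:
        return False
    # Location may be relative, so resolve it against the request URL first.
    return urljoin(request_url, location) == request_url

def probe(url, headers=None):
    """Send one GET without following redirects; return (status, Location)."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=10)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=10)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("GET", path, headers=headers or {})
        response = conn.getresponse()
        return response.status, response.getheader("Location")
    finally:
        conn.close()
```

Typical use would be `status, location = probe(url, headers=browser_headers)` followed by `is_self_redirect(url, status, location)`, where `browser_headers` is a dict of whatever was copied from the web console.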

I would appreciate any suggestions about what might be causing this strange behaviour, what I can do to understand it better, and of course what I can do to fetch those pages automatically, just like I fetch their brethren.

Thanks!

It seems this post hasn't generated much public interest. However, in case somebody ever runs into the same problem and finds this post, here's the solution I used. Important note: I still don't understand the behaviour I witnessed, and would appreciate it if somebody could explain it.

The solution I've found is basically to use the URL http://example.com/myblog/archives/tag/mytag?paged=2 instead of http://example.com/myblog/index.php/archives/tag/mytag/page/2. Funnily enough, this URL gets redirected to the original one when opened in a browser! But when the bot requested it, it got the page without any redirection. (So I managed to do what I wanted to do, but I've got no idea what happened there, why there was a problem in the first place, or why this solution worked: for one URL the bot gets an infinite redirect while the browser just gets the page, and for the other the browser gets redirected [finitely] while the bot gets the page. I have yet to figure this one out...)
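A rough Python sketch of that workaround follows. The helper names `paged_url` and `fetch_listing_pages` are invented for this example. It also assumes that a request past the last page fails with an HTTP error, which I have not verified, so `max_pages` is there as a safety cap.

```python
from urllib.error import HTTPError
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit
from urllib.request import urlopen

def paged_url(listing_url, page):
    """Build the ?paged=N form of a listing URL; page 1 is the plain URL."""
    parts = urlsplit(listing_url)
    # Drop any existing "paged" parameter but keep the rest of the query.
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k != "paged"]
    if page > 1:
        query.append(("paged", str(page)))
    return urlunsplit(parts._replace(query=urlencode(query)))

def fetch_listing_pages(listing_url, max_pages=50):
    """Yield (page_number, body_bytes) for each page until a request fails."""
    for page in range(1, max_pages + 1):
        try:
            with urlopen(paged_url(listing_url, page), timeout=10) as response:
                yield page, response.read()
        except HTTPError:
            # Assumption: an HTTP error here means there are no more pages.
            break
```

For the tag above, `paged_url("http://example.com/myblog/archives/tag/mytag", 2)` gives the `?paged=2` URL that worked for the bot.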


 