
Scraping and using data using PHP from a website that must be logged on to (Reddit)?

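I want to create a web page that, given two reddit usernames and their passwords, subscribes user2 to all of the subreddits that user1 is subscribed to. So I need to:

  1. Get the subreddits that user1 is subscribed to.
  2. Subscribe user2 to those subreddits.

I have experience with PHP, but I have no experience with scraping (especially when the user has to be logged in), nor with submitting the kind of information needed to "subscribe" a user to a subreddit. Does anyone have any ideas on how to do this?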

Regards,

Tim

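Assuming this doesn't break reddit's terms of service: log in with cURL, and the necessary information can easily be pulled out with a regex. From there, it's a matter of examining how reddit handles subscribing, and then either navigating to the correct URL or posting the form data.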

As long as it doesn't break reddit's terms of service, I'd call this an intermediate-level task.
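
The cURL-plus-regex approach described above could be sketched in PHP roughly as follows. This is only an illustration under stated assumptions: the login URL, the form field names, and the page URL are hypothetical parameters (nothing here is taken from reddit's actual pages or API), and the function names `fetchAsLoggedInUser` and `extractSubredditNames` are made up for the example. The regex assumes subreddit links appear as `href="/r/name"` or `href="https://host/r/name"`.

```php
<?php
// Sketch only: URLs and form field names are hypothetical parameters.
// Inspect the real site (and its terms of service) before using anything like this.

/**
 * Post a login form with cURL, then fetch a page that requires the login.
 * The same handle is reused so the session cookies carry over.
 */
function fetchAsLoggedInUser(string $loginUrl, array $loginFields, string $pageUrl): string
{
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,  // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,  // follow redirects after login
        CURLOPT_COOKIEFILE     => '',    // empty string turns on in-memory cookie handling
        CURLOPT_USERAGENT      => 'subscription-copier-sketch/0.1',
    ]);

    // Step 1: submit the login form.
    curl_setopt($ch, CURLOPT_URL, $loginUrl);
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($loginFields));
    if (curl_exec($ch) === false) {
        throw new RuntimeException('Login request failed: ' . curl_error($ch));
    }

    // Step 2: switch back to GET and request the page with the session cookies.
    curl_setopt($ch, CURLOPT_HTTPGET, true);
    curl_setopt($ch, CURLOPT_URL, $pageUrl);
    $html = curl_exec($ch);
    if ($html === false) {
        throw new RuntimeException('Page request failed: ' . curl_error($ch));
    }

    return $html;
}

/**
 * Pull unique subreddit names out of links such as href="/r/php/".
 * Links deeper than the subreddit root (e.g. /r/php/comments/...) are ignored.
 */
function extractSubredditNames(string $html): array
{
    preg_match_all('~href="(?:https?://[^"/]+)?/r/([A-Za-z0-9_]+)/?"~i', $html, $matches);
    return array_values(array_unique($matches[1]));
}
```

Something like `extractSubredditNames(fetchAsLoggedInUser($loginUrl, $fields, $pageUrl))` would then give the list to work through for the second user, with one form post per subreddit once the subscribe request has been worked out. A regex is fragile against markup changes, so an HTML parser such as PHP's DOMDocument may be a sturdier choice for the extraction step.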

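The open-source product TestPlan is very good at this sort of thing. It lets you use a simple language to log in to the site as one user, get the names of the subreddits, and then log in as the other user to subscribe to them.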

For example, if you just wanted the titles of the top entries, you could use this code:

GotoURL http://www.reddit.com/top/

set %Topics% as response //p[@class='title']
foreach %Topic% in %Topics%
    set %Title% as selectIn %Topic% string(.)
    Notice %Title%
end

Which produces output like this:

00000000-00 GOTOURL http://www.reddit.com/top/
00000001-00 NOTICE LEGAL DVD vs. PIRATED COPY (i.imgur.com)
00000002-00 NOTICE Don't just shorten your URL, make it suspicious and frightening. - ShadyURL (shadyurl.com)
00000003-00 NOTICE HOLY CRAP! IS THAT A ROOM FOR RENT ON MY CRAIGSLIST??!?!? (houston.craigslist.org)
00000004-00 NOTICE Years from now when our children ask us, "What did we do after 9/11?" we shall explain it to them using this... (4gifs.com)
00000005-00 NOTICE TSA forces disabled boy to remove leg braces and walk through the metal detector. "I told him, 'This is overkill. He's 4 years old. I don't think he's a terrorist.' " (philly.com)
00000006-00 NOTICE This picture scares the shit out of me. (imgur.com)
00000007-00 NOTICE Civilization V Announced, in Development at Firaxis Games (hellforge.gameriot.com)
00000008-00 NOTICE I don't know, the price seems a little steep... [pic] (i.imgur.com)
00000009-00 NOTICE Reddit, last week we saw the depth of the ocean scaled relative to human size. I made a figure of the depth of the ocean accurately scaled to the width. It's really very shallow from this perspective. (i.imgur.com)

