Asynchronous crawling F#
When crawling web pages I need to be careful not to make too many requests to the same domain; for example, I want to put 1 s between requests. From what I understand, it is the time between requests that matters. So to speed things up I want to use async workflows in F#: the idea is to make requests at 1-second intervals, but avoid blocking while waiting for each response.
open System
open System.IO
open System.Net
open System.Threading

let getHtmlPrimitiveAsyncTimer (uri : System.Uri) (timer : int) =
    async {
        let req = WebRequest.Create(uri) :?> HttpWebRequest
        req.UserAgent <- "Mozilla"
        try
            Thread.Sleep(timer)
            let! resp = req.AsyncGetResponse()
            Console.WriteLine(uri.AbsoluteUri + " got response")
            use stream = resp.GetResponseStream()
            use reader = new StreamReader(stream)
            let html = reader.ReadToEnd()
            return html
        with
        | _ -> return "Bad Link"
    }
Then I do something like:
let uri1 = System.Uri "http://rue89.com"
let timer = 1000
let jobs = [| for i in 1..10 -> getHtmlPrimitiveAsyncTimer uri1 timer |]
jobs
|> Array.mapi (fun i job ->
    Console.WriteLine("Starting job " + string i)
    Async.StartAsTask(job).Result)
Is this alright? I am very unsure about two things:
- Does the Thread.Sleep work for delaying the request?
- Is using StartAsTask a problem?

I am a beginner in F# (and in coding in general, as you may have noticed), and everything involving threads scares me :)
Thanks!!
I think what you want to do is:
- create 10 jobs, numbered 'n', each starting 'n' seconds from now
- run those all in parallel

Approximately like:
let makeAsync uri n = async {
    // create the request
    do! Async.Sleep(n * 1000)
    // AsyncGetResponse etc.
}
let a = [| for i in 1..10 -> makeAsync uri i |]
let results = a |> Async.Parallel |> Async.RunSynchronously
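A fleshed-out version of that sketch might look like the following, reusing the request code from the question; the URL and the "Bad Link" fallback are illustrative, not a definitive implementation:

```fsharp
open System
open System.IO
open System.Net

// Each job waits n seconds without burning a thread (do! Async.Sleep),
// then fetches the page asynchronously.
let makeAsync (uri : Uri) n = async {
    let req = WebRequest.Create(uri) :?> HttpWebRequest
    req.UserAgent <- "Mozilla"
    do! Async.Sleep(n * 1000)          // non-blocking delay
    try
        use! resp = req.AsyncGetResponse()
        use stream = resp.GetResponseStream()
        use reader = new StreamReader(stream)
        return reader.ReadToEnd()
    with _ ->
        return "Bad Link"
}

// Async values are lazy: building this array starts no requests yet.
let jobs = [| for i in 1..10 -> makeAsync (Uri "http://rue89.com") i |]
// To run them: jobs |> Async.Parallel |> Async.RunSynchronously
```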
Note that of course they won't all start exactly now; e.g. if you have a 4-core machine, 4 will start running very soon, but then quickly execute up to the Async.Sleep, at which point the next 4 will run up until their sleeps, and so forth. Then in one second the first async wakes up and posts a request, another second later the 2nd async wakes up, ... so this should work. The 1 s is only approximate, since their timers each start staggered a tiny bit from one another... you may want to buffer it a little, e.g. 1100 ms or something, if the cut-off you need really is exactly one second (network latencies and whatnot probably still leave a bit of this outside your program's control).
Thread.Sleep is suboptimal: it will work ok for a small number of requests, but you're burning a thread, and threads are expensive, so it won't scale to a large number.
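A quick local sketch (no network; the 200 ms delay and job count are just illustrative assumptions) of why `do! Async.Sleep` scales where `Thread.Sleep` does not:

```fsharp
open System.Diagnostics

// 100 concurrent 200 ms waits. With do! Async.Sleep the waits share
// timers and a handful of thread-pool threads; with Thread.Sleep each
// job would pin a thread for the whole duration.
let sleeper = async { do! Async.Sleep 200 }

let sw = Stopwatch.StartNew()
[| for _ in 1..100 -> sleeper |]
|> Async.Parallel
|> Async.RunSynchronously
|> ignore
sw.Stop()

// The sleeps overlap instead of queuing on threads, so total elapsed
// time stays close to one delay, not 100 of them.
printfn "elapsed: %d ms" sw.ElapsedMilliseconds
```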
You don't need StartAsTask unless you want to interoperate with .NET Tasks, or later do a blocking rendezvous with the result via .Result. If you just want these all to run and then block to collect all the results in an array, Async.Parallel will do that fork-join parallelism for you just fine. If they're just going to print results, you can fire-and-forget via Async.Start, which will drop the results on the floor.
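For the fire-and-forget case, a minimal local sketch (the counter and delays are made up for illustration; real jobs would print their page results instead):

```fsharp
open System.Threading

// Async.Start queues the work and returns immediately;
// any result of the async body is discarded.
let mutable completed = 0
for i in 1..5 do
    Async.Start(async {
        do! Async.Sleep 50
        Interlocked.Increment(&completed) |> ignore
        printfn "job %d done" i })

// The caller is not blocked; give the background jobs time to finish
// before the program exits.
Thread.Sleep 1000
printfn "completed: %d" completed
```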
(An alternative strategy is to use an agent as a throttle. Post all the http requests to a single agent, where the agent is logically single-threaded and sits in a loop, doing Async.Sleep for 1 s and then handling the next request. That's a nice way to make a general-purpose throttle... may be blog-worthy for me, come to think of it.)
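That agent idea could be sketched with MailboxProcessor along these lines; the message type, the 100 ms delay, and the fake no-network handler are assumptions to keep the pattern visible (the answer suggests 1 s between real requests):

```fsharp
open System

// A general-purpose throttle: all work is posted to one agent, which is
// logically single-threaded and handles one item per delay interval.
type ThrottleMsg = Fetch of Uri * AsyncReplyChannel<string>

let throttle (delayMs : int) (handler : Uri -> string) =
    MailboxProcessor.Start(fun inbox -> async {
        while true do
            let! (Fetch (uri, reply)) = inbox.Receive()
            reply.Reply(handler uri)
            do! Async.Sleep delayMs   // wait before handling the next request
    })

// Usage with a fake handler instead of a real HTTP fetch:
let agent = throttle 100 (fun uri -> "fetched " + uri.AbsoluteUri)
let results =
    [ for i in 1..3 ->
        agent.PostAndReply(fun ch ->
            Fetch(Uri(sprintf "http://example.com/%d" i), ch)) ]
```

Because the agent loop is sequential, requests are spaced at least `delayMs` apart no matter how many callers post to it at once.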