
How to cancel Task await after a timeout period

I am using this method to instantiate a web browser programmatically, navigate to a URL, and return a result when the document has completed.

How can I stop the Task and have GetFinalUrl() return null if the document takes more than 5 seconds to load?

I have seen many examples using a TaskFactory, but I haven't been able to apply them to this code.

 private Uri GetFinalUrl(PortalMerchant portalMerchant)
    {
        SetBrowserFeatureControl();
        Uri finalUri = null;
        if (string.IsNullOrEmpty(portalMerchant.Url))
        {
            return null;
        }
        Uri trackingUrl = new Uri(portalMerchant.Url);
        var task = MessageLoopWorker.Run(DoWorkAsync, trackingUrl);
        task.Wait();
        if (!String.IsNullOrEmpty(task.Result.ToString()))
        {
            return new Uri(task.Result.ToString());
        }
        else
        {
            throw new Exception("Parsing Failed");
        }
    }

// by Noseratio - http://stackoverflow.com/users/1768303/noseratio    

static async Task<object> DoWorkAsync(object[] args)
{
    _threadCount++;
    Console.WriteLine("Thread count:" + _threadCount);
    Uri retVal = null;
    var wb = new WebBrowser();
    wb.ScriptErrorsSuppressed = true;

    TaskCompletionSource<bool> tcs = null;
    WebBrowserDocumentCompletedEventHandler documentCompletedHandler = (s, e) => tcs.TrySetResult(true);

    foreach (var url in args)
    {
        tcs = new TaskCompletionSource<bool>();
        wb.DocumentCompleted += documentCompletedHandler;
        try
        {
            wb.Navigate(url.ToString());
            await tcs.Task;
        }
        finally
        {
            wb.DocumentCompleted -= documentCompletedHandler;
        }

        retVal = wb.Url;
        wb.Dispose();
        return retVal;
    }
    return null;
}

public static class MessageLoopWorker
{
    #region Public static methods

    public static async Task<object> Run(Func<object[], Task<object>> worker, params object[] args)
    {
        var tcs = new TaskCompletionSource<object>();

        var thread = new Thread(() =>
        {
            EventHandler idleHandler = null;

            idleHandler = async (s, e) =>
            {
                // handle Application.Idle just once
                Application.Idle -= idleHandler;

                // return to the message loop
                await Task.Yield();

                // and continue asynchronously
                // propagate the result or exception
                try
                {
                    var result = await worker(args);
                    tcs.SetResult(result);
                }
                catch (Exception ex)
                {
                    tcs.SetException(ex);
                }

                // signal to exit the message loop
                // Application.Run will exit at this point
                Application.ExitThread();
            };

            // handle Application.Idle just once
            // to make sure we're inside the message loop
            // and SynchronizationContext has been correctly installed
            Application.Idle += idleHandler;
            Application.Run();
        });

        // set STA model for the new thread
        thread.SetApartmentState(ApartmentState.STA);

        // start the thread and await for the task
        thread.Start();
        try
        {
            return await tcs.Task;
        }
        finally
        {
            thread.Join();
        }
    }
    #endregion
}

Updated: the latest version of the WebBrowser-based console web scraper can be found on GitHub.

Updated: adding a pool of WebBrowser objects for multiple parallel downloads.

Do you have an example of how to do this in a console app, by any chance? Also, I don't think webBrowser can be a class variable, because I am running the whole thing in a parallel for-each loop, iterating over thousands of URLs.

Below is an implementation of a more or less generic WebBrowser-based web scraper, which works as a console application. It's a consolidation of some of my previous WebBrowser-related efforts, including the code referenced in the question:

A few points:

  • The reusable MessageLoopApartment class is used to start and run a WinForms STA thread with its own message pump. It can be used from a console application, as below. This class exposes a TPL task scheduler (obtained via TaskScheduler.FromCurrentSynchronizationContext) and a set of Task.Factory.StartNew wrappers that use this task scheduler.

  • This makes async/await a great tool for running WebBrowser navigation tasks on that separate STA thread. This way, a WebBrowser object gets created, navigated and destroyed on that thread. That said, MessageLoopApartment is not tied to WebBrowser specifically.

  • It's important to enable HTML5 rendering using Browser Feature Control, as otherwise the WebBrowser object runs in IE7 emulation mode by default. That's what SetFeatureBrowserEmulation does below.

  • It may not always be possible to determine with 100% certainty when a web page has finished rendering. Some pages are quite complex and use continuous AJAX updates. Yet we can get quite close by handling the DocumentCompleted event first, then polling the page's current HTML snapshot for changes and checking the WebBrowser.IsBusy property. That's what NavigateAsync does below.

  • Time-out logic is layered on top of the above, in case the page rendering never ends (note CancellationTokenSource and CreateLinkedTokenSource).

using Microsoft.Win32;
using System;
using System.Threading;
using System.Threading.Tasks;
using System.Windows.Forms;

namespace Console_22239357
{
    class Program
    {
        // by Noseratio - https://stackoverflow.com/a/22262976/1768303

        // main logic
        static async Task ScrapeSitesAsync(string[] urls, CancellationToken token)
        {
            using (var apartment = new MessageLoopApartment())
            {
                // create WebBrowser inside MessageLoopApartment
                var webBrowser = apartment.Invoke(() => new WebBrowser());
                try
                {
                    foreach (var url in urls)
                    {
                        Console.WriteLine("URL:\n" + url);

                        // cancel in 30s or when the main token is signalled
                        var navigationCts = CancellationTokenSource.CreateLinkedTokenSource(token);
                        navigationCts.CancelAfter((int)TimeSpan.FromSeconds(30).TotalMilliseconds);
                        var navigationToken = navigationCts.Token;

                        // run the navigation task inside MessageLoopApartment
                        string html = await apartment.Run(() =>
                            webBrowser.NavigateAsync(url, navigationToken), navigationToken);

                        Console.WriteLine("HTML:\n" + html);
                    }
                }
                finally
                {
                    // dispose of WebBrowser inside MessageLoopApartment
                    apartment.Invoke(() => webBrowser.Dispose());
                }
            }
        }

        // entry point
        static void Main(string[] args)
        {
            try
            {
                WebBrowserExt.SetFeatureBrowserEmulation(); // enable HTML5

                var cts = new CancellationTokenSource((int)TimeSpan.FromMinutes(3).TotalMilliseconds);

                var task = ScrapeSitesAsync(
                    new[] { "http://example.com", "http://example.org", "http://example.net" },
                    cts.Token);

                task.Wait();

                Console.WriteLine("Press Enter to exit...");
                Console.ReadLine();
            }
            catch (Exception ex)
            {
                while (ex is AggregateException && ex.InnerException != null)
                    ex = ex.InnerException;
                Console.WriteLine(ex.Message);
                Environment.Exit(-1);
            }
        }
    }

    /// <summary>
    /// WebBrowserExt - WebBrowser extensions
    /// by Noseratio - https://stackoverflow.com/a/22262976/1768303
    /// </summary>
    public static class WebBrowserExt
    {
        const int POLL_DELAY = 500;

        // navigate and download 
        public static async Task<string> NavigateAsync(this WebBrowser webBrowser, string url, CancellationToken token)
        {
            // navigate and await DocumentCompleted
            var tcs = new TaskCompletionSource<bool>();
            WebBrowserDocumentCompletedEventHandler handler = (s, arg) =>
                tcs.TrySetResult(true);

            using (token.Register(() => tcs.TrySetCanceled(), useSynchronizationContext: true))
            {
                webBrowser.DocumentCompleted += handler;
                try
                {
                    webBrowser.Navigate(url);
                    await tcs.Task; // wait for DocumentCompleted
                }
                finally
                {
                    webBrowser.DocumentCompleted -= handler;
                }
            }

            // get the root element
            var documentElement = webBrowser.Document.GetElementsByTagName("html")[0];

            // poll the current HTML for changes asynchronously
            var html = documentElement.OuterHtml;
            while (true)
            {
                // wait asynchronously, this will throw if cancellation requested
                await Task.Delay(POLL_DELAY, token);

                // continue polling if the WebBrowser is still busy
                if (webBrowser.IsBusy)
                    continue;

                var htmlNow = documentElement.OuterHtml;
                if (html == htmlNow)
                    break; // no changes detected, end the poll loop

                html = htmlNow;
            }

            // consider the page fully rendered 
            token.ThrowIfCancellationRequested();
            return html;
        }

        // enable HTML5 (assuming we're running IE10+)
        // more info: https://stackoverflow.com/a/18333982/1768303
        public static void SetFeatureBrowserEmulation()
        {
            if (System.ComponentModel.LicenseManager.UsageMode != System.ComponentModel.LicenseUsageMode.Runtime)
                return;
            var appName = System.IO.Path.GetFileName(System.Diagnostics.Process.GetCurrentProcess().MainModule.FileName);
            Registry.SetValue(@"HKEY_CURRENT_USER\Software\Microsoft\Internet Explorer\Main\FeatureControl\FEATURE_BROWSER_EMULATION",
                appName, 10000, RegistryValueKind.DWord);
        }
    }

    /// <summary>
    /// MessageLoopApartment
    /// STA thread with message pump for serial execution of tasks
    /// by Noseratio - https://stackoverflow.com/a/22262976/1768303
    /// </summary>
    public class MessageLoopApartment : IDisposable
    {
        Thread _thread; // the STA thread

        TaskScheduler _taskScheduler; // the STA thread's task scheduler

        public TaskScheduler TaskScheduler { get { return _taskScheduler; } }

        /// <summary>MessageLoopApartment constructor</summary>
        public MessageLoopApartment()
        {
            var tcs = new TaskCompletionSource<TaskScheduler>();

            // start an STA thread and get a task scheduler
            _thread = new Thread(startArg =>
            {
                EventHandler idleHandler = null;

                idleHandler = (s, e) =>
                {
                    // handle Application.Idle just once
                    Application.Idle -= idleHandler;
                    // return the task scheduler
                    tcs.SetResult(TaskScheduler.FromCurrentSynchronizationContext());
                };

                // handle Application.Idle just once
                // to make sure we're inside the message loop
                // and SynchronizationContext has been correctly installed
                Application.Idle += idleHandler;
                Application.Run();
            });

            _thread.SetApartmentState(ApartmentState.STA);
            _thread.IsBackground = true;
            _thread.Start();
            _taskScheduler = tcs.Task.Result;
        }

        /// <summary>shutdown the STA thread</summary>
        public void Dispose()
        {
            if (_taskScheduler != null)
            {
                var taskScheduler = _taskScheduler;
                _taskScheduler = null;

                // execute Application.ExitThread() on the STA thread
                Task.Factory.StartNew(
                    () => Application.ExitThread(),
                    CancellationToken.None,
                    TaskCreationOptions.None,
                    taskScheduler).Wait();

                _thread.Join();
                _thread = null;
            }
        }

        /// <summary>Task.Factory.StartNew wrappers</summary>
        public void Invoke(Action action)
        {
            Task.Factory.StartNew(action,
                CancellationToken.None, TaskCreationOptions.None, _taskScheduler).Wait();
        }

        public TResult Invoke<TResult>(Func<TResult> action)
        {
            return Task.Factory.StartNew(action,
                CancellationToken.None, TaskCreationOptions.None, _taskScheduler).Result;
        }

        public Task Run(Action action, CancellationToken token)
        {
            return Task.Factory.StartNew(action, token, TaskCreationOptions.None, _taskScheduler);
        }

        public Task<TResult> Run<TResult>(Func<TResult> action, CancellationToken token)
        {
            return Task.Factory.StartNew(action, token, TaskCreationOptions.None, _taskScheduler);
        }

        public Task Run(Func<Task> action, CancellationToken token)
        {
            return Task.Factory.StartNew(action, token, TaskCreationOptions.None, _taskScheduler).Unwrap();
        }

        public Task<TResult> Run<TResult>(Func<Task<TResult>> action, CancellationToken token)
        {
            return Task.Factory.StartNew(action, token, TaskCreationOptions.None, _taskScheduler).Unwrap();
        }
    }
}
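The per-navigation time-out used in ScrapeSitesAsync can also be looked at in isolation. The following is only a minimal sketch, not part of the code above: TimeoutSketch and RunWithTimeoutAsync are hypothetical names, and the operation is assumed to be any delegate that accepts a CancellationToken and honours it. It links a per-operation CancellationTokenSource to an outer token and arms it with CancelAfter, so the operation is cancelled by whichever of the two fires first.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper, for illustration only.
public static class TimeoutSketch
{
    // Runs 'operation' with a token that is cancelled either when 'outerToken'
    // is signalled or when 'timeout' elapses, whichever happens first.
    public static async Task<T> RunWithTimeoutAsync<T>(
        Func<CancellationToken, Task<T>> operation,
        TimeSpan timeout,
        CancellationToken outerToken)
    {
        // the linked source is disposed as soon as the operation finishes
        using (var cts = CancellationTokenSource.CreateLinkedTokenSource(outerToken))
        {
            cts.CancelAfter(timeout);
            return await operation(cts.Token);
        }
    }
}
```

If the operation observes the token, the time-out surfaces to the caller as an OperationCanceledException, which can be caught and mapped to null if that is the desired result.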

I suspect running a processing loop on another thread will not work out well, since WebBrowser is a UI component that hosts an ActiveX control.

When you're writing TAP-over-EAP wrappers, I recommend using extension methods to keep the code clean:

public static Task<string> NavigateAsync(this WebBrowser @this, string url)
{
  var tcs = new TaskCompletionSource<string>();
  WebBrowserDocumentCompletedEventHandler subscription = null;
  subscription = (_, args) =>
  {
    @this.DocumentCompleted -= subscription;
    tcs.TrySetResult(args.Url.ToString());
  };
  @this.DocumentCompleted += subscription;
  @this.Navigate(url);
  return tcs.Task;
}
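The same wrapping pattern applies to any component that reports completion through a one-shot event. As a rough sketch that can run without WinForms, the Downloader class below is a hypothetical stand-in for an event-based component such as WebBrowser; its Completed event, Start method and the StartAsync extension are all invented for illustration.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical event-based component, standing in for something like WebBrowser.
public sealed class Downloader
{
    public event Action<string> Completed;

    // Starts the work and raises Completed from a thread-pool thread.
    public void Start(string input)
    {
        Task.Run(() =>
        {
            var handler = Completed;
            if (handler != null)
                handler("done: " + input);
        });
    }
}

public static class DownloaderExtensions
{
    // Wraps the event in a Task: subscribe, start, complete the task
    // on the first event and unsubscribe so the handler fires only once.
    public static Task<string> StartAsync(this Downloader source, string input)
    {
        var tcs = new TaskCompletionSource<string>();
        Action<string> subscription = null;
        subscription = result =>
        {
            source.Completed -= subscription;
            tcs.TrySetResult(result);
        };
        source.Completed += subscription; // subscribe before starting
        source.Start(input);
        return tcs.Task;
    }
}
```

Subscribing before calling Start matters here, since otherwise the event could fire before the handler is attached.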

Now your code can easily apply a timeout:

async Task<string> GetUrlAsync(string url)
{
  using (var wb = new WebBrowser())
  {
    var navigate = wb.NavigateAsync(url);
    var timeout = Task.Delay(TimeSpan.FromSeconds(5));
    var completed = await Task.WhenAny(navigate, timeout);
    if (completed == navigate)
      return await navigate;
    return null;
  }
}

which can be consumed as such:

private async Task<Uri> GetFinalUrlAsync(PortalMerchant portalMerchant)
{
  SetBrowserFeatureControl();
  if (string.IsNullOrEmpty(portalMerchant.Url))
    return null;
  var result = await GetUrlAsync(portalMerchant.Url);
  if (!String.IsNullOrEmpty(result))
    return new Uri(result);
  throw new Exception("Parsing Failed");
}
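The Task.WhenAny approach can be factored into a small reusable helper. This is only a sketch under stated assumptions: TaskTimeoutExtensions and OrDefaultAfter are hypothetical names, not an existing API. Note that a timeout of this kind does not cancel the original task; it only stops waiting for it.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper, for illustration only.
public static class TaskTimeoutExtensions
{
    // Returns the task's result, or 'fallback' if 'timeout' elapses first.
    // The original task is not cancelled; it simply stops being awaited.
    public static async Task<T> OrDefaultAfter<T>(
        this Task<T> task, TimeSpan timeout, T fallback = default(T))
    {
        using (var cts = new CancellationTokenSource())
        {
            var delay = Task.Delay(timeout, cts.Token);
            var completed = await Task.WhenAny(task, delay);
            if (completed != task)
                return fallback;

            cts.Cancel();      // stop the timer, it is no longer needed
            return await task; // propagate the result or the exception
        }
    }
}
```

With such a helper, the body of a method like GetUrlAsync could reduce to awaiting wb.NavigateAsync(url).OrDefaultAfter(TimeSpan.FromSeconds(5)).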

I'm trying to build on Noseratio's solution while also following Stephen Cleary's advice.

Here is my updated code, which adds Noseratio's AJAX polling technique to Stephen's code.

First part: the Task NavigateAsync advised by Stephen

public static Task<string> NavigateAsync(this WebBrowser @this, string url)
{
  var tcs = new TaskCompletionSource<string>();
  WebBrowserDocumentCompletedEventHandler subscription = null;
  subscription = (_, args) =>
  {
    @this.DocumentCompleted -= subscription;
    tcs.TrySetResult(args.Url.ToString());
  };
  @this.DocumentCompleted += subscription;
  @this.Navigate(url);
  return tcs.Task;
}

Second part: a new Task NavAjaxAsync to run the AJAX polling (based on Noseratio's code)

public static async Task<string> NavAjaxAsync(this WebBrowser @this)
{
  // get the root element
  var documentElement = @this.Document.GetElementsByTagName("html")[0];

  // poll the current HTML for changes asynchronously
  var html = documentElement.OuterHtml;

  while (true)
  {
    // wait asynchronously
    await Task.Delay(POLL_DELAY);

    // continue polling if the WebBrowser is still busy
    if (@this.IsBusy)
      continue;

    var htmlNow = documentElement.OuterHtml;
    if (html == htmlNow)
      break; // no changes detected, end the poll loop

    html = htmlNow;
  }

  return @this.Document.Url.ToString();
}
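One thing to keep in mind with a polling loop like the one above is that it takes no CancellationToken, so a caller that gives up after a timeout has no way to stop it. A possible variation, shown here only as a sketch, passes a token into Task.Delay. PollingSketch and PollUntilStableAsync are hypothetical names, and the snapshot delegate is an assumed stand-in for reading documentElement.OuterHtml.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper, for illustration only.
public static class PollingSketch
{
    // Polls 'snapshot' every 'pollDelay' until two consecutive reads are equal.
    // Task.Delay throws OperationCanceledException once 'token' is cancelled,
    // which ends the loop.
    public static async Task<string> PollUntilStableAsync(
        Func<string> snapshot, TimeSpan pollDelay, CancellationToken token)
    {
        var previous = snapshot();
        while (true)
        {
            await Task.Delay(pollDelay, token);

            var current = snapshot();
            if (current == previous)
                return current; // no changes detected, end the poll loop

            previous = current;
        }
    }
}
```

The token could come from a CancellationTokenSource armed with CancelAfter, so that the same timeout that ends the wait also ends the polling.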

Third part: a new Task NavAndAjaxAsync that performs the navigation and then the AJAX polling

public static async Task<string> NavAndAjaxAsync(this WebBrowser @this, string url)
{
  await @this.NavigateAsync(url);
  // return the final URL so that GetUrlAsync can "return await navigate"
  return await @this.NavAjaxAsync();
}

Fourth and last part: Stephen's Task GetUrlAsync, updated to use Noseratio's AJAX code

async Task<string> GetUrlAsync(string url)
{
  using (var wb = new WebBrowser())
  {
    var navigate = wb.NavAndAjaxAsync(url);
    var timeout = Task.Delay(TimeSpan.FromSeconds(5));
    var completed = await Task.WhenAny(navigate, timeout);
    if (completed == navigate)
      return await navigate;
    return null;
  }
}

I'd like to know if this is the right approach.
