
Incrementally process large XML file over HTTPS?

I've got to download, process, and store an 8 GB XML file from a secure web server. I could download the file using the WebRequest class, but this will take a VERY long time. Also, I know that the file is structured in such a way that it suits processing in discrete chunks.

How can I 'stream' this file so that I get only bite-size pieces that I can work on, without having to fetch the whole stream at once?

Edit

I forgot to mention: we are hosted on Azure. An idea that comes to mind is to provision a worker role that just downloads large files and can take as long as it wants. How feasible would that be?

8 GB is a large workload. To protect myself from rework and to scale effectively, I would decouple the XML file download from its processing.

While downloading as a stream, I would write some sort of stream identifier to persistent storage and schedule each atomic unit of work by placing a message with its relevant data on a queue. This would allow recovery if the download fails for any reason, or if a unit of work fails and/or interferes with the download.
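A minimal sketch of that decoupling, under loud assumptions: an in-memory `BlockingCollection<string>` stands in for a durable queue, and the `saveCheckpoint` callback stands in for a write to persistent storage. The names `DecoupledPipeline`, `Produce`, and `Consume` are hypothetical and not part of any Azure API; a real deployment would swap both stand-ins for durable services.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;

public static class DecoupledPipeline
{
    // Download side: enqueue each unit of work, then record how many units
    // have been handed off so a restarted download knows where to resume.
    public static void Produce(
        IEnumerable<string> units,
        BlockingCollection<string> queue,   // stand-in for a durable queue
        Action<long> saveCheckpoint)        // stand-in for persistent storage
    {
        long handedOff = 0;
        foreach (string unit in units)
        {
            queue.Add(unit);
            saveCheckpoint(++handedOff);
        }
        queue.CompleteAdding(); // tells consumers that no more work is coming
    }

    // Processing side: a failing unit is counted and skipped, so it cannot
    // stall the download or the remaining units. Returns the failure count.
    public static int Consume(BlockingCollection<string> queue, Action<string> process)
    {
        int failures = 0;
        foreach (string unit in queue.GetConsumingEnumerable())
        {
            try
            {
                process(unit);
            }
            catch (Exception)
            {
                failures++; // a real system would re-queue or dead-letter the unit
            }
        }
        return failures;
    }
}
```

On restart, the producer would read the last saved checkpoint and skip that many units before enqueuing again. Because a unit is enqueued before its checkpoint is saved, a crash between the two steps can deliver that unit twice, so processing should be idempotent.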

I'm using HttpWebRequest: BeginGetResponse, then GetResponseStream.

You can then read the stream in chunks as it trickles in, via stream.BeginRead.

Here's a much-too-complicated example: http://stuff.seans.com/2009/01/05/using-httpwebrequest-for-asynchronous-downloads/
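For comparison, a shorter sketch of the same chunked read. It swaps the Begin/End callback pattern for the task-based `GetResponseAsync` and `ReadAsync` (available from .NET 4.5), which is a substitution rather than what this answer describes. The class name, method names, and the 80 KB buffer size are assumptions chosen for illustration.

```csharp
using System;
using System.IO;
using System.Net;
using System.Threading.Tasks;

public static class ChunkedDownload
{
    // Reads the stream to the end, passing each chunk to onChunk as it arrives.
    // The buffer is reused between calls, so onChunk must copy any bytes it keeps.
    public static async Task<long> ReadInChunksAsync(
        Stream source, int chunkSize, Action<byte[], int> onChunk)
    {
        var buffer = new byte[chunkSize];
        long total = 0;
        int read;
        while ((read = await source.ReadAsync(buffer, 0, buffer.Length)) > 0)
        {
            onChunk(buffer, read); // only the first 'read' bytes are valid
            total += read;
        }
        return total;
    }

    // Opens the URL and streams the body without buffering the whole file.
    public static async Task<long> DownloadAsync(string url, Action<byte[], int> onChunk)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        using (WebResponse response = await request.GetResponseAsync())
        using (Stream body = response.GetResponseStream())
        {
            return await ReadInChunksAsync(body, 81920, onChunk);
        }
    }
}
```

Chunks read this way end at arbitrary byte offsets, not at XML element boundaries, so raw chunks still have to be fed to an XML parser before they can be treated as units of work.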

If you need to process the file sequentially, just open an XmlReader on the response stream and read the data as needed.
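A minimal sketch of that approach, assuming the file is a long run of repeated elements under one root. The element name passed in (for example `record`) and the names `XmlStreaming`, `ReadElements`, and `ReadElementsFromUrl` are hypothetical; `XmlReader.Create`, `XNode.ReadFrom`, and `GetResponseStream` are the real framework calls.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Net;
using System.Xml;
using System.Xml.Linq;

public static class XmlStreaming
{
    // Yields each matching element as soon as it has been read from the stream.
    // Only one element is held in memory at a time.
    public static IEnumerable<XElement> ReadElements(Stream input, string elementName)
    {
        var settings = new XmlReaderSettings { IgnoreWhitespace = true, IgnoreComments = true };
        using (XmlReader reader = XmlReader.Create(input, settings))
        {
            reader.MoveToContent();
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element && reader.LocalName == elementName)
                {
                    // ReadFrom consumes the whole element and leaves the reader on
                    // the node after it, so no extra Read() call here.
                    yield return (XElement)XNode.ReadFrom(reader);
                }
                else
                {
                    reader.Read();
                }
            }
        }
    }

    // Same thing directly over the HTTPS response body.
    public static IEnumerable<XElement> ReadElementsFromUrl(string url, string elementName)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        using (WebResponse response = request.GetResponse())
        using (Stream body = response.GetResponseStream())
        {
            foreach (XElement element in ReadElements(body, elementName))
            {
                yield return element;
            }
        }
    }
}
```

The caller can start working on the first element long before the download finishes. The total transfer time does not shrink, but processing overlaps with it and memory use stays roughly constant.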

If you need random access to the file (i.e. reading from the middle), you may need to do more work to create a seekable stream (if the server supports the Range header in the request), or simply download the whole file as you do now.
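A rough sketch of a ranged read, assuming the server honours the `Range` header and answers with 206 Partial Content. `RangeDownload`, `CreateRangeRequest`, and `ReadRange` are hypothetical names; `HttpWebRequest.AddRange` is the real framework method.

```csharp
using System;
using System.IO;
using System.Net;

public static class RangeDownload
{
    // Builds a request for the inclusive byte range [from, to].
    public static HttpWebRequest CreateRangeRequest(string url, long from, long to)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.AddRange(from, to); // sets the header "Range: bytes=from-to"
        return request;
    }

    // Fetches one slice of the file into memory. Keep slices small.
    public static byte[] ReadRange(string url, long from, long to)
    {
        HttpWebRequest request = CreateRangeRequest(url, from, to);
        using (var response = (HttpWebResponse)request.GetResponse())
        {
            // A server that ignores Range replies 200 with the full body,
            // which is exactly the 8 GB transfer this is meant to avoid.
            if (response.StatusCode != HttpStatusCode.PartialContent)
            {
                throw new NotSupportedException("Server did not honour the Range header.");
            }
            using (Stream body = response.GetResponseStream())
            using (var copy = new MemoryStream())
            {
                body.CopyTo(copy);
                return copy.ToArray();
            }
        }
    }
}
```

A byte range can start in the middle of an element, so this is mainly useful for resuming an interrupted download at a known offset, or when element offsets are already known.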

Please note that 8 GB is a large amount of data, and downloading it completely will take a long time regardless of how you read it.

You could upload the XML file to a block blob and download it from there. This blog post might help: http://blogs.msdn.com/b/kwill/archive/2011/05/30/asynchronous-parallel-block-blob-transfers-with-progress-change-notification.aspx

Hope this helps.
