How can I split a large JSON file in Azure Blob Storage into individual files for each record using Python?
I want to be able to split some large JSON files in blob storage (~1GB each) into individual files (one file per record).
I have tried using get_blob_to_stream from the Azure Python SDK, but am getting the following error:
AzureHttpError: Server failed to authenticate the request. Make sure the value of Authorization header is formed correctly including the signature.
To test, I've just been printing the text that has been downloaded from the blob, and haven't yet tried writing back to individual JSON files.
with BytesIO() as document:
    block_blob_service = BlockBlobService(account_name=STORAGE_ACCOUNT_NAME, account_key=STORAGE_ACCOUNT_KEY)
    block_blob_service.get_blob_to_stream(container_name=CONTAINER_NAME, blob_name=BLOB_ID, stream=document)
    print(document.getvalue())
Interestingly, when I limit the size of the blob information that I'm downloading, the error message doesn't appear, and I can get some information out:
with BytesIO() as document:
    block_blob_service = BlockBlobService(account_name=STORAGE_ACCOUNT_NAME, account_key=STORAGE_ACCOUNT_KEY)
    block_blob_service.get_blob_to_stream(container_name=CONTAINER_NAME, blob_name=BLOB_ID, stream=document, start_range=0, end_range=100000)
    print(document.getvalue())
Does anyone know what is going on here, or have any better approaches to splitting out a large JSON file?
Thanks!
The error message "Server failed to authenticate the request. Make sure the value of Authorization header is formed correctly including the signature" is what you usually get when the Authorization header is not formed correctly. When this happens, the response typically looks like the following:
<?xml version="1.0" encoding="utf-8"?>
<Error>
<Code>AuthenticationFailed</Code>
<Message>Server failed to authenticate the request. Make sure the value of Authorization header is formed correctly including the signature.
RequestId:096c6d73-f01e-0054-6816-e8eaed000000
Time:2019-03-31T23:08:43.6593937Z</Message>
<AuthenticationErrorDetail>Authentication scheme Bearer is not supported in this version.</AuthenticationErrorDetail>
</Error>
The solution is to add the following header:
x-ms-version: 2017-11-09
But since you say it works when you limit the size, it means you should write your code using a chunked approach. Here is something you can try.
import io
import datetime
from azure.storage.blob import BlockBlobService

acc_name = 'myaccount'
acc_key = 'my key'
container = 'storeai'
blob = "orderingai2.csv"

block_blob_service = BlockBlobService(account_name=acc_name, account_key=acc_key)
props = block_blob_service.get_blob_properties(container, blob)
blob_size = int(props.properties.content_length)
index = 0
chunk_size = 104858  # ~0.1 MB; don't make this too big or you will get a memory error
output = io.BytesIO()

def worker(data):
    print(data)

while index < blob_size:
    now_chunk = datetime.datetime.now()
    block_blob_service.get_blob_to_stream(container, blob, stream=output,
                                          start_range=index,
                                          end_range=index + chunk_size - 1,
                                          max_connections=50)
    if output is None:
        continue
    output.seek(index)
    data = output.read()
    length = len(data)
    index += length
    if length > 0:
        worker(data)
        if length < chunk_size:
            break
    else:
        break
Hope it helps.