How can I split a large JSON file in Azure Blob Storage into individual files for each record using Python?

I want to be able to split some large JSON files in blob storage (~1GB each) into individual files (one file per record).

I have tried using get_blob_to_stream from the Azure Python SDK, but am getting the following error:

AzureHttpError: Server failed to authenticate the request. Make sure the value of Authorization header is formed correctly including the signature.

To test, I've just been printing the text that has been downloaded from the blob, and haven't yet tried writing back to individual JSON files:

from io import BytesIO
from azure.storage.blob import BlockBlobService

with BytesIO() as document:
    block_blob_service = BlockBlobService(account_name=STORAGE_ACCOUNT_NAME, account_key=STORAGE_ACCOUNT_KEY)
    # Download the entire blob into the in-memory stream
    block_blob_service.get_blob_to_stream(container_name=CONTAINER_NAME, blob_name=BLOB_ID, stream=document)

    print(document.getvalue())

Interestingly, when I limit the size of the blob information that I'm downloading, the error message doesn't appear, and I can get some information out:

with BytesIO() as document:
    block_blob_service = BlockBlobService(account_name=STORAGE_ACCOUNT_NAME, account_key=STORAGE_ACCOUNT_KEY)
    block_blob_service.get_blob_to_stream(container_name=CONTAINER_NAME, blob_name=BLOB_ID, stream=document, start_range=0, end_range=100000)

    print(document.getvalue())

Does anyone know what is going on here, or have any better approaches to splitting a large JSON file?

Thanks!

The error message "Server failed to authenticate the request. Make sure the value of Authorization header is formed correctly including the signature" usually appears when the Authorization header is not formed correctly. When you hit this error, the response body looks like the following:

<?xml version="1.0" encoding="utf-8"?>
<Error>
    <Code>AuthenticationFailed</Code>
    <Message>Server failed to authenticate the request. Make sure the value of Authorization header is formed correctly including the signature.
RequestId:096c6d73-f01e-0054-6816-e8eaed000000
Time:2019-03-31T23:08:43.6593937Z</Message>
    <AuthenticationErrorDetail>Authentication scheme Bearer is not supported in this version.</AuthenticationErrorDetail>
</Error>

The solution is to add the following header:

x-ms-version: 2017-11-09
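
For context, here is a minimal sketch of where that header would go, assuming you were calling the Blob REST API directly with an AAD bearer token (the URL and token below are placeholders). The AuthenticationErrorDetail above mentions the Bearer scheme because Bearer authentication is only supported from service version 2017-11-09 onwards:

import requests

# Placeholder URL and token -- substitute your own values
url = 'https://myaccount.blob.core.windows.net/mycontainer/myblob.json'
headers = {
    'Authorization': 'Bearer <aad-access-token>',
    # Bearer auth requires service version 2017-11-09 or later
    'x-ms-version': '2017-11-09',
}

response = requests.get(url, headers=headers)
response.raise_for_status()
print(len(response.content))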

But since you say that it works when you limit the download size, you will need to write your code using a chunked approach. Here is something you can try:

import io
from azure.storage.blob import BlockBlobService

acc_name = 'myaccount'
acc_key = 'my key'
container = 'storeai'
blob = "orderingai2.csv"

block_blob_service = BlockBlobService(account_name=acc_name, account_key=acc_key)
props = block_blob_service.get_blob_properties(container, blob)
blob_size = int(props.properties.content_length)
index = 0
chunk_size = 104858  # ~0.1 MB; don't make this too big or you will get a memory error
output = io.BytesIO()


def worker(data):
    print(data)


while index < blob_size:
    # Download the next chunk_size bytes into the shared stream
    block_blob_service.get_blob_to_stream(container, blob, stream=output,
                                          start_range=index,
                                          end_range=index + chunk_size - 1,
                                          max_connections=50)
    output.seek(index)
    data = output.read()
    length = len(data)
    index += length
    if length > 0:
        worker(data)
        if length < chunk_size:
            break
    else:
        break
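
Once the chunked download works, the actual split into one file per record could look something like this minimal sketch. It assumes the blob holds a top-level JSON array and uses hypothetical container and blob names; create_blob_from_text writes each record back as its own blob:

import io
import json
from azure.storage.blob import BlockBlobService

# Hypothetical account, container and blob names -- replace with your own
block_blob_service = BlockBlobService(account_name='myaccount', account_key='my key')

stream = io.BytesIO()
block_blob_service.get_blob_to_stream('mycontainer', 'large.json', stream=stream)
stream.seek(0)

# Assumes the file is one top-level JSON array of records
records = json.loads(stream.read().decode('utf-8'))

for i, record in enumerate(records):
    block_blob_service.create_blob_from_text(
        'records', 'record-{}.json'.format(i), json.dumps(record))

Note that this loads the whole blob into memory; for files much larger than 1GB, an incremental parser such as the third-party ijson package would let you walk the array one record at a time instead of calling json.loads on the entire document.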

Hope it helps.
