
How to access a GCS Blob that contains an xml file in a bucket with the pandas.read_xml() function in python?


I would like to access a blob file via the pandas.read_xml() function, like this:

pandas.read_xml(blob.open())

Printing the blob gives:

<Blob: Bucket, filename.0.xml.gz, 1612169959288959>

The blob.open() function returns:

<_io.TextIOWrapper encoding='iso-8859-1'>

and I get the error UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte. When I change the code to blob.open(mode='rt', encoding='iso-8859-1'), I get the error lxml.etree.XMLSyntaxError: Start tag expected, '<' not found, line 1, column 1.

Is there a way to read an XML file from a bucket on GCS?

read_xml() can read GCS files directly. Pass it the GCS URI and it will parse the file into a DataFrame. The 0x8b byte in your error is part of the gzip magic number (0x1f 0x8b): the file is gzip-compressed, so it has to be decompressed before it can be parsed as XML. See the sample code and test below:

Sample file stored in GCS:

<?xml version="1.0" encoding="UTF-8"?>
<root xmlns="http://example.com">
    <bathrooms>
        <n35237 type="number">1.0</n35237>
        <n32238 type="number">3.0</n32238>
        <n44699 type="number">nan</n44699>
    </bathrooms>
    <price>
        <n35237 type="number">7020000.0</n35237>
        <n32238 type="number">10000000.0</n32238>
        <n44699 type="number">4128000.0</n44699>
    </price>
    <property_id>
        <n35237 type="number">35237.0</n35237>
        <n32238 type="number">32238.0</n32238>
        <n44699 type="number">44699.0</n44699>
    </property_id>
</root>

Code:

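import pandas as pd

# Reading gs:// URIs requires the gcsfs package to be installed.
# compression="gzip" makes pandas decompress the file before parsing.
df = pd.read_xml("gs://my-bucket/note.xml.gz", compression="gzip")

print(df)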

Output:

(screenshot of the printed DataFrame; image not preserved)
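
If you already hold a Blob object from the google-cloud-storage client, as in the question, another option is to download the raw bytes and decompress them yourself before handing them to read_xml(). The following is only a minimal sketch: the helper names maybe_gunzip and read_xml_from_blob are made up for illustration, and it assumes the blob exposes download_as_bytes() and that the whole file fits in memory.

```python
import gzip
import io

# First two bytes of every gzip stream; 0x8b is the byte named in the
# UnicodeDecodeError from the question.
GZIP_MAGIC = b"\x1f\x8b"


def maybe_gunzip(raw: bytes) -> bytes:
    """Return decompressed bytes if raw looks like gzip, else raw unchanged."""
    if raw[:2] == GZIP_MAGIC:
        return gzip.decompress(raw)
    return raw


def read_xml_from_blob(blob):
    """Hypothetical helper: parse a (possibly gzipped) XML blob into a DataFrame.

    Assumes `blob` is a google.cloud.storage.Blob and that pandas and an
    XML parser such as lxml are installed.
    """
    import pandas as pd  # imported here so maybe_gunzip works without pandas

    raw = blob.download_as_bytes()
    return pd.read_xml(io.BytesIO(maybe_gunzip(raw)))
```

Checking the magic bytes rather than the file name means the same helper should also work when the bytes arrive already decompressed, for example if the object is stored with a gzip Content-Encoding and the server decompresses it on download.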

