How to access a GCS Blob that contains an XML file in a bucket with the pandas.read_xml() function in Python?
I would like to read a blob file via the pandas.read_xml() function, like this:
pandas.read_xml(blob.open())
When I print the blob, it looks like this:
<Blob: Bucket, filename.0.xml.gz, 1612169959288959>
and blob.open() returns this:
<_io.TextIOWrapper encoding='iso-8859-1'>
Calling pandas.read_xml(blob.open()) then fails with:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
When I change the code to
blob.open(mode='rt', encoding='iso-8859-1')
I get the error
lxml.etree.XMLSyntaxError: Start tag expected, '<' not found, line 1, column 1
Is there a way to read an XML file from a bucket on GCS at all?
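Both errors point at the same cause: the file name ends in .gz, so the blob holds gzip-compressed bytes rather than plain XML text. The sketch below reproduces the two failures locally with the standard library only. The XML snippet and variable names are made up for illustration, and nothing here touches GCS.

```python
import gzip

# Illustrative XML payload (not the file from the question).
xml_bytes = b'<?xml version="1.0" encoding="UTF-8"?><root><row><a>1</a></row></root>'
compressed = gzip.compress(xml_bytes)

# Every gzip stream starts with the two magic bytes 0x1f 0x8b.
magic = compressed[:2]
print(magic.hex())

# Decoding the compressed bytes as UTF-8 fails on the second byte (index 1),
# because 0x8b is not a valid UTF-8 start byte.
utf8_error = None
try:
    compressed.decode("utf-8")
except UnicodeDecodeError as exc:
    utf8_error = exc
    print(exc)

# iso-8859-1 maps every byte to some character, so decoding "succeeds",
# but the text is still compressed data and does not start with '<'.
latin1_text = compressed.decode("iso-8859-1")
print(latin1_text.startswith("<"))

# Decompressing first recovers the original XML.
restored = gzip.decompress(compressed)
print(restored == xml_bytes)
```

So changing the text encoding cannot help; the data has to be decompressed before an XML parser sees it.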
read_xml() can read files from GCS directly (pandas relies on the gcsfs package for gs:// paths). Just provide the GCS URI and it will load the file into a DataFrame. See the sample code and test below.
Sample file stored in GCS:
<?xml version="1.0" encoding="UTF-8"?>
<root xmlns="http://example.com">
<bathrooms>
<n35237 type="number">1.0</n35237>
<n32238 type="number">3.0</n32238>
<n44699 type="number">nan</n44699>
</bathrooms>
<price>
<n35237 type="number">7020000.0</n35237>
<n32238 type="number">10000000.0</n32238>
<n44699 type="number">4128000.0</n44699>
</price>
<property_id>
<n35237 type="number">35237.0</n35237>
<n32238 type="number">32238.0</n32238>
<n44699 type="number">44699.0</n44699>
</property_id>
</root>
Code:
import pandas as pd

# compression="gzip" makes pandas decompress the .gz file before parsing it
df = pd.read_xml("gs://my-bucket/note.xml.gz", compression="gzip")
print(df)
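If you already hold a Blob object and would rather not go through a gs:// URI, one option is to open the blob in binary mode and decompress it yourself before handing it to pandas. The following is a minimal sketch under assumptions: decompress_stream and blob_to_dataframe are hypothetical helper names, the blob is assumed to be a google-cloud-storage Blob whose contents are raw gzip data, and pandas plus an XML parser (lxml by default) are assumed to be installed.

```python
import gzip
import io


def decompress_stream(binary_file) -> io.BytesIO:
    """Read gzip data from a binary file-like object and return the
    decompressed bytes as an in-memory binary buffer."""
    with gzip.open(binary_file, mode="rb") as unzipped:
        return io.BytesIO(unzipped.read())


def blob_to_dataframe(blob):
    """Parse a gzip-compressed XML blob into a DataFrame.

    Assumes `blob` is a google.cloud.storage Blob; pandas is imported
    here so the helper above stays usable without it.
    """
    import pandas as pd

    # Binary mode avoids any text decoding of the compressed bytes.
    with blob.open("rb") as raw:
        return pd.read_xml(decompress_stream(raw))
```

With the blob from the question, this would be called as df = blob_to_dataframe(blob).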