How to find where 2 pyarrow dataset schemas differ?

I have two pyarrow dataset schemas that should be identical but are not. My guess is that when one of the parquet files was stored, a column in one partition was cast to a different data type, but I have no idea which column it is.

I know how to check whether two schemas are equal. I can do that like so:

import pandas as pd
import numpy as np
import pyarrow as pa

# Two frames with the same column names and the same float dtypes.
df1 = pd.DataFrame({'col1': np.zeros(10), 'col2': np.random.rand(10)})
df2 = pd.DataFrame({'col1': np.ones(10), 'col2': np.zeros(10)})

schema_1 = pa.Schema.from_pandas(df1)
schema_2 = pa.Schema.from_pandas(df2)

schema_1.equals(schema_2)

# Third frame: same columns, but col2 is cast to an integer type.
df3 = df2.copy()
df3['col2'] = df3['col2'].astype('int')

schema_3 = pa.Schema.from_pandas(df3)

# schema_1 equals schema_2 but not schema_3.
print(schema_1.equals(schema_2), schema_1.equals(schema_3))

But how do I find out where they differ? (Visual inspection doesn't count: I tried briefly and couldn't spot any difference across more than 500 columns.)

Each schema is basically an ordered collection of pyarrow.Field objects. Two schemas can therefore differ in a field's name, its type, or another property of the field such as nullability. Field order matters as well: two schemas holding the same fields in a different order do not compare equal.

To find the fields in schema_3 that are not in schema_1, use sets. A field whose type changed shows up in the result, because it no longer equals the field of the same name in schema_1.

set(schema_3).difference(set(schema_1))

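To find just the names that appear in schema_3 but not in schema_1, use the .names property. Note that this only catches added or renamed columns: a column that kept its name but changed type will not show up.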

set(schema_3.names).difference(set(schema_1.names))
