Why can PyArrow read an additional index column while a pandas DataFrame cannot?
I have the following code:
import pandas as pd
import dask.dataframe as da
from pyarrow.parquet import ParquetFile
df = pd.DataFrame([1, 2, 3], columns=["value"])
my_dataset = da.from_pandas(df, chunksize=3)
save_dir = './local/'
my_dataset.to_parquet(save_dir)
pa = ParquetFile("./local/part.0.parquet")
print(pa.schema.names)
df2 = pd.read_parquet("./local/part.0.parquet")
print(df2.columns)
The output is:
['value', '__null_dask_index__']
Index(['value'], dtype='object')
Just curious, why did the pandas DataFrame ignore the __null_dask_index__ column name? Or is __null_dask_index__ not considered a column?
pandas will read __null_dask_index__ and use it (correctly) as the index, so it doesn't show up in the list of columns. To see this clearly, specify a custom index (e.g. 4, 5, 6) and then inspect the head of the df2 dataframe:
from pandas import DataFrame
from dask.dataframe import from_pandas
from pyarrow.parquet import ParquetFile
df = DataFrame([1, 2, 3], columns=["value"], index=[4,5,6])
my_dataset = from_pandas(df, chunksize=2)
save_dir = './local/'
my_dataset.to_parquet(save_dir)
pa = ParquetFile("./local/part.0.parquet")
print(pa.schema.names)
from pandas import read_parquet
df2 = read_parquet("./local/part.0.parquet")
print(df2.head())
#                      value
# __null_dask_index__
# 4                        1
# 5                        2
The Parquet files created by dask and pandas (via pyarrow or fastparquet) contain a special metadata area specifying column and index attributes for use by pandas/dask. The Parquet schema itself does not mark any column as an index, so ParquetFile.schema.names lists the stored index as an ordinary column, while pandas consults the metadata and restores it as the index.