Writing Pandas DataFrame to Parquet file?
I am using pandas.read_sql to read the data in chunks and append it to a Parquet file, but I am getting errors.
Using pyarrow.parquet:
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

for chunk in pd.read_sql_query(query, conn, chunksize=10000):
    table_data = pa.Table.from_pandas(chunk)  # converting df to arrow
    pq.write_table(table=table_data, where='file.parquet',
                   use_deprecated_int96_timestamps=True,
                   coerce_timestamps='ms', allow_truncated_timestamps=True)
I get the following error:
File "pyarrow\_parquet.pyx", line 1427, in pyarrow._parquet.ParquetWriter.__cinit__
File "pyarrow\error.pxi", line 120, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: Unhandled type for Arrow to Parquet schema conversion: duration[ns]
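The duration[ns] type comes from a pandas timedelta64[ns] column: the pyarrow version in use has no Parquet mapping for Arrow duration values. One workaround is to convert those columns to a plain numeric type before building the table; a minimal sketch per chunk (the column selection logic is an assumption, not from the original post):

import pandas as pd
import pyarrow as pa

# chunk is one DataFrame yielded by pd.read_sql_query(...)
# total_seconds() turns timedelta64[ns] into float64, which pyarrow
# can write to Parquet without needing a duration type
for col in chunk.select_dtypes(include=['timedelta64[ns]']).columns:
    chunk[col] = chunk[col].dt.total_seconds()

table_data = pa.Table.from_pandas(chunk)  # no duration[ns] columns remain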
Using fastparquet:
import os
import pandas as pd
from fastparquet import write

for chunk in pd.read_sql_query(query, conn, chunksize=10000):
    with open(os.path.join(download_location, file_name + '.parquet'), mode="a+") as f:
        write(filename=f.name, data=chunk, append=True)
I get the following error:
raise ValueError("Can't infer object conversion type: %s" % head)
ValueError: Can't infer object conversion type: 0 2021-09-06
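This fastparquet error points at an object-dtype column holding Python date values (the `0 2021-09-06` fragment is the head of that column). One possible fix is to convert such columns to datetime64 before writing, so fastparquet has a concrete type to encode; a minimal sketch, where 'order_date' is a hypothetical column name standing in for the real one:

import pandas as pd

# 'order_date' is a hypothetical column name; substitute the actual
# object-dtype date column returned by your query
chunk['order_date'] = pd.to_datetime(chunk['order_date'])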
Is there any solution to write a pandas DataFrame to a Parquet file (in append mode) without these datetime column issues?
The solution is to convert each pandas DataFrame chunk to str and write it to the Parquet file with a string schema:
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

query = f'select * from `{table["table_name"]}`'
for i, chunk in enumerate(pd.read_sql_query(query, conn, chunksize=10000)):
    all_columns = list(chunk)  # creates list of all column headers
    chunk[all_columns] = chunk[all_columns].astype(str)  # convert data to string
    table_schema = chunk.dtypes.astype(str).to_dict()
    for k, v in table_schema.items():
        if v == 'object':
            table_schema[k] = pa.string()
    # create pyarrow schema for string format
    fields = [pa.field(x, y) for x, y in table_schema.items()]
    new_schema = pa.schema(fields)
    if i == 0:
        pqwriter = pq.ParquetWriter(where=updated_path, schema=new_schema, compression='snappy')
    arrow_table = pa.Table.from_pandas(df=chunk, schema=new_schema, preserve_index=False, safe=False)
    pqwriter.write_table(arrow_table)
pqwriter.close()  # close the writer so the Parquet footer is written
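Note that casting every column to str trades away all type information in the written file. If only the datetime-like columns are the problem, a narrower variant stringifies just those and lets from_pandas infer the rest; a minimal sketch under that assumption (the timedelta-only column selection is illustrative, not from the original answer):

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

pqwriter = None
for chunk in pd.read_sql_query(query, conn, chunksize=10000):
    # stringify only the timedelta columns; numeric/text columns keep their types
    for col in chunk.select_dtypes(include=['timedelta64[ns]']).columns:
        chunk[col] = chunk[col].astype(str)
    arrow_table = pa.Table.from_pandas(chunk, preserve_index=False)
    if pqwriter is None:
        pqwriter = pq.ParquetWriter(where=updated_path, schema=arrow_table.schema,
                                    compression='snappy')
    pqwriter.write_table(arrow_table)
if pqwriter is not None:
    pqwriter.close()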