Pyarrow/Parquet - Cast all null columns to string during batch processing

Question

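I have a problem with my code that I have not been able to solve so far.

I am trying to convert tar.gz-compressed CSV files to Parquet. Each file is about 700 MB uncompressed. The processing runs on a memory-constrained system, so I have to process the file in batches. I figured out how to read the tar.gz as a stream, extract the file I need, and use pyarrow's open_csv() to read batches. From there, I want to save the data to a Parquet file by writing it batch by batch. This is where the problem appears. The file has many columns without any values. Occasionally, though, a value shows up around row 500,000 or so, so pyarrow does not infer the data type correctly and most of these columns come out as null. My idea was to modify the schema and cast these columns to string, so that any value is valid. Modifying the schema works fine, but when I run the code I get this error.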

Traceback (most recent call last):
  File "b:\snippets\tar_converter.py", line 38, in <module>
    batch = reader.read_next_batch()
  File "pyarrow\ipc.pxi", line 682, in pyarrow.lib.RecordBatchReader.read_next_batch
  File "pyarrow\error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: In CSV column #49: CSV conversion error to null: invalid value '0.0000'

Line 38 is this line:

batch = reader.read_next_batch()

Does anyone know how to enforce the schema on the batches? Here is my code.

import io
import os
import tarfile
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.csv as csv
import logging

srcs = list()
path = "C:\\data"
for root, dirs, files in os.walk(path):
   for name in files:
       if name.endswith("tar.gz"):
           srcs.append(os.path.join(root, name))

for source_file_name in srcs:

    file_name: str = source_file_name.replace(".tar.gz", "")
    target_file_name: str = source_file_name.replace(".tar.gz", ".parquet")
    clean_file_name: str = os.path.basename(source_file_name.replace(".tar.gz", ""))

    # download CSV file, preserving folder structure
    logging.info(f"Processing '{source_file_name}'.")
    with io.open(source_file_name, "rb") as file_obj_in:
        # unpack all files to temp_path
        file_obj_in.seek(0)
        with tarfile.open(fileobj=file_obj_in, mode="r") as tf:
            file_obj = tf.extractfile(f"{clean_file_name}.csv")
            file_obj.seek(0)
            reader = csv.open_csv(file_obj, read_options=csv.ReadOptions(block_size=25*1024*1024))
            schema = reader.schema
            null_cols = list()
            for index, entry in enumerate(schema.types):
                if entry.equals(pa.null()):
                    schema = schema.set(index, schema.field(index).with_type(pa.string()))
                    null_cols.append(index)

            with pq.ParquetWriter(target_file_name, schema) as writer:
                while True:
                    try:
                        batch = reader.read_next_batch()
                        table = pa.Table.from_batches(batches=[batch]).cast(target_schema=schema)
                        batch = table.to_batches()[0]
                        writer.write_batch(batch)
                    except StopIteration:
                        break

Alternatively, I can omit this part:

batch = reader.read_next_batch()
table = pa.Table.from_batches(batches=[batch]).cast(target_schema=schema)
batch = table.to_batches()[0]

But then the error looks like this (shortened), which at least shows that the schema change took effect.

Traceback (most recent call last):
  File "b:\snippets\tar_converter.py", line 39, in <module>
    writer.write_batch(batch)
  File "C:\Users\me\AppData\Roaming\Python\Python39\site-packages\pyarrow\parquet\__init__.py", line 981, in write_batch
    self.write_table(table, row_group_size)
  File "C:\Users\me\AppData\Roaming\Python\Python39\site-packages\pyarrow\parquet\__init__.py", line 1004, in write_table
    raise ValueError(msg)
ValueError: Table schema does not match schema used to create file:
table:
ACCOUNT_NAME: string
BOOK_VALUE: double
ESTIMATED_TO_REALISE: double
VAT_PAYABLE_ID: null
VAT_RECEIVABLE_ID: null
MONTHLY_AMOUNT_EFFECTIVE_DATE: null vs.
file:
ACCOUNT_NAME: string
BOOK_VALUE: double
ESTIMATED_TO_REALISE: double
VAT_PAYABLE_ID: string
VAT_RECEIVABLE_ID: string
MONTHLY_AMOUNT_EFFECTIVE_DATE: string

Thanks!

Answer 1

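So I think I figured it out. I wanted to post it for anyone with a similar problem. Also, thanks to everyone who took a look and helped!

I solved it by reading the file twice. On the first pass, I only read the first batch into a stream to get the schema. Then I convert the null columns to string and close the stream (this is important if you reuse the same variable name). After that, you read the file again, but this time pass the modified schema to the reader through ConvertOptions (the column_types argument). Thanks to @0x26res, whose comment gave me the idea.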

# get initial schema by reading one batch
initial_reader = csv.open_csv(file_obj, read_options=csv.ReadOptions(block_size=16*1024*1024))
schema = initial_reader.schema
for index, entry in enumerate(schema.types):
    if entry.equals(pa.null()):
        schema = schema.set(index, schema.field(index).with_type(pa.string()))

# now use the modified schema for the second reader
# must close the old stream first, otherwise wrong data is loaded
file_obj.close()
# clean_file_name is the archive member name used in the question's code
file_obj = tf.extractfile(f"{clean_file_name}.csv")
file_obj.seek(0)
reader = csv.open_csv(file_obj,
                      read_options=csv.ReadOptions(block_size=16*1024*1024),
                      convert_options=csv.ConvertOptions(column_types=schema))

Answer 1 (0, accepted, 2022-06-16 10:07:32)