
Reading a Parquet file is slower in C++ than in Python

I have written code to read the same Parquet file in C++ and in Python. Reading the file takes much less time in Python than in C++, even though C++ code generally executes faster than Python. Here is the C++ code:

#include <arrow/api.h>
#include <parquet/arrow/reader.h>
#include <arrow/filesystem/localfs.h>
#include <chrono>
#include <iostream>

int main(){
   // ...
   arrow::Status st;
   arrow::MemoryPool* pool = arrow::default_memory_pool();
   arrow::fs::LocalFileSystem file_system;
   std::shared_ptr<arrow::io::RandomAccessFile> input = file_system.OpenInputFile("data.parquet").ValueOrDie();

   // Open Parquet file reader
   std::unique_ptr<parquet::arrow::FileReader> arrow_reader;
   st = parquet::arrow::OpenFile(input, pool, &arrow_reader);
   if (!st.ok()) {
      // Handle error instantiating file reader...
   }

   // Read entire file as a single Arrow table
   std::shared_ptr<arrow::Table> table;
   auto t1 = std::chrono::high_resolution_clock::now();
   st = arrow_reader->ReadTable(&table);
   auto t2 = std::chrono::high_resolution_clock::now();
   if (!st.ok()) {
      // Handle error reading Parquet data...
   }
   else{
       auto ms_int = std::chrono::duration_cast<std::chrono::milliseconds> (t2 - t1);
       std::cout << "Time taken to read parquet file is : " << ms_int.count() << "ms\n";
   }
}

The code I used in Python is:

#!/usr/bin/env python3
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import time

start_time = time.time()

table = pq.read_table('data.parquet')

end_time = time.time()

print("Time taken to read parquet is : ",(end_time - start_time)*1000, "ms")

Running the C++ code on a file of about 87 MB gives this output:

Time taken to read parquet file is: 186ms

While for Python the output is:

Time taken to read parquet is: 108.66141319274902 ms

Why is there such a difference in execution time between reading the table in C++ and calling read_table in Python?

The Python pq.read_table is based on the exact same C++ APIs that you are using in your example (under the hood it also uses the C++ parquet::arrow::FileReader), as both the Python and C++ APIs come from the same Arrow project.
So, apart from a tiny bit of Python call-stack overhead, both approaches would be expected to perform the same.

There are, however, several options you can specify or tune to improve performance, and these may explain the difference in your case. For example, the Python function reads the file in parallel by default (you can pass use_threads=False to disable this). The C++ FileReader, on the other hand, does not do this by default (see set_use_threads). The Python reader might set other options by default as well.
In addition, the exact build flags used when compiling your C++ example can also have an influence.

It is likely that the Python module is bound to functions compiled in a language such as C++, or written using Cython. The implementation of the Python module may thus have better performance, depending on how it reads from the file or processes data.

1 second is 1000 milliseconds, so the difference is not quite as large as it may seem. That aside, many Python library functions are implemented in compiled code (C extensions or Cython), which puts them on a fairly even playing field. Then it just depends on how well the function is written and optimized. In this case, it is likely that the Python function was more optimized than the C++ one.

If you want a comparison, try this C++ code:

#include <cassert>
#include <chrono>
#include <cstdlib>
#include <iostream>

using namespace std::chrono;

#include <arrow/api.h>
#include <arrow/filesystem/api.h>
#include <parquet/arrow/reader.h>

using arrow::Result;
using arrow::Status;

namespace {

Result<std::unique_ptr<parquet::arrow::FileReader>> OpenReader() {
  arrow::fs::LocalFileSystem file_system;
  ARROW_ASSIGN_OR_RAISE(auto input, file_system.OpenInputFile("data.parquet"));

  parquet::ArrowReaderProperties arrow_reader_properties =
      parquet::default_arrow_reader_properties();

  arrow_reader_properties.set_pre_buffer(true);
  arrow_reader_properties.set_use_threads(true);

  parquet::ReaderProperties reader_properties =
      parquet::default_reader_properties();

  // Open Parquet file reader
  std::unique_ptr<parquet::arrow::FileReader> arrow_reader;
  auto reader_builder = parquet::arrow::FileReaderBuilder();
  reader_builder.properties(arrow_reader_properties);
  ARROW_RETURN_NOT_OK(reader_builder.Open(std::move(input), reader_properties));
  ARROW_RETURN_NOT_OK(reader_builder.Build(&arrow_reader));

  return arrow_reader;
}

Status RunMain(int argc, char **argv) {
  // Read entire file as a single Arrow table
  std::shared_ptr<arrow::Table> table;
  for (auto i = 0; i < 10; i++) {
    ARROW_ASSIGN_OR_RAISE(auto arrow_reader, OpenReader());
    auto t1 = std::chrono::high_resolution_clock::now();
    ARROW_RETURN_NOT_OK(arrow_reader->ReadTable(&table));
    std::cout << table->num_rows() << "," << table->num_columns() << std::endl;
    auto t2 = std::chrono::high_resolution_clock::now();
    auto ms_int =
        std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1);
    std::cout << "Time taken to read parquet file is : " << ms_int.count()
              << "ms\n";
  }

  return Status::OK();
}

} // namespace

int main(int argc, char **argv) {
  Status st = RunMain(argc, argv);
  if (!st.ok()) {
    std::cerr << st << std::endl;
    return 1;
  }
  return 0;
}

Then compare with this Python code:

#!/usr/bin/env python3
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import time

for i in range(10):
    parquet_file = pq.ParquetFile('/home/pace/experiments/so4/data.parquet', pre_buffer=True)
    start_time = time.time()
    table = parquet_file.read()
    end_time = time.time()
    print("Time taken to read parquet is : ",(end_time - start_time)*1000, "ms")

On my system, after 10 runs of each, a t-test fails to distinguish the two timing distributions (p=0.64).
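The answer does not say which t-test was used. As a sketch of how the two sets of timings could be compared, the following assumes Welch's unequal-variance t-test and computes only the statistic and the degrees of freedom using the standard library. The welch_t helper and the sample timings are hypothetical placeholders, not measurements from this thread; a statistics package would be needed to turn the result into a p-value.

```python
import math
import statistics


def welch_t(a, b):
    """Return (t statistic, degrees of freedom) for Welch's t-test on a and b."""
    var_a = statistics.variance(a) / len(a)  # squared standard error of mean(a)
    var_b = statistics.variance(b) / len(b)  # squared standard error of mean(b)
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(var_a + var_b)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (var_a + var_b) ** 2 / (
        var_a ** 2 / (len(a) - 1) + var_b ** 2 / (len(b) - 1)
    )
    return t, df


# Placeholder timings in ms, NOT measurements from this thread.
# Replace them with the 10 values printed by each program.
cpp_ms = [101.0, 99.0, 104.0, 98.0, 102.0]
py_ms = [100.0, 103.0, 97.0, 101.0, 105.0]

t, df = welch_t(cpp_ms, py_ms)
print(f"t = {t:.3f}, df = {df:.1f}")
```

A t value close to zero relative to the degrees of freedom means the two sets of timings cannot be told apart, which is what the result above reports.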
