Efficiently build a Polars DataFrame row by row in Rust

I would like to create a large Polars DataFrame using Rust, building it up row by row using data scraped from web pages. What is an efficient way to do this?

It looks like the DataFrame should be created from a Vec of Series rather than adding rows to an empty DataFrame. However, how should a Series be built up efficiently? I could create a Vec and then create a Series from the Vec, but that sounds like it will end up copying all elements. Is there a way to build up a Series element-by-element, and then build a DataFrame from those?

I will actually be building up several DataFrames in parallel using Rayon, then combining them, but it looks like vstack does what I want there. It's the creation of the individual DataFrames that I can't find out how to do efficiently.

I did look at the source of the CSV parser, but it is very complicated and probably highly optimised. Is there a simpler approach that is still reasonably efficient?

pub fn from_vec(
    name: &str,
    v: Vec<<T as PolarsNumericType>::Native, Global>
) -> ChunkedArray<T>

Create a new ChunkedArray by taking ownership of the Vec. This operation is zero copy.

The signature and description above come from the Polars documentation for ChunkedArray::from_vec. You can then call into_series on the result to get a Series.
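To see why taking ownership of a Vec avoids the copy the question worries about, here is a minimal sketch using only the standard library. OwnedColumn and build_and_hand_off are hypothetical names made up for this illustration, not Polars types or functions; the sketch only shows that moving a Vec into another value hands over the same heap buffer rather than copying the elements.

```rust
/// Hypothetical stand-in for a column type that takes ownership of a Vec.
/// This is NOT a Polars type; it only illustrates the move semantics.
struct OwnedColumn {
    name: String,
    values: Vec<f64>,
}

impl OwnedColumn {
    /// Takes the Vec by value, so the heap buffer is moved, not cloned.
    fn from_vec(name: &str, values: Vec<f64>) -> Self {
        OwnedColumn { name: name.to_string(), values }
    }
}

/// Builds a column element by element, then returns the buffer address
/// before and after the hand-off so the two can be compared.
fn build_and_hand_off(n: usize) -> (usize, usize, OwnedColumn) {
    let mut values = Vec::with_capacity(n); // reserve once up front
    for i in 0..n {
        values.push(i as f64); // stands in for one push per scraped row
    }
    let before = values.as_ptr() as usize;
    let column = OwnedColumn::from_vec("price", values);
    let after = column.values.as_ptr() as usize;
    (before, after, column)
}

fn main() {
    let (before, after, column) = build_and_hand_off(1_000);
    println!(
        "{}: {} values, same buffer: {}",
        column.name,
        column.values.len(),
        before == after
    );
}
```

Reserving capacity up front, when the row count is roughly known, also avoids repeated reallocation while the Vec grows.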

The simplest, if perhaps not the most performant, answer is to just maintain a map of vectors and turn them into the Series that get fed to a DataFrame all at once.

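use polars::prelude::*;
use std::collections::BTreeMap;

// Column name -> values collected so far. f64 is assumed here purely for
// illustration; use whatever single type your values share.
let mut columns: BTreeMap<String, Vec<f64>> = BTreeMap::new();
for datum in get_data_from_web() {
    // For simplicity suppose datum is itself a BTreeMap<String, f64>
    // (More likely it's a serde_json::Value)
    // It's assumed that every datum has the same keys; if not, the
    // Vecs won't have the same length
    // It's also assumed that the values of datum are all of the same known type

    for (k, v) in datum {
        columns.entry(k).or_insert_with(Vec::new).push(v);
    }
}

// Build one Series per column, then the DataFrame in a single step.
// Assumption: a Polars version where Series::new takes the name as &str;
// adjust the name argument if your version expects a different type.
let df = DataFrame::new(
    columns
        .into_iter()
        .map(|(name, values)| Series::new(&name, values))
        .collect::<Vec<_>>(),
)
.unwrap();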
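When the fields of each row are known up front, the same column-wise accumulation can be sketched without a map, using one typed Vec per column. The Row and Columns types and their fields below are hypothetical, chosen only for illustration, and the sketch uses only the standard library. Each finished Vec would then be handed to Polars to become a Series, as in the code above. The append method is a buffer-level analogue of combining the results of several workers before building the final DataFrame.

```rust
/// Hypothetical row scraped from a page; the fields are illustrative only.
struct Row {
    url: String,
    price: f64,
}

/// One Vec per column. Pushing a whole Row at a time keeps every
/// column the same length.
#[derive(Default)]
struct Columns {
    url: Vec<String>,
    price: Vec<f64>,
}

impl Columns {
    /// Reserves space for `n` rows in every column.
    fn with_capacity(n: usize) -> Self {
        Columns {
            url: Vec::with_capacity(n),
            price: Vec::with_capacity(n),
        }
    }

    /// Appends one row by pushing each field onto its column.
    fn push(&mut self, row: Row) {
        self.url.push(row.url);
        self.price.push(row.price);
    }

    /// Moves all rows of `other` onto the end of `self`.
    fn append(&mut self, mut other: Columns) {
        self.url.append(&mut other.url);
        self.price.append(&mut other.price);
    }

    /// Number of rows collected so far.
    fn len(&self) -> usize {
        self.price.len()
    }
}

fn main() {
    // Two independent accumulators, as two workers might produce.
    let mut a = Columns::with_capacity(2);
    a.push(Row { url: "https://example.com/1".to_string(), price: 9.5 });
    a.push(Row { url: "https://example.com/2".to_string(), price: 12.0 });

    let mut b = Columns::default();
    b.push(Row { url: "https://example.com/3".to_string(), price: 7.25 });

    a.append(b);
    println!("{} rows collected", a.len());
}
```

Compared with the map of vectors, this trades flexibility for compile-time checking: each column can have its own type, and a row with a missing field cannot be pushed at all.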


 