Why does dask throw an error when setting a String column as an index?
I'm reading a large CSV with dask, setting the column's dtype to string, and then setting that column as the index:
dataframe = dd.read_csv(file_path, dtype={"colName": "string"}, blocksize=100e6)
dataframe.set_index("colName")
and it throws the following error:
TypeError: Cannot interpret 'StringDtype' as a data type
Why does this happen? How can I solve it?
As stated in a comment on a dask bug report for an unrelated issue (https://github.com/dask/dask/issues/7206#issuecomment-797221227):
When constructing the dask Array's meta object, we're currently assuming the underlying array type is a NumPy array, when in this case, it's actually going to be a pandas StringArray. But unlike pandas, NumPy doesn't know how to handle a StringDtype.
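The underlying incompatibility can be reproduced without dask at all. A minimal sketch: pandas' StringDtype is an extension dtype with no NumPy equivalent, so asking NumPy to interpret it fails, while plain object is a native NumPy dtype.

```python
import numpy as np
import pandas as pd

# StringDtype is a pandas extension dtype; NumPy cannot interpret it
# as one of its own dtypes, which is the root of the error above.
try:
    np.dtype(pd.StringDtype())
except TypeError as exc:
    print(f"TypeError: {exc}")

# The plain object dtype, by contrast, is a native NumPy dtype:
print(np.dtype("object"))
```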
Currently, changing the column type from string to object works around the issue, though it's unclear whether the original behavior is a bug or expected:
dataframe = dd.read_csv(file_path, dtype={"colName": "object"}, blocksize=100e6)
# set_index returns a new dataframe rather than modifying in place
dataframe = dataframe.set_index("colName")
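If the dataframe has already been read with a string dtype, the column can also be downcast just before indexing. The sketch below uses plain pandas with a hypothetical toy frame (dask mirrors the same astype/set_index API):

```python
import pandas as pd

# Hypothetical toy frame with a StringDtype column, standing in for
# the CSV loaded in the question:
df = pd.DataFrame({
    "colName": pd.array(["a", "b"], dtype="string"),
    "value": [1, 2],
})

# Downcast the extension dtype to plain object before indexing,
# sidestepping the NumPy-backed meta issue described above:
df["colName"] = df["colName"].astype("object")
df = df.set_index("colName")
print(df.index.dtype)  # object
```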