Why does dask throw an error when setting a String column as an index?
I'm reading a large CSV with dask, setting the column's dtype to string, and then setting that column as the index:
dataframe = dd.read_csv(file_path, dtype={"colName": "string"}, blocksize=100e6)
dataframe.set_index("colName")
and it throws the following error:
TypeError: Cannot interpret 'StringDtype' as a data type
Why does this happen? How can I solve it?
As stated in a comment on a dask bug report for an unrelated issue (https://github.com/dask/dask/issues/7206#issuecomment-797221227):
When constructing the dask Array's meta object, we're currently assuming the underlying array type is a NumPy array, when in this case, it's actually going to be a pandas StringArray. But unlike pandas, NumPy doesn't know how to handle a StringDtype.
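The underlying incompatibility can be reproduced without dask at all. A minimal sketch: pandas' StringDtype is an extension dtype with no NumPy equivalent, so asking NumPy to interpret it fails, while plain object is a native NumPy dtype.

```python
import numpy as np
import pandas as pd

# StringDtype is a pandas extension dtype; NumPy cannot interpret it
# as one of its own dtypes, which is the root of the error above.
try:
    np.dtype(pd.StringDtype())
except TypeError as exc:
    print(f"TypeError: {exc}")

# The plain object dtype, by contrast, is a native NumPy dtype:
print(np.dtype("object"))
```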
Currently, changing the column type from string to object works around the issue, though it's unclear whether the original behavior is a bug or expected:
dataframe = dd.read_csv(file_path, dtype={"colName": "object"}, blocksize=100e6)
# set_index returns a new dataframe rather than modifying in place
dataframe = dataframe.set_index("colName")
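If the dataframe has already been read with a string dtype, the column can also be downcast just before indexing. The sketch below uses plain pandas with a hypothetical toy frame (dask mirrors the same astype/set_index API):

```python
import pandas as pd

# Hypothetical toy frame with a StringDtype column, standing in for
# the CSV loaded in the question:
df = pd.DataFrame({
    "colName": pd.array(["a", "b"], dtype="string"),
    "value": [1, 2],
})

# Downcast the extension dtype to plain object before indexing,
# sidestepping the NumPy-backed meta issue described above:
df["colName"] = df["colName"].astype("object")
df = df.set_index("colName")
print(df.index.dtype)  # object
```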