Why does dask throw an error when setting a String column as an index?

I'm reading a large CSV with dask, setting the dtype of one column to string, and then setting that column as the index:

dataframe = dd.read_csv(file_path, dtype={"colName": "string"}, blocksize=100e6)
dataframe.set_index("colName")

and it throws the following error:

TypeError: Cannot interpret 'StringDtype' as a data type

Why does this happen? How can I solve it?

As stated in this bug report for an unrelated issue: https://github.com/dask/dask/issues/7206#issuecomment-797221227

When constructing the dask Array's meta object, we're currently assuming the underlying array type is a NumPy array, when in this case, it's actually going to be a pandas StringArray. But unlike pandas, NumPy doesn't know how to handle a StringDtype.
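The mismatch described in that comment can be reproduced without dask at all. The following is a minimal sketch, assuming pandas 1.0 or later (where `pd.StringDtype` exists); `numpy_can_interpret` is a hypothetical helper written for this illustration, not part of either library. The exact wording of the error message may vary between versions.

```python
import numpy as np
import pandas as pd


def numpy_can_interpret(dtype):
    """Return True if NumPy accepts `dtype`, False if it raises TypeError."""
    try:
        np.dtype(dtype)
        return True
    except TypeError:
        # NumPy has no notion of pandas extension dtypes such as StringDtype.
        return False


# "object" is a native NumPy dtype; pandas' StringDtype is not.
print(numpy_can_interpret("object"))
print(numpy_can_interpret(pd.StringDtype()))
```

Any code path that hands the pandas extension dtype to NumPy, as the meta construction in the quote appears to do, would fail in the same way.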

Currently, changing the column type from string to object solves the issue, but it's unclear whether this is a bug or expected behavior:

dataframe = dd.read_csv(file_path, dtype={"colName": "object"}, blocksize=100e6)
dataframe.set_index("colName")
