
Trying to convert kdb to csv using python, everything converts correctly except one column

I have converted a kdb query into a dataframe and then uploaded that dataframe to a csv file. This caused an encoding error which I easily fixed by decoding to utf-8. However, there is one column for which this did not work.

"nameFid" is the column which isn't working correctly, it outputs on the CSV file as " b'STRING' " “ nameFid”是无法正常工作的列,它在CSV文件上输出为“ b'STRING'”

I am running Python 3.7; I will be happy to provide any other information needed.

Here is my code which decodes the data in the dataframe I get from kdb:

    # Decode raw byte strings returned by kdb into regular Python strings,
    # one object-dtype column at a time
    for ba in df.dtypes.keys():
        if df.dtypes[ba] == 'O':
            try:
                df[ba] = df[ba].apply(lambda x: x.decode('UTF-8'))
            except Exception as e:
                print(e)
    return df

This worked for every column except "nameFid".

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xdc in position 6: invalid continuation byte

This is one error I get, but I thought it suggests that the data isn't encoded using UTF-8, which would surely mean none of the columns would work?

When using the try/except, it instead prints "'Series' object has no attribute 'decode'".

My goal is to remove the "b''" wrapper from the column values, which currently show

" b'STRING' "

I'm not sure what else I need to add. Let me know if you need anything.

Also, sorry, I am quite new to all of this.

Many encodings are partially compatible with one another. This is mostly due to the prevalence of ASCII, so a lot of encodings are backward compatible with ASCII but extend it differently. Hence, if your other columns only contain things like numbers, they are likely ASCII-only and will decode correctly under many different encodings.
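To see why, here is a small illustration in plain Python (no pandas needed): ASCII-only bytes decode to the same text under UTF-8 and Latin-1, while a byte such as the 0xdc from your traceback does not.

    # ASCII-only bytes decode identically under many encodings
    ascii_bytes = b"12345"
    print(ascii_bytes.decode("utf-8") == ascii_bytes.decode("latin-1"))  # True

    # A byte outside the ASCII range behaves differently per encoding
    raw = b"STRING\xdc"
    print(raw.decode("latin-1"))  # succeeds, gives 'STRINGÜ'
    try:
        raw.decode("utf-8")       # raises UnicodeDecodeError, like your nameFid column
    except UnicodeDecodeError as e:
        print(e)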

The column that raises an error, however, contains some character outside the normal ASCII range, and so the encoding starts to matter. If you don't know the encoding of the data, you can use chardet to try to guess it. Keep in mind that this is just guessing. Decoding with a different encoding may not raise any error, but it could result in the wrong characters appearing in the final text, so you should always know which encoding to use.
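For example, you could feed the raw bytes of the problem column to chardet. This is only a sketch and assumes nameFid still holds bytes objects at this point, as in your snippet:

    import chardet

    # Join the raw bytes of the problem column and let chardet guess the encoding
    sample = b"".join(df["nameFid"].dropna())
    guess = chardet.detect(sample)
    print(guess)  # e.g. {'encoding': 'ISO-8859-1', 'confidence': 0.73, 'language': ''}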

That said, if you are on Linux, the standard file utility is often able to give you a rough guess of the encoding used; for more advanced use cases, something like chardet is necessary.

Once you have found the correct encoding, say you found it is latin-1, simply replace decode('utf-8') with decode('latin-1').
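As a sketch of how that could fit into your loop (the per-column mapping and the choice of latin-1 for nameFid are only an example; use whatever encoding you actually identify):

    # Hypothetical per-column encoding override; everything else falls back to UTF-8
    encodings = {"nameFid": "latin-1"}

    for ba in df.dtypes.keys():
        if df.dtypes[ba] == 'O':
            enc = encodings.get(ba, "utf-8")
            df[ba] = df[ba].apply(
                lambda x, enc=enc: x.decode(enc) if isinstance(x, bytes) else x
            )

The isinstance check also leaves values that are already plain strings untouched.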
