
Chinese Text for H2O DataFrame in Python

I have a UTF-8 encoded CSV file containing Chinese text. When I import it as an h2o dataframe, the text is displayed as gibberish.

 dataframe = h2o.import_file('test.csv')

In the resulting dataframe, the column names are correct, but instead of Chinese text, it displays text like this:

 在ç�¡è¦ºäº†ä½ 知é�

I looked through the h2o documentation, and import_file doesn't seem to offer an encoding option the way pandas does. Further, when running the following:

testing = ['你','好','嗎']
h2o.H2OFrame(testing)

it gives this error:

--------------------------------------------------------------------------
 UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-2-5f4b3eb49a84> in <module>
      1 testing = ['你','好','嗎']
----> 2 h2o.H2OFrame(testing)

~\AppData\Local\Continuum\anaconda3\lib\site-packages\h2o\frame.py in __init__(self, python_obj, destination_frame, header, separator, column_names, column_types, na_strings, skipped_columns)
    104         if python_obj is not None:
    105             self._upload_python_object(python_obj, destination_frame, header, separator,
--> 106                                        column_names, column_types, na_strings, skipped_columns)
    107 
    108     @staticmethod

~\AppData\Local\Continuum\anaconda3\lib\site-packages\h2o\frame.py in _upload_python_object(self, python_obj, destination_frame, header, separator, column_names, column_types, na_strings, skipped_columns)
    143             csv_writer.writerow([row.get(k, None) for k in col_header])
    144         else:
--> 145             csv_writer.writerows(data_to_write)
    146         tmp_file.close()  # close the streams
    147         self._upload_parse(tmp_path, destination_frame, 1, separator, column_names, column_types, na_strings, skipped_columns)

~\AppData\Local\Continuum\anaconda3\lib\encodings\cp1252.py in encode(self, input, final)
     17 class IncrementalEncoder(codecs.IncrementalEncoder):
     18     def encode(self, input, final=False):
---> 19         return codecs.charmap_encode(input,self.errors,encoding_table)[0]
     20 
     21 class IncrementalDecoder(codecs.IncrementalDecoder):

UnicodeEncodeError: 'charmap' codec can't encode character '\u4f60' in position 1: character maps to <undefined>

Based on this error, it seems that h2o is using the cp1252 encoding. How can I get h2o to import a CSV file containing Chinese text as UTF-8? Thank you.
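For anyone trying to confirm the cp1252 suspicion, here is a minimal sketch in plain Python (no h2o involved) that reproduces both symptoms. The helper names decode_utf8_as and can_encode are made up for illustration. It only shows that UTF-8 bytes read as cp1252 produce this kind of gibberish and that cp1252 cannot represent '你'; it does not show where inside h2o the wrong encoding is applied.

```python
import locale


def decode_utf8_as(text, wrong_encoding="cp1252"):
    """Encode text as UTF-8, then decode the bytes with the wrong codec."""
    raw = text.encode("utf-8")
    # errors="replace" turns bytes the codec leaves undefined into U+FFFD,
    # which is typically rendered as the replacement-character glyph.
    return raw.decode(wrong_encoding, errors="replace")


def can_encode(text, encoding):
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# Same kind of mojibake as in the displayed dataframe.
garbled = decode_utf8_as("睡覺了")

# The encoding Python normally falls back to when a file is opened without
# an explicit encoding argument; often cp1252 on Western Windows setups.
print(locale.getpreferredencoding(False))

print(can_encode("你", "cp1252"))  # False: same failure as the traceback
print(can_encode("你", "utf-8"))   # True
```

If the first print shows cp1252, that is consistent with the traceback, where the csv writer ends up in encodings\cp1252.py.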

The JIRA ticket mentioned in the comments has been resolved, and newer versions of H2O no longer have this parsing problem. My recommendation is to upgrade: with the latest version of H2O you shouldn't run into this issue.

I tested your example with version 3.22.0.2 and got:

In [6]: h2o.H2OFrame(testing)
Parse progress: |█████████████████████████████████████████████████████████████████████████████| 100%
Out[6]:
C1
----
你
好
嗎

[3 rows x 1 column]
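To check whether an installed copy is at least as new as the version tested above, something like the following sketch could work. The helpers version_tuple and is_at_least are hypothetical, and "3.20.0.1" is only an example string. Note that 3.22.0.2 is simply the version used in this test, not necessarily the first release containing the fix, and the parsing assumes purely numeric dotted version strings.

```python
def version_tuple(version):
    """Turn a dotted version string such as '3.22.0.2' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def is_at_least(installed, tested):
    """Compare two dotted version strings component by component."""
    return version_tuple(installed) >= version_tuple(tested)


# Version used in the test above (an assumption as a threshold, see note).
TESTED_VERSION = "3.22.0.2"

# With h2o installed, the installed version is available as h2o.__version__:
#     import h2o
#     is_at_least(h2o.__version__, TESTED_VERSION)
print(is_at_least("3.20.0.1", TESTED_VERSION))  # False
print(is_at_least("3.22.0.2", TESTED_VERSION))  # True
```

If the check comes back False, upgrading the h2o package is the first thing to try.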
