Cannot import .csv file as H2O dataframe

I am trying to import my .csv file (3,000 observations, 77 features) as an H2O dataframe while in a Spark session:

(1st way)

# Import the .csv file directly as an H2O dataframe
import h2o
h2o.init()
data_train = h2o.import_file('/u/users/vn505f6/data.csv')

However, I am getting the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/u/users/svcssae/pyenv/prod_python_libs/lib/python2.7/site-packages/h2o/frame.py", line 102, in __init__
    column_names, column_types, na_strings)
  File "/u/users/svcssae/pyenv/prod_python_libs/lib/python2.7/site-packages/h2o/frame.py", line 143, in _upload_python_object
    self._upload_parse(tmp_path, destination_frame, 1, separator, column_names, column_types, na_strings)
  File "/u/users/svcssae/pyenv/prod_python_libs/lib/python2.7/site-packages/h2o/frame.py", line 319, in _upload_parse
    self._parse(rawkey, destination_frame, header, sep, column_names, column_types, na_strings)
  File "/u/users/svcssae/pyenv/prod_python_libs/lib/python2.7/site-packages/h2o/frame.py", line 326, in _parse
    return self._parse_raw(setup)
  File "/u/users/svcssae/pyenv/prod_python_libs/lib/python2.7/site-packages/h2o/frame.py", line 355, in _parse_raw
    self._ex._cache.fill()
  File "/u/users/svcssae/pyenv/prod_python_libs/lib/python2.7/site-packages/h2o/expr.py", line 346, in fill
    res = h2o.api("GET " + endpoint % self._id, data=req_params)["frames"][0]
  File "/u/users/svcssae/pyenv/prod_python_libs/lib/python2.7/site-packages/h2o/h2o.py", line 103, in api
    return h2oconn.request(endpoint, data=data, json=json, filename=filename, save_to=save_to)
  File "/u/users/svcssae/pyenv/prod_python_libs/lib/python2.7/site-packages/h2o/backend/connection.py", line 402, in request
    return self._process_response(resp, save_to)
  File "/u/users/svcssae/pyenv/prod_python_libs/lib/python2.7/site-packages/h2o/backend/connection.py", line 725, in _process_response
    raise H2OResponseError(data)
h2o.exceptions.H2OResponseError: Server error water.exceptions.H2OIllegalArgumentException:
  Error: Unknown parameter: full_column_count
  Request: GET /3/Frames/Key_Frame__upload_84df978b98892632a7ce19303c4440f3.hex
    params: {u'row_offset': '0', u'row_count': '10', u'full_column_count': '-1', u'column_count': '-1', u'column_offset': '0'}

Note that I get no error when I do this on my local machine. The error above appears only when I do the same thing on a Spark/Hadoop cluster.
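The `Unknown parameter: full_column_count` message is the server rejecting a request parameter that the Python client sent. That is what one would expect if the `h2o` Python package and the H2O cluster it connects to come from different releases, although this is only a guess for this setup. A minimal sketch for checking it follows. The helper names are hypothetical, and `h2o.__version__` and `h2o.cluster().version` are assumed to be available in the installed release.

```python
def versions_match(client_version, server_version):
    """Return True when the two H2O version strings are identical."""
    return client_version.strip() == server_version.strip()


def describe_versions(client_version, server_version):
    """Build a short report comparing client and server versions."""
    client = client_version.strip()
    server = server_version.strip()
    if versions_match(client, server):
        return "client and server both run H2O %s" % client
    return "version mismatch: client %s, server %s" % (client, server)


# With a live connection the inputs would come from the h2o package
# (assumption: both attributes exist in the installed release):
#   import h2o
#   print(describe_versions(h2o.__version__, h2o.cluster().version))
```

If the two versions differ, aligning the Python package with the cluster (or the other way round) would be the first thing to try.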

Alternatively, I tried the following on the Spark cluster:

(2nd way)

from pysparkling import H2OContext
from ssat_utils.spark import SparkUtilities
import h2o

h2o_context = H2OContext.getOrCreate(SparkUtilities.spark)
data_train = h2o.import_file('/u/users/vn505f6/data.csv')

and then I got the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/u/users/svcssae/pyenv/prod_python_libs/lib/python2.7/site-packages/h2o/h2o.py", line 414, in import_file
    return H2OFrame()._import_parse(path, pattern, destination_frame, header, sep, col_names, col_types, na_strings)
  File "/u/users/svcssae/pyenv/prod_python_libs/lib/python2.7/site-packages/h2o/frame.py", line 311, in _import_parse
    rawkey = h2o.lazy_import(path, pattern)
  File "/u/users/svcssae/pyenv/prod_python_libs/lib/python2.7/site-packages/h2o/h2o.py", line 282, in lazy_import
    return _import(path, pattern)
  File "/u/users/svcssae/pyenv/prod_python_libs/lib/python2.7/site-packages/h2o/h2o.py", line 291, in _import
    if j["fails"]: raise ValueError("ImportFiles of " + path + " failed on " + str(j["fails"]))
ValueError: ImportFiles of /u/users/vn505f6/data.csv failed on [u'/u/users/vn505f6/data.csv']
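One possible reading of this failure: `h2o.import_file` asks the H2O nodes themselves to read the path, so a file that exists only on the machine running the Python client may not be visible to nodes running inside YARN containers. `h2o.upload_file`, by contrast, reads the file on the client and pushes it to the cluster. The sketch below picks between the two. `load_csv` is a hypothetical helper, the `h2o` module is passed in as an argument so the logic can be exercised without a cluster, and treating any path containing `://` as a remote URI is a simplifying assumption.

```python
import os


def load_csv(h2o_module, path):
    """Load `path` into H2O, choosing which side reads the file."""
    has_scheme = "://" in path  # e.g. hdfs://... or s3://...
    if not has_scheme and os.path.isfile(path):
        # The file exists on the client: push its contents to the cluster.
        return h2o_module.upload_file(path)
    # Otherwise let the H2O nodes read the location themselves.
    return h2o_module.import_file(path)


# Usage with a live connection (not run here):
#   import h2o
#   data_train = load_csv(h2o, '/u/users/vn505f6/data.csv')
```

Copying the file to HDFS and passing an `hdfs://` URI to `h2o.import_file` would be the other option, since every node can then reach it.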

The column names of the pandas dataframe are strings such as u_cnt_days_with_sale_14day.

What is this error about and how can I fix this?

PS

These are the shell commands that create the Spark cluster/context:

SPARK_HOME=/u/users/******/spark-2.3.0 \
Q_CORE_LOC=/u/users/******/q-core \
ENV=local \
HIVE_HOME=/usr/hdp/current/hive-client \
SPARK2_HOME=/u/users/******/spark-2.3.0 \
HADOOP_CONF_DIR=/etc/hadoop/conf \
HIVE_CONF_DIR=/etc/hive/conf \
HDFS_PREFIX=hdfs:// \
PYTHONPATH=/u/users/******/q-core/python-lib:/u/users/******/three-queues/python-lib:/u/users/******/pyenv/prod_python_libs/lib/python2.7/site-packages/:$PYTHON_PATH \
YARN_HOME=/usr/hdp/current/hadoop-yarn-client \
SPARK_DIST_CLASSPATH=$(hadoop classpath):$(yarn classpath):/etc/hive/conf/hive-site.xml \
PYSPARK_PYTHON=/usr/bin/python2.7 \
QQQ_LOC=/u/users/******/three-queues \
/u/users/******/spark-2.3.0/bin/pyspark \
--master yarn \
--executor-memory 10g \
--num-executors 128 \
--executor-cores 10 \
--conf spark.port.maxRetries=80 \
--conf spark.dynamicAllocation.enabled=False \
--conf spark.default.parallelism=6000 \
--conf spark.sql.shuffle.partitions=6000 \
--principal ************************ \
--queue default \
--name interactive_H2O_MT \
--keytab /u/users/******/.******.keytab \
--driver-memory 10g

In the end, I first imported the .csv file as a pandas dataframe and then converted it to an H2O dataframe:

from pysparkling import H2OContext
from ssat_utils.spark import SparkUtilities
import h2o
import pandas as pd

h2o_context = H2OContext.getOrCreate(SparkUtilities.spark)
data_train = pd.read_csv('/u/users/vn505f6/data.csv')
data_train = h2o.H2OFrame(data_train)

I do not really know why this worked when the two ways of directly importing the .csv file as an H2O dataframe shown earlier in my post did not.
