I am trying to import a .csv file with 3000 observations and 77 features as an H2O dataframe (while in a Spark session):
(1st way)
# Convert pandas dataframe to H2O dataframe
import h2o
h2o.init()
data_train = h2o.import_file('/u/users/vn505f6/data.csv')
However, I am getting the following error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/u/users/svcssae/pyenv/prod_python_libs/lib/python2.7/site-packages/h2o/frame.py", line 102, in __init__
column_names, column_types, na_strings)
File "/u/users/svcssae/pyenv/prod_python_libs/lib/python2.7/site-packages/h2o/frame.py", line 143, in _upload_python_object
self._upload_parse(tmp_path, destination_frame, 1, separator, column_names, column_types, na_strings)
File "/u/users/svcssae/pyenv/prod_python_libs/lib/python2.7/site-packages/h2o/frame.py", line 319, in _upload_parse
self._parse(rawkey, destination_frame, header, sep, column_names, column_types, na_strings)
File "/u/users/svcssae/pyenv/prod_python_libs/lib/python2.7/site-packages/h2o/frame.py", line 326, in _parse
return self._parse_raw(setup)
File "/u/users/svcssae/pyenv/prod_python_libs/lib/python2.7/site-packages/h2o/frame.py", line 355, in _parse_raw
self._ex._cache.fill()
File "/u/users/svcssae/pyenv/prod_python_libs/lib/python2.7/site-packages/h2o/expr.py", line 346, in fill
res = h2o.api("GET " + endpoint % self._id, data=req_params)["frames"][0]
File "/u/users/svcssae/pyenv/prod_python_libs/lib/python2.7/site-packages/h2o/h2o.py", line 103, in api
return h2oconn.request(endpoint, data=data, json=json, filename=filename, save_to=save_to)
File "/u/users/svcssae/pyenv/prod_python_libs/lib/python2.7/site-packages/h2o/backend/connection.py", line 402, in request
return self._process_response(resp, save_to)
File "/u/users/svcssae/pyenv/prod_python_libs/lib/python2.7/site-packages/h2o/backend/connection.py", line 725, in _process_response
raise H2OResponseError(data)
h2o.exceptions.H2OResponseError: Server error water.exceptions.H2OIllegalArgumentException:
Error: Unknown parameter: full_column_count
Request: GET /3/Frames/Key_Frame__upload_84df978b98892632a7ce19303c4440f3.hex
params: {u'row_offset': '0', u'row_count': '10', u'full_column_count': '-1', u'column_count': '-1', u'column_offset': '0'}
Note that I get no error when I do this on my local machine. The error above appears only when I do the same thing on a Spark/Hadoop cluster.
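One possible reading of the error above: the server rejects `full_column_count`, a parameter the Python client sends with the request. That would be consistent with the h2o Python package and the H2O backend on the cluster being different versions. The sketch below is one way to check this, assuming a connection to the cluster is already established. `same_h2o_version` and `check_running_cluster` are hypothetical helper names, not part of h2o.

```python
def same_h2o_version(client_version, server_version):
    """Return True when the two version strings are identical after trimming whitespace."""
    return client_version.strip() == server_version.strip()


def check_running_cluster():
    """Compare the installed h2o package version with the connected backend's version."""
    # Imported lazily so the comparison helper above works without h2o installed.
    import h2o

    client = h2o.__version__          # version of the Python package
    server = h2o.cluster().version    # version reported by the connected backend
    if not same_h2o_version(client, server):
        print("h2o client %s is talking to H2O server %s" % (client, server))
    return client, server
```

Calling `check_running_cluster()` right after `h2o.init()` or `H2OContext.getOrCreate(...)` would show whether the two sides agree. If they differ, aligning the Python package with the backend version is the usual first thing to try.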
Alternatively, I tried the following on the Spark cluster:
(2nd way)
from pysparkling import H2OContext
from ssat_utils.spark import SparkUtilities
import h2o
h2o_context = H2OContext.getOrCreate(SparkUtilities.spark)
data_train = h2o.import_file('/u/users/vn505f6/data.csv')
and then I got the following error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/u/users/svcssae/pyenv/prod_python_libs/lib/python2.7/site-packages/h2o/h2o.py", line 414, in import_file
return H2OFrame()._import_parse(path, pattern, destination_frame, header, sep, col_names, col_types, na_strings)
File "/u/users/svcssae/pyenv/prod_python_libs/lib/python2.7/site-packages/h2o/frame.py", line 311, in _import_parse
rawkey = h2o.lazy_import(path, pattern)
File "/u/users/svcssae/pyenv/prod_python_libs/lib/python2.7/site-packages/h2o/h2o.py", line 282, in lazy_import
return _import(path, pattern)
File "/u/users/svcssae/pyenv/prod_python_libs/lib/python2.7/site-packages/h2o/h2o.py", line 291, in _import
if j["fails"]: raise ValueError("ImportFiles of " + path + " failed on " + str(j["fails"]))
ValueError: ImportFiles of /u/users/vn505f6/data.csv failed on [u'/u/users/vn505f6/data.csv']
The column names of the pandas dataframe are strings like the following: u_cnt_days_with_sale_14day.
What are these errors about and how can I fix them?
PS
These are the shell commands that create the Spark cluster/context:
SPARK_HOME=/u/users/******/spark-2.3.0 \
Q_CORE_LOC=/u/users/******/q-core \
ENV=local \
HIVE_HOME=/usr/hdp/current/hive-client \
SPARK2_HOME=/u/users/******/spark-2.3.0 \
HADOOP_CONF_DIR=/etc/hadoop/conf \
HIVE_CONF_DIR=/etc/hive/conf \
HDFS_PREFIX=hdfs:// \
PYTHONPATH=/u/users/******/q-core/python-lib:/u/users/******/three-queues/python-lib:/u/users/******/pyenv/prod_python_libs/lib/python2.7/site-packages/:$PYTHON_PATH \
YARN_HOME=/usr/hdp/current/hadoop-yarn-client \
SPARK_DIST_CLASSPATH=$(hadoop classpath):$(yarn classpath):/etc/hive/conf/hive-site.xml \
PYSPARK_PYTHON=/usr/bin/python2.7 \
QQQ_LOC=/u/users/******/three-queues \
/u/users/******/spark-2.3.0/bin/pyspark \
--master yarn \
--executor-memory 10g \
--num-executors 128 \
--executor-cores 10 \
--conf spark.port.maxRetries=80 \
--conf spark.dynamicAllocation.enabled=False \
--conf spark.default.parallelism=6000 \
--conf spark.sql.shuffle.partitions=6000 \
--principal ************************ \
--queue default \
--name interactive_H2O_MT \
--keytab /u/users/******/.******.keytab \
--driver-memory 10g
In the end, I first imported the .csv file as a pandas dataframe and then converted it to an H2O dataframe:
from pysparkling import H2OContext
from ssat_utils.spark import SparkUtilities
import h2o
import pandas as pd
h2o_context = H2OContext.getOrCreate(SparkUtilities.spark)
data_train = pd.read_csv('/u/users/vn505f6/data.csv')
data_train = h2o.H2OFrame(data_train)
I do not really know why this worked while directly importing the .csv file as an H2O dataframe, in the two different ways shown above, did not.
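One plausible explanation, offered as an assumption rather than a confirmed diagnosis: `h2o.import_file` asks the H2O nodes to read the path themselves. A file that exists only on the machine running the Python session may therefore not be found by nodes running elsewhere on the cluster. `pd.read_csv` followed by `h2o.H2OFrame` instead reads the file in the client and uploads the data. `h2o.upload_file` pushes a local file from the client in a similar way. The sketch below picks between the two calls with a deliberately simple heuristic. `is_cluster_visible` and `load_frame` are hypothetical helper names, and the `hdfs://` path in the usage note is made up for illustration.

```python
def is_cluster_visible(path):
    """Heuristic: a path with a non-file URI scheme (e.g. hdfs://) is readable by the H2O nodes."""
    return "://" in path and not path.startswith("file://")


def load_frame(path):
    """Load a CSV into an H2OFrame, choosing import or upload from the path's form."""
    # Imported lazily so the heuristic above works without h2o installed.
    import h2o

    if is_cluster_visible(path):
        # Server-side read: every H2O node must be able to reach this path.
        return h2o.import_file(path)
    # Client-side push: the file is read locally and sent to the cluster.
    return h2o.upload_file(path)
```

With this sketch, `load_frame('/u/users/vn505f6/data.csv')` would upload the file from the client, while something like `load_frame('hdfs:///some/dir/data.csv')` would let the cluster read it directly.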