簡體   English   中英

無法將.csv文件導入為H2O數據框

[英]Cannot import .csv file as H2O dataframe

我正在嘗試將我的3000個觀測值和77個功能.csv文件導入為H2O數據幀(在Spark會話中):

(第一種方式)

# Convert pandas dataframe to H2O dataframe
import h2o
h2o.init()
data_train = h2o.import_file('/u/users/vn505f6/data.csv')

但是,我收到以下錯誤:

   Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/u/users/svcssae/pyenv/prod_python_libs/lib/python2.7/site-packages/h2o/frame.py", line 102, in __init__
    column_names, column_types, na_strings)
  File "/u/users/svcssae/pyenv/prod_python_libs/lib/python2.7/site-packages/h2o/frame.py", line 143, in _upload_python_object
    self._upload_parse(tmp_path, destination_frame, 1, separator, column_names, column_types, na_strings)
  File "/u/users/svcssae/pyenv/prod_python_libs/lib/python2.7/site-packages/h2o/frame.py", line 319, in _upload_parse
    self._parse(rawkey, destination_frame, header, sep, column_names, column_types, na_strings)
  File "/u/users/svcssae/pyenv/prod_python_libs/lib/python2.7/site-packages/h2o/frame.py", line 326, in _parse
    return self._parse_raw(setup)
  File "/u/users/svcssae/pyenv/prod_python_libs/lib/python2.7/site-packages/h2o/frame.py", line 355, in _parse_raw
    self._ex._cache.fill()
  File "/u/users/svcssae/pyenv/prod_python_libs/lib/python2.7/site-packages/h2o/expr.py", line 346, in fill
    res = h2o.api("GET " + endpoint % self._id, data=req_params)["frames"][0]
  File "/u/users/svcssae/pyenv/prod_python_libs/lib/python2.7/site-packages/h2o/h2o.py", line 103, in api
    return h2oconn.request(endpoint, data=data, json=json, filename=filename, save_to=save_to)
  File "/u/users/svcssae/pyenv/prod_python_libs/lib/python2.7/site-packages/h2o/backend/connection.py", line 402, in request
    return self._process_response(resp, save_to)
  File "/u/users/svcssae/pyenv/prod_python_libs/lib/python2.7/site-packages/h2o/backend/connection.py", line 725, in _process_response
    raise H2OResponseError(data)
h2o.exceptions.H2OResponseError: Server error water.exceptions.H2OIllegalArgumentException:
  Error: Unknown parameter: full_column_count
  Request: GET /3/Frames/Key_Frame__upload_84df978b98892632a7ce19303c4440f3.hex
    params: {u'row_offset': '0', u'row_count': '10', u'full_column_count': '-1', u'column_count': '-1', u'column_offset': '0'}

讓我注意到,當我在本地計算機上執行此操作時,沒有任何錯誤。 在Spark / Hadoop集群上執行相同操作時,出現上述錯誤。

另外,我嘗試在Spark集群中執行以下操作:

(第二種方式)

from pysparkling import H2OContext
from ssat_utils.spark import SparkUtilities
import h2o

h2o_context = H2OContext.getOrCreate(SparkUtilities.spark)
data_train = h2o.import_file('/u/users/vn505f6/data.csv')

然后出現以下錯誤:

Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
 File "/u/users/svcssae/pyenv/prod_python_libs/lib/python2.7/site-packages/h2o/h2o.py", line 414, in import_file
   return H2OFrame()._import_parse(path, pattern, destination_frame, header, sep, col_names, col_types, na_strings)
 File "/u/users/svcssae/pyenv/prod_python_libs/lib/python2.7/site-packages/h2o/frame.py", line 311, in _import_parse
   rawkey = h2o.lazy_import(path, pattern)
 File "/u/users/svcssae/pyenv/prod_python_libs/lib/python2.7/site-packages/h2o/h2o.py", line 282, in lazy_import
   return _import(path, pattern)
 File "/u/users/svcssae/pyenv/prod_python_libs/lib/python2.7/site-packages/h2o/h2o.py", line 291, in _import
   if j["fails"]: raise ValueError("ImportFiles of " + path + " failed on " + str(j["fails"]))
ValueError: ImportFiles of /u/users/vn505f6/data.csv failed on [u'/u/users/vn505f6/data.csv']

pandas數據u_cnt_days_with_sale_14day的列名稱是類似於以下字符串: u_cnt_days_with_sale_14day

這是什么錯誤,我該如何解決?

PS

這些是創建Spark集群/上下文的命令行命令:

SPARK_HOME=/u/users/******/spark-2.3.0 \
Q_CORE_LOC=/u/users/******/q-core \
ENV=local \
HIVE_HOME=/usr/hdp/current/hive-client \
SPARK2_HOME=/u/users/******/spark-2.3.0 \
HADOOP_CONF_DIR=/etc/hadoop/conf \
HIVE_CONF_DIR=/etc/hive/conf \
HDFS_PREFIX=hdfs:// \
PYTHONPATH=/u/users/******/q-core/python-lib:/u/users/******/three-queues/python-lib:/u/users/******/pyenv/prod_python_libs/lib/python2.7/site-packages/:$PYTHON_PATH \
YARN_HOME=/usr/hdp/current/hadoop-yarn-client \
SPARK_DIST_CLASSPATH=$(hadoop classpath):$(yarn classpath):/etc/hive/conf/hive-site.xml \
PYSPARK_PYTHON=/usr/bin/python2.7 \
QQQ_LOC=/u/users/******/three-queues \
/u/users/******/spark-2.3.0/bin/pyspark \
--master yarn \
--executor-memory 10g \
--num-executors 128 \
--executor-cores 10 \
--conf spark.port.maxRetries=80 \
--conf spark.dynamicAllocation.enabled=False \
--conf spark.default.parallelism=6000 \
--conf spark.sql.shuffle.partitions=6000 \
--principal ************************ \
--queue default \
--name interactive_H2O_MT \
--keytab /u/users/******/.******.keytab \
--driver-memory 10g

最后,我要做的是首先將.csv文件作為pandas數據框導入,然后將其轉換為H2O數據框:

from pysparkling import H2OContext
from ssat_utils.spark import SparkUtilities
import h2o
import pandas as pd

h2o_context = H2OContext.getOrCreate(SparkUtilities.spark)
data_train = pd.read_csv('/u/users/vn505f6/data.csv')
data_train = h2o.H2OFrame(data_train)

我真的不知道為什么這能起作用,同時以我的帖子上方的兩種不同方式直接將.csv文件作為H2O數據框導入是行不通的。

暫無
暫無

聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.

 
粵ICP備18138465號  © 2020-2024 STACKOOM.COM