Reading an HBase table with Spark

Question

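I have a "gazelle" table with 216 columns, and I want to get some of those columns into a JavaPairRDD. I tried following these links:

How to read from HBase using Spark, and this one: How to get all data from an HBase table in Spark

To import all the jars, I had to add these dependencies to my pom file: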

    <?xml version="1.0" encoding="UTF-8"?>
    <project xmlns="http://maven.apache.org/POM/4.0.0"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
        <modelVersion>4.0.0</modelVersion>

        <groupId>fr.aid.cim</groupId>
        <artifactId>spark-poc</artifactId>
        <version>1.0-SNAPSHOT</version>

        <dependencies>
            <dependency>
                <groupId>org.apache.spark</groupId>
                <artifactId>spark-core_2.10</artifactId>
                <version>1.1.0</version>
            </dependency>

            <dependency>
                <groupId>org.apache.hbase</groupId>
                <artifactId>hbase-client</artifactId>
                <version>0.96.0-hadoop2</version>
            </dependency>

            <dependency>
                <groupId>org.apache.hbase</groupId>
                <artifactId>hbase</artifactId>
                <version>0.20.6</version>
            </dependency>
        </dependencies>

    </project>

Here is my code:

    SparkConf sparkConf = new SparkConf().setAppName("JavaWordCount");
    JavaSparkContext ctx = new JavaSparkContext(sparkConf);
    //JavaSQLContext jsql = new JavaSQLContext(sc);
    //test hbase table
    HBaseConfiguration conf = new HBaseConfiguration();
    conf.set("hbase.zookeeper.quorum", "192.168.10.32");
    conf.set("hbase.zookeeper.property.clientPort", "2181");
    conf.set("hbase.master", "192.168.10.32" + ":60000");
    conf.set("hbase.cluster.distributed", "true");
    conf.set("hbase.rootdir", "hdfs://localhost:8020/hbase");

    //conf.set(TableInputFormat.INPUT_TABLE, "gazelle_hive4");
    String tableName = "gazelle_hbase4";
    HTable table = new HTable(conf, tableName);
    JavaPairRDD<ImmutableBytesWritable, Result> hBaseRDD = ctx
            .newAPIHadoopRDD(
                    conf,
                    TableInputFormat.class,
                    org.apache.hadoop.hbase.io.ImmutableBytesWritable.class,
                    org.apache.hadoop.hbase.client.Result.class);
    hBaseRDD.coalesce(1, true).saveAsTextFile(path + "hBaseRDD");

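But I have a problem with "TableInputFormat".

Error: Cannot resolve symbol TableInputFormat. Is there another library I should import, or another dependency I should add?

Note: I have not created any XML files. Should I create "hbase-default.xml" and "hbase-site.xml"? If so, how?

Thanks in advance for your help.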

Answer 1

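According to this thread on the Apache Spark user list, you may need a few more things.

If the error occurs at runtime, you should explicitly pass the HBase jars to Spark.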

spark-submit --driver-class-path $(hbase classpath) --jars /usr/lib/hbase/hbase-server.jar,/usr/lib/hbase/hbase-client.jar,/usr/lib/hbase/hbase-common.jar,/usr/lib/hbase/hbase-protocol.jar,/usr/lib/hbase/lib/protobuf-java-2.5.0.jar,/usr/lib/hbase/lib/htrace-core.jar --class YourClassName --master local App.jar

If the error occurs at compile time, you are probably missing a dependency (hbase-server, as mentioned in the thread).
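
To tell the two cases apart, it can help to check whether the class is visible to the JVM at all. The following is a minimal sketch, not something from the thread: `ClasspathCheck` and `isOnClasspath` are hypothetical names introduced here for illustration. It assumes the class in question is `org.apache.hadoop.hbase.mapreduce.TableInputFormat`, which is the fully qualified name of the HBase MapReduce input format. The check uses only the standard library, so it compiles even when no HBase jar is present.

```java
// Hypothetical diagnostic helper (not part of Spark or HBase).
final class ClasspathCheck {
    // Fully qualified name of the class the question fails to resolve.
    static final String TABLE_INPUT_FORMAT =
            "org.apache.hadoop.hbase.mapreduce.TableInputFormat";

    // Returns true if the named class can be located by this class's loader.
    static boolean isOnClasspath(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers.
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // Either the class is absent or one of its own dependencies is missing.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(TABLE_INPUT_FORMAT + " on classpath: "
                + isOnClasspath(TABLE_INPUT_FORMAT));
    }
}
```

If this prints `false` when launched the same way as the Spark job, the jar providing the class is not reaching the JVM, which points to the `--jars` / `--driver-class-path` options rather than to the pom.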

Answer 1
1 accepted 2014-11-25 13:48:33
