
How to use a SparkSession in a dataframe write in pyspark using spark-cassandra-connector

I am using pyspark and spark-cassandra-connector_2.11-2.3.0.jar with a Cassandra DB. I am reading a dataframe from one keyspace and writing it to a different keyspace. These two keyspaces have different usernames and passwords.

I created the SparkSession using:

spark_session = None

def set_up_spark(sparkconf,config):
    """
    sets up spark configuration and create a session
    :return: None
    """
    try:
        logger.info("spark conf set up Started")
        global spark_session
        spark_conf = SparkConf()
        for key, val in sparkconf.items():
            spark_conf.set(key, val)
        spark_session = SparkSession.builder.config(conf=spark_conf).getOrCreate()
        logger.info("spark conf set up Completed")
    except Exception as e:
        raise e
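For reference, the `sparkconf` argument is just a plain dict of Spark properties; a minimal illustrative example might look like this (the app name and jar path below are assumptions, not taken from the question):

```python
# Illustrative Spark properties to pass to set_up_spark (values are assumptions).
sparkconf = {
    "spark.app.name": "cassandra-copy-job",
    # Connector jar from the question; the actual path depends on your deployment.
    "spark.jars": "spark-cassandra-connector_2.11-2.3.0.jar",
}
# This dict would then be passed as: set_up_spark(sparkconf, config)
```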

I used this SparkSession to read data into a dataframe:

table_df = spark_session.read \
            .format("org.apache.spark.sql.cassandra") \
            .options(table=table_name, keyspace=keyspace_name) \
            .load()

I am able to read data using the above session; spark_session is attached to that query.

Now I need to create another session, since the credentials for the write table are different. I have the write query as:

table_df.write \
            .format("org.apache.spark.sql.cassandra") \
            .options(table=table_name, keyspace=keyspace_name) \
            .mode("append") \
            .save()

I couldn't find how to attach a new SparkSession to the above Cassandra write operation.

How do I attach a new SparkSession for a write operation in pyspark with spark-cassandra-connector?

You can simply pass that information as options to the specific read or write operation. This includes settings such as spark.cassandra.connection.host, spark.cassandra.auth.username, and spark.cassandra.auth.password.

Please note that you'll need to put these options into a dictionary and unpack that dictionary into .options() rather than passing them as separate keyword arguments, as described in the documentation.

read_options = { "table": "..", "keyspace": "..", 
  "spark.cassandra.connection.host": "IP1", 
  "spark.cassandra.auth.username": "username1", 
  "spark.cassandra.auth.password":"password1"}
table_df = spark_session.read \
            .format("org.apache.spark.sql.cassandra") \
            .options(**read_options) \
            .load()

write_options = { "table": "..", "keyspace": "..",
  "spark.cassandra.connection.host": "IP2",
  "spark.cassandra.auth.username": "username2",
  "spark.cassandra.auth.password": "password2"}
table_df.write \
            .format("org.apache.spark.sql.cassandra") \
            .options(**write_options) \
            .mode("append") \
            .save()
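To avoid repeating the connection settings, the two option dicts above can also be produced by a small helper. This is just a sketch: the function name is my own, and the hosts, usernames, and passwords are placeholders carried over from the example above.

```python
def cassandra_options(table, keyspace, host, username, password):
    """Build the option dict that .options(**...) expects for one Cassandra cluster."""
    return {
        "table": table,
        "keyspace": keyspace,
        "spark.cassandra.connection.host": host,
        "spark.cassandra.auth.username": username,
        "spark.cassandra.auth.password": password,
    }

# One dict per cluster, each with its own credentials.
read_options = cassandra_options("src_table", "src_ks", "IP1", "username1", "password1")
write_options = cassandra_options("dst_table", "dst_ks", "IP2", "username2", "password2")
```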
