
Converting SCALA === (triple equal) to Python for SPARK column

I have the following code in Scala that I need to convert to Python:


import org.apache.spark.sql.functions.{col, lit}
import org.apache.spark.sql.{Column, DataFrame, Dataset}

object SearchTermReader {

  def read(
    searchTermsInputTable: DataFrame,
    brand: String,
    posa: String,
    startDate: String,
    endDate: String
  ): Dataset[SearchTerm] = {

    import searchTermsInputTable.sparkSession.implicits._

    val conditionsNoEndDate = getConditions(brand, posa, startDate)
    val searchTermsNoEndDate = searchTermsInputTable
      .where(conditionsNoEndDate)
      .cache()

    searchTermsNoEndDate.count()

    val columnNames = SparkExtensions.getColumns[SearchTerm]

    searchTermsNoEndDate
      .filter(col("report_date").leq(lit(endDate)))
      .select(columnNames: _*)
      .as[SearchTerm]
  }

  def getConditions(
    brand: String,
    posa: String,
    startDate: String
  ): Column = {

    val filterByBrandCondition: Column = {
      if (brand.equals("")) {
        lit(true)
      } else {
        col("brand") === brand
      }
    }
    val filterByPosaCondition: Column = {
      if (posa.equals("")) {
        lit(true)
      } else {
        col("account_name").rlike(getAccountPattern(posa))
      }
    }

    filterByBrandCondition &&
      filterByPosaCondition &&
      col("search_engine") === "GOOGLE" &&
      col("impressions") > 0 &&
      col("report_date").geq(lit(startDate))
  }

  def getAccountPattern(countryCodes: String): String = {
    countryCodes.split(",").map(cc => s":G:$cc:").mkString("|")
  }
}

There seem to be two issues here for a straight conversion:

  1. Dataset is used, which is not supported by PySpark
  2. === is used on Column, which is also not supported

How can I overcome this and convert it to Python?

If you are referring to a column of the DataFrame, then you can use it like below.

df.filter((col("brand") == "BRAND") & (...))
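
For completeness, a minimal runnable sketch of chaining several of the question's conditions this way; note that & replaces Scala's && and each condition needs its own parentheses. The sample data and the literal "BRAND" are placeholders, not taken from the original post.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# Toy DataFrame with a few of the question's columns (values are made up)
df = spark.createDataFrame(
    [("BRAND", "GOOGLE", 10), ("OTHER", "BING", 0)],
    ["brand", "search_engine", "impressions"],
)

# Each condition is parenthesised and combined with &, PySpark's equivalent of Scala's &&
filtered = df.filter(
    (col("brand") == "BRAND")
    & (col("search_engine") == "GOOGLE")
    & (col("impressions") > 0)
)
filtered.show()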

PySpark doesn't support === the way Scala does.

In Scala, == delegates to the equals method, which compares the objects themselves rather than building a column expression. The definition of === depends on the context/object; for Spark's Column, === calls the equalTo method.

In PySpark, you use = (inside a SQL expression string) or == (on a Column). Having said that, in PySpark you can use any of the following to get the same result as your Scala code:

df.filter("Brand = 'BRAND'")

Or,

df.filter(df.Brand == 'BRAND')

Or,

df.filter(df["Brand"] == 'BRAND')

Or,

from pyspark.sql.functions import *
df.filter(col("Brand") == 'BRAND')
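
As a small illustration of that mapping: PySpark's Column overloads ==, so the comparison builds a column expression (the counterpart of Scala's === / equalTo) instead of comparing Python objects. A minimal sketch:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# == on a Column builds a column expression (Scala's === / equalTo);
# it does not compare Python objects
cond = col("Brand") == 'BRAND'
print(cond)  # prints something like: Column<'(Brand = BRAND)'>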

After further investigation into how to convert === in Scala to Python: for the particular cases above it is enough to use == in Python.

So

val filterByBrandCondition: Column = {
  if (brand.equals("")) {
    lit(true)
  } else {
    col("brand") === brand
  }
}

converts to

from pyspark.sql import Column
from pyspark.sql.functions import col, lit

if brand == "":
    filterByBrandCondition: Column = lit(True)
else:
    filterByBrandCondition: Column = col("brand") == brand

and

col("search_engine") === "GOOGLE"

to

col("search_engine") == "GOOGLE"

The use of Dataset can be replaced by DataFrame. The variable

val columnNames = SparkExtensions.getColumns[SearchTerm]

needs to be replaced by code that reads the column names from the DataFrame.
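
Putting the pieces together, here is a minimal sketch of what the converted functions could look like in PySpark. Dataset[SearchTerm] becomes a plain DataFrame, and the list that SparkExtensions.getColumns[SearchTerm] produced is replaced by a hand-written column_names list whose entries are illustrative guesses, not the actual SearchTerm fields:

from pyspark.sql import Column, DataFrame
from pyspark.sql.functions import col, lit


def get_account_pattern(country_codes: str) -> str:
    # Same regex as the Scala getAccountPattern, e.g. ":G:US:|:G:CA:"
    return "|".join(f":G:{cc}:" for cc in country_codes.split(","))


def get_conditions(brand: str, posa: str, start_date: str) -> Column:
    filter_by_brand = lit(True) if brand == "" else (col("brand") == brand)
    filter_by_posa = (
        lit(True) if posa == "" else col("account_name").rlike(get_account_pattern(posa))
    )
    # & replaces Scala's &&; every sub-condition gets its own parentheses
    return (
        filter_by_brand
        & filter_by_posa
        & (col("search_engine") == "GOOGLE")
        & (col("impressions") > 0)
        & (col("report_date") >= lit(start_date))  # Scala: .geq(lit(startDate))
    )


def read(search_terms_input_table: DataFrame, brand: str, posa: str,
         start_date: str, end_date: str) -> DataFrame:
    conditions_no_end_date = get_conditions(brand, posa, start_date)
    search_terms_no_end_date = search_terms_input_table.where(conditions_no_end_date).cache()
    search_terms_no_end_date.count()  # materialise the cache, as in the Scala version

    # Placeholder for SparkExtensions.getColumns[SearchTerm]: list the SearchTerm
    # fields by hand (illustrative names) or use search_terms_input_table.columns
    column_names = ["report_date", "brand", "account_name", "search_engine", "impressions"]

    return (
        search_terms_no_end_date
        .filter(col("report_date") <= lit(end_date))  # Scala: .leq(lit(endDate))
        .select(*column_names)
    )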
