
Strange behavior when using toDF() function to transform RDD to Dataframe in PySpark

I am new to Spark. When I use the toDF() function to convert an RDD to a DataFrame, it seems to execute all the transformation functions I wrote before it, such as map(). I wonder whether toDF() in PySpark is a transformation or an action.

I created a simple RDD and, just for testing, used a simple function that prints its input, then called toDF() after map(). Judging by the output, the function in map() is run partially. And when I show() the resulting DataFrame, toDF() acts like a transformation and the values are printed again.

>>> a = sc.parallelize([(1,),(2,),(3,)])
>>> def f(x):
...     print(x[0])
...     return (x[0] + 1, )
...
>>> b = a.map(f).toDF(["id"])
2
1
>>> b = a.map(f).toDF(["id"]).show()
2
1
1
2
3
+---+
| id|
+---+
|  2|
|  3|
|  4|
+---+

Could someone tell me why the toDF() function in PySpark acts like both an action and a transformation? Thanks a lot.

PS: In Scala, toDF acts like a transformation in my case.

That's not strange. Since you didn't provide the schema, Spark has to infer it based on the data. If the input is an RDD, it will call SparkSession._createFromRDD and subsequently SparkSession._inferSchema, which, if samplingRatio is missing, will evaluate up to 100 rows:

first = rdd.first()
if not first:
    raise ValueError("The first row in RDD is empty, "
                     "can not infer schema")
if type(first) is dict:
    warnings.warn("Using RDD of dict to inferSchema is deprecated. "
                  "Use pyspark.sql.Row instead")


if samplingRatio is None:
    schema = _infer_schema(first, names=names)
    if _has_nulltype(schema):
        for row in rdd.take(100)[1:]:
            schema = _merge_type(schema, _infer_schema(row, names=names))
            if not _has_nulltype(schema):
                break
        else:
            raise ValueError("Some of types cannot be determined by the "
                             "first 100 rows, please try again with sampling")

Now the only puzzle left is why it doesn't evaluate exactly one record. After all, in your case first is not empty and doesn't contain None.

That's because first is implemented through take, which doesn't guarantee that an exact number of items will be evaluated. If the first partition doesn't yield the required number of items, it iteratively increases the number of partitions to scan. Please check the take implementation for details.
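As a rough sketch of the same effect (reusing a and f from the question, and assuming the default partitioning), you can trigger this partial evaluation without toDF() at all:

# first() delegates to take(1); take() runs a job on the first partition
# and rescans with more partitions if it comes up short, so the print
# inside f may fire for several elements, and not necessarily in order.
a.map(f).first()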

If you want to avoid this, you should use createDataFrame and provide the schema, either as a DDL string:

spark.createDataFrame(a.map(f), "val: integer")

or as the equivalent StructType.
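For completeness, a minimal sketch of the StructType variant (the column name val matches the DDL string above):

from pyspark.sql.types import StructType, StructField, IntegerType

schema = StructType([StructField("val", IntegerType())])
spark.createDataFrame(a.map(f), schema)

With an explicit schema there is nothing to infer, so nothing is evaluated until you call an action such as show().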

You won't find any similar behavior in the Scala counterpart, because it doesn't use schema inference in toDF. It either retrieves the corresponding schema from the Encoder (which is fetched using Scala reflection), or doesn't allow the conversion at all. The closest similar behavior is schema inference on an input source like CSV or JSON:

spark.read.json(Seq("""{"foo": "bar"}""").toDS.map(x => { println(x); x }))
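A rough PySpark analogue of that JSON case (a sketch; spark.read.json also accepts an RDD of JSON strings):

# Schema inference has to read the data, so the input is evaluated
# eagerly here as well, just like toDF() on a plain RDD.
spark.read.json(sc.parallelize(['{"foo": "bar"}']))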
