How do you import and use Spark Koalas in Palantir Foundry
How can I, in Palantir Foundry, import and use the "Koalas: pandas API for Apache Spark" open-source Python package?
I know that you can import packages that don't already exist on the platform through a Code Repository, and I have done this before. Can I follow the same process for the Koalas package, or do I need to take another route?
I was able to use a Code Repository to upload a local clone of the package and then add the package in the platform using the steps detailed here: How to create Python libraries and how to import them in Palantir Foundry.
However, shortly afterwards the Palantir admins introduced an update that included the Koalas package as a native package on the platform. I have not yet had time to try using it for any major tasks.
Koalas is officially included in PySpark as **pandas API on Spark** as of Apache Spark 3.2. In Spark 3.2+, you no longer need to import koalas, as it ships with pyspark. The only required action is to add pandas and pyarrow, since these are required dependencies that Code Repositories don't include by default. You can do so via the Libraries tab.
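Adding libraries via the Libraries tab updates the repository's conda recipe behind the scenes. As a rough sketch, the run requirements would end up looking something like the excerpt below (the exact file path and whether versions are pinned are assumptions; check your own repository's conda recipe):

```yaml
# conda_recipe/meta.yaml (excerpt) -- illustrative sketch only
requirements:
  run:
    - python
    - pandas   # required by pandas API on Spark
    - pyarrow  # required for pandas<->Spark conversion
```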
You can confirm that it works using this test transform:
```python
from transforms.api import transform_df, Output


@transform_df(
    Output("OUTPUT_DATASET_PATH"),
)
def compute():
    import pyspark.pandas as ps

    # Build a pandas-on-Spark DataFrame and hand it back as a Spark DataFrame
    psdf = ps.DataFrame(
        {'a': [1, 2, 3, 4, 5, 6],
         'b': [100, 200, 300, 400, 500, 600],
         'c': ["one", "two", "three", "four", "five", "six"]},
        index=[10, 20, 30, 40, 50, 60])
    return psdf.to_spark()
```
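Since pandas API on Spark deliberately mirrors plain pandas semantics, you can prototype the same logic locally with pandas before moving it into a transform. A minimal local sketch (plain pandas, not Foundry code; the filter and column names are just illustrative):

```python
import pandas as pd

# Same DataFrame shape as the transform above, built with plain pandas
pdf = pd.DataFrame(
    {"a": [1, 2, 3, 4, 5, 6],
     "b": [100, 200, 300, 400, 500, 600],
     "c": ["one", "two", "three", "four", "five", "six"]},
    index=[10, 20, 30, 40, 50, 60])

# pandas-style filtering and aggregation; the pyspark.pandas DataFrame
# supports the same expression almost verbatim
result = pdf[pdf["a"] > 3]["b"].sum()
print(result)  # → 1500
```

Validating logic this way is cheap because it avoids spinning up a Spark session; once it behaves as expected, swap `pd` for `pyspark.pandas` inside the transform.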
To confirm that you are using Spark 3.2+ in your Code Repository, merge any pending upgrade PRs. Prior to Spark 3.2, it was possible to import koalas through the Libraries tab.