
A file format writable by python, readable as a Dataframe in Spark

I have Python scripts (no Spark involved) producing some data files that I want to be easily readable as DataFrames in a Scala/Spark application.

What's the best choice?

If your data doesn't contain newlines, then a simple text-based format such as TSV is probably best.
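As a minimal sketch of the producing side, a TSV file can be written from plain Python with the standard `csv` module (the column names and rows here are made up for illustration):

```python
import csv
import io

# Illustrative records; in practice these come from the Python script's output.
rows = [
    {"id": 1, "name": "alice", "score": 9.5},
    {"id": 2, "name": "bob", "score": 7.25},
]

# Write tab-separated values with a header row. A file handle opened with
# open(path, "w", newline="") would work the same way; StringIO keeps the
# example self-contained.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["id", "name", "score"], delimiter="\t")
writer.writeheader()
for row in rows:
    writer.writerow(row)

tsv_text = buf.getvalue()
print(tsv_text)
```

On the Spark side such a file can then be loaded with the built-in CSV reader, e.g. `spark.read.option("sep", "\t").option("header", "true").option("inferSchema", "true").csv(path)`.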

If you need to include binary data, then a serialised format like protobuf makes sense - anything for which a Hadoop InputFormat exists should be fine.
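To illustrate why plain TSV breaks down here, the following sketch shows length-prefixed binary records, the delimited-record framing that serialised formats such as protobuf rely on when several messages share one file. This is not protobuf itself, just the framing idea, and the helper names and payloads are made up:

```python
import struct

def write_records(payloads):
    """Pack each payload as a 4-byte big-endian length prefix plus the bytes."""
    out = bytearray()
    for payload in payloads:
        out += struct.pack(">I", len(payload))  # u32 length prefix
        out += payload
    return bytes(out)

def read_records(blob):
    """Inverse of write_records: yield each payload back in order."""
    offset = 0
    while offset < len(blob):
        (length,) = struct.unpack_from(">I", blob, offset)
        offset += 4
        yield blob[offset:offset + length]
        offset += length

# Payloads may contain newlines, tabs, or NUL bytes - none of which a
# text-based format like TSV could carry safely.
data = [b"first record", b"second \x00 binary \n record"]
blob = write_records(data)
assert list(read_records(blob)) == data
```

The length prefix is what lets a reader find record boundaries without relying on any separator character, which is exactly the property binary payloads need.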
