
TypeError converting a Pandas Dataframe to Spark Dataframe in Pyspark

I did my research but didn't find anything on this. I want to convert a simple pandas.DataFrame to a Spark dataframe, like this:

df = pd.DataFrame({'col1': ['a', 'b', 'c'], 'col2': [1, 2, 3]})
sc_sql.createDataFrame(df, schema=df.columns.tolist()) 

The error I get is:

TypeError: Can not infer schema for type: <class 'str'>

I tried something even simpler:

df = pd.DataFrame([1, 2, 3])
sc_sql.createDataFrame(df)

And I get:

TypeError: Can not infer schema for type: <class 'numpy.int64'>

Any help? Do I need to manually specify a schema?

sc_sql is a pyspark.sql.SQLContext; I am in a Jupyter notebook on Python 3.4 and Spark 1.6.

Thanks!

It's related to your Spark version; later Spark releases make type inference more intelligent. You can fix this by specifying the schema explicitly:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

mySchema = StructType([
    StructField("col1", StringType(), True),
    StructField("col2", IntegerType(), True),
])
sc_sql.createDataFrame(df, schema=mySchema)
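For context on why the error appears at all: Spark 1.6's schema inference does not understand numpy scalar types such as numpy.int64, and passing `schema=df.columns.tolist()` only supplies column names, not types. A minimal sketch of an alternative workaround (variable names here are illustrative, not from the original post) is to convert the frame's values to native Python types first, so inference can succeed on its own:

```python
import pandas as pd

# Spark 1.6 cannot infer a schema from numpy scalars (e.g. numpy.int64)
# that pandas stores internally. Converting each value to a native
# Python type sidesteps the inference failure without a StructType.
df = pd.DataFrame({'col1': ['a', 'b', 'c'], 'col2': [1, 2, 3]})

rows = [
    # numpy scalars expose .item() to unwrap to plain int/float;
    # ordinary Python strings pass through unchanged.
    tuple(v.item() if hasattr(v, 'item') else v for v in row)
    for row in df.itertuples(index=False)
]

print(rows)              # [('a', 1), ('b', 2), ('c', 3)]
print(type(rows[0][1]))  # <class 'int'>

# sc_sql.createDataFrame(rows, schema=df.columns.tolist()) can then
# infer the types from plain Python str/int values.
```

The explicit StructType above remains the more robust option, since it also documents the intended types rather than relying on inference.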
