PySpark: TypeError: StructType can not accept object in type <type 'unicode'> or <type 'str'>
I am reading data from a CSV file and then creating a DataFrame. But when I try to access the data in the DataFrame, I get a TypeError.
from pyspark.sql.types import StructField, StringType, StructType

# 'schema' here is a comma-separated header string, e.g. "a,b,diagnosis_code" (hypothetical names)
fields = [StructField(field_name, StringType(), True) for field_name in schema.split(',')]
schema = StructType(fields)
input_dataframe = sql_context.createDataFrame(input_data_1, schema)
print input_dataframe.filter(input_dataframe.diagnosis_code == '11').count()
Both 'unicode' and 'str' fail with the Spark DataFrame. I get the TypeError below:
TypeError: StructType can not accept object in type <type 'unicode'>
I tried encoding to 'utf-8' as below, but I still get the error, now complaining about a TypeError with 'str' instead:
input_data_2 = input_data_1.map(lambda x: x.encode("utf-8"))
input_dataframe = sql_context.createDataFrame(input_data_2, schema)
print input_dataframe.filter(input_dataframe.diagnosis_code == '410.11').count()
I also tried parsing the CSV directly as utf-8 or unicode using the parameter use_unicode=True/False.
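As a plain-Python sketch of what the encoding attempt changes (hypothetical data, no Spark needed): .encode() swaps the string's type, but each row is still a single value rather than a sequence of field values, so the schema mismatch remains.

```python
# Hypothetical CSV line, a stand-in for what sc.textFile() yields per record.
line = u"410.11,some description"
encoded = line.encode("utf-8")   # different string type, same shape: one value

# Encoding does not turn the line into a row sequence, which is why the
# error message merely switches from 'unicode' to 'str'.
assert not isinstance(encoded, (list, tuple))
```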
Reading between the lines: you are
reading data from a CSV file
and get
TypeError: StructType can not accept object in type <type 'unicode'>
This happens because you pass a string, not an object compatible with the struct. You probably pass data like:
input_data_1 = sc.parallelize(["1,foo,2", "2,bar,3"])
and a schema
schema = "x,y,z"
fields = [StructField(field_name, StringType(), True) for field_name in schema.split(',')]
schema = StructType(fields)
and you expect Spark to figure things out. But it doesn't work that way. You could
input_dataframe = sqlContext.createDataFrame(input_data_1.map(lambda s: s.split(",")), schema)
but honestly, just use the Spark CSV reader:
spark.read.schema(schema).csv("/path/to/file")
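The fix in the map step above can be sketched in plain Python (no Spark required; the sample lines and field names are the hypothetical ones from this answer): splitting each line produces one value per schema field, which is the shape StructType accepts.

```python
# Plain-Python sketch of the fix, using the hypothetical data from above.
input_lines = ["1,foo,2", "2,bar,3"]              # stand-in for the RDD
rows = [line.split(",") for line in input_lines]  # mirrors .map(lambda s: s.split(","))

field_names = "x,y,z".split(",")                  # the three StringType fields
assert all(len(row) == len(field_names) for row in rows)
print(rows)  # [['1', 'foo', '2'], ['2', 'bar', '3']]
```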