Impossible to read a CSV file with PySpark

I am trying to read a CSV file with PySpark using this code:

tr_df = spark.read.csv("/data/file.csv",
                       header=True, inferSchema=True
                      )
tr_df.head(5)

But I get this error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-53-03432bbf269d> in <module>
----> 1 tr_df.head(5)

~/anaconda3/envs/naboo-env/lib/python3.6/site-packages/pyspark/sql/dataframe.py in head(self, n)
   1250             rs = self.head(1)
   1251             return rs[0] if rs else None
-> 1252         return self.take(n)
   1253 
   1254     @ignore_unicode_prefix

~/anaconda3/envs/naboo-env/lib/python3.6/site-packages/pyspark/sql/dataframe.py in take(self, num)
    569         [Row(age=2, name=u'Alice'), Row(age=5, name=u'Bob')]
    570         """
--> 571         return self.limit(num).collect()
    572 
    573     @since(1.3)

~/anaconda3/envs/naboo-env/lib/python3.6/site-packages/pyspark/sql/dataframe.py in collect(self)
    532         with SCCallSiteSync(self._sc) as css:
    533             sock_info = self._jdf.collectToPython()
--> 534         return list(_load_from_socket(sock_info, BatchedSerializer(PickleSerializer())))
    535 
    536     @ignore_unicode_prefix

~/anaconda3/envs/naboo-env/lib/python3.6/site-packages/pyspark/serializers.py in load_stream(self, stream)
    145         while True:
    146             try:
--> 147                 yield self._read_with_length(stream)
    148             except EOFError:
    149                 return

~/anaconda3/envs/naboo-env/lib/python3.6/site-packages/pyspark/serializers.py in _read_with_length(self, stream)
    170         if len(obj) < length:
    171             raise EOFError
--> 172         return self.loads(obj)
    173 
    174     def dumps(self, obj):

~/anaconda3/envs/naboo-env/lib/python3.6/site-packages/pyspark/serializers.py in loads(self, obj, encoding)
    578     if sys.version >= '3':
    579         def loads(self, obj, encoding="bytes"):
--> 580             return pickle.loads(obj, encoding=encoding)
    581     else:
    582         def loads(self, obj, encoding=None):

~/anaconda3/envs/naboo-env/lib/python3.6/site-packages/pyspark/sql/types.py in _parse_datatype_json_string(json_string)
    867     >>> check_datatype(complex_maptype)
    868     """
--> 869     return _parse_datatype_json_value(json.loads(json_string))
    870 
    871 

~/anaconda3/envs/naboo-env/lib/python3.6/site-packages/pyspark/sql/types.py in _parse_datatype_json_value(json_value)
    884         tpe = json_value["type"]
    885         if tpe in _all_complex_types:
--> 886             return _all_complex_types[tpe].fromJson(json_value)
    887         elif tpe == 'udt':
    888             return UserDefinedType.fromJson(json_value)

~/anaconda3/envs/naboo-env/lib/python3.6/site-packages/pyspark/sql/types.py in fromJson(cls, json)
    575     @classmethod
    576     def fromJson(cls, json):
--> 577         return StructType([StructField.fromJson(f) for f in json["fields"]])
    578 
    579     def fieldNames(self):

~/anaconda3/envs/naboo-env/lib/python3.6/site-packages/pyspark/sql/types.py in <listcomp>(.0)
    575     @classmethod
    576     def fromJson(cls, json):
--> 577         return StructType([StructField.fromJson(f) for f in json["fields"]])
    578 
    579     def fieldNames(self):

~/anaconda3/envs/naboo-env/lib/python3.6/site-packages/pyspark/sql/types.py in fromJson(cls, json)
    432     def fromJson(cls, json):
    433         return StructField(json["name"],
--> 434                            _parse_datatype_json_value(json["type"]),
    435                            json["nullable"],
    436                            json["metadata"])

~/anaconda3/envs/naboo-env/lib/python3.6/site-packages/pyspark/sql/types.py in _parse_datatype_json_value(json_value)
    880             return DecimalType(int(m.group(1)), int(m.group(2)))
    881         else:
--> 882             raise ValueError("Could not parse datatype: %s" % json_value)
    883     else:
    884         tpe = json_value["type"]

ValueError: Could not parse datatype: decimal(17,-24)

Can anyone help me resolve this problem, please?

Thanks

It seems there is a problem with the datatype of one of your columns, which is why the error is thrown: schema inference produced decimal(17,-24), and a decimal with a negative scale cannot be parsed by PySpark. Remove the inferSchema=True option when reading. After reading the data, analyze the datatypes, make any corrections if needed, and then apply your own schema, as in the sketch below.
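A minimal sketch of that workflow, assuming the same file path; the column names and types in the explicit schema ("id", "amount") are placeholders for whatever your file actually contains:

from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# Read without inferSchema: every column arrives as a plain string,
# so the invalid decimal(17,-24) type is never inferred.
tr_df = spark.read.csv("/data/file.csv", header=True)
tr_df.printSchema()   # inspect the columns before choosing types

# Once the real types are known, pass an explicit schema instead.
schema = StructType([
    StructField("id", StringType(), True),      # hypothetical column
    StructField("amount", DoubleType(), True),  # hypothetical column
])
tr_df = spark.read.csv("/data/file.csv", header=True, schema=schema)
tr_df.head(5)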
