How to read a CSV file in Python Spark - Error
Can you help me find the error in this code? The file does exist, although I know it is being looked for in HDFS via sc.textFile("/user/spark/archivo.csv").
Why does this error occur?
Execution
export PYSPARK_PYTHON=python3
export PYSPARK_DRIVER_PYTHON=python3
spark-submit --queue=OID Proceso_Match1.py
Python
import os
import sys
from pyspark.sql import HiveContext, Row
from pyspark import SparkContext, SparkConf
from pyspark.sql.functions import *
from pyspark.sql import functions as F
from pyspark.sql.types import *
if __name__ == '__main__':
    conf = SparkConf().setAppName("Spark RDD").set("spark.speculation", "true")
    sc = SparkContext(conf=conf)
    sc.setLogLevel("OFF")
    sqlContext = HiveContext(sc)
    #rddCentral = sc.textFile("hdfs:///user/spark/archivo.csv")
    rddCentral = sc.textFile("/user/spark/archivo.csv")
    rddCentralMap = rddCentral.map(lambda line: line.split(","))
    print('paso 1')
    dfCentral = sqlContext.createDataFrame(rddCentralMap, ["ROWID_CDR","DURACION","FECHA_LLAMADA","FECHA_LLAMADA_2","MATCH"])
    dfCentral = dfCentral.withColumn("FECHA_LLAMADA_NUM", dfCentral.FECHA_LLAMADA_2.cast(IntegerType()))
    dfCentral = dfCentral.withColumn("DURACION_NUM", dfCentral.DURACION.cast(IntegerType()))
    dfCentral = dfCentral.withColumn("MATCH_NUM", dfCentral.MATCH.cast(IntegerType()))
    sc.stop()
Error log
22/09/30 12:49:14 INFO cluster.YarnClientSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.8
paso 1
/usr/local/bin/python3/lib/python3.7/site-packages/pandas/compat/__init__.py:124: UserWarning: Could not import the lzma module. Your installed Python is incomplete. Attempting to use lzma compression will result in a RuntimeError.
warnings.warn(msg)
Traceback (most recent call last):
File "/home/aic_proceso_vfs/rjaimea/vfs_504/bin/Proceso_Match1.py", line 21, in <module>
dfCentral = sqlContext.createDataFrame(rddCentralMap, ["ROWID_CDR","DURACION","FECHA_LLAMADA","FECHA_LLAMADA_2","MATCH"])
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, cl-hdp-cdp-dn7.cse-cph.int, executor 1): java.io.IOException: Cannot run program "python3": error=2, No such file or directory
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
Caused by: java.io.IOException: error=2, No such file or directory
... 16 more
File in HDFS
hdfs dfs -ls /user/spark
Found 3 items
drwxr-xr-x - spark hdfs 0 2022-07-25 10:11 /user/spark/.sparkStaging
-rw------- 3 hadoopadmin hdfs 21 2022-09-30 12:25 /user/spark/archivo.csv
drwxrwxrwt - spark spark 0 2022-09-30 12:33 /user/spark/driverLogs
I'm not certain, but it looks like you are misusing the schema of the DataFrame you are creating.
The line dfCentral = sqlContext.createDataFrame(rddCentralMap, ["ROWID_CDR","DURACION","FECHA_LLAMADA","FECHA_LLAMADA_2","MATCH"])
takes the data for the DataFrame as its first argument and the schema as its second. You have only passed a plain list of strings as the second argument.
To create a DataFrame with a specific schema, you must first construct the fields and then build the list that holds them.
So your program would look like this:
import os
import sys
from pyspark.sql import HiveContext, Row
from pyspark import SparkContext, SparkConf
from pyspark.sql.functions import *
from pyspark.sql import functions as F
from pyspark.sql.types import * #where structType and structField come from
if __name__ == '__main__':
    conf = SparkConf().setAppName("Spark RDD").set("spark.speculation", "true")
    sc = SparkContext(conf=conf)
    sc.setLogLevel("OFF")
    sqlContext = HiveContext(sc)
    #rddCentral = sc.textFile("hdfs:///user/spark/archivo.csv")
    rddCentral = sc.textFile("/user/spark/archivo.csv")
    rddCentralMap = rddCentral.map(lambda line: line.split(","))
    print('paso 1')
    dfFields = ["ROWID_CDR","DURACION","FECHA_LLAMADA","FECHA_LLAMADA_2","MATCH"]
    dfSchema = StructType([StructField(field_name, StringType(), True) for field_name in dfFields])
    dfCentral = sqlContext.createDataFrame(rddCentralMap, dfSchema)
    dfCentral = dfCentral.withColumn("FECHA_LLAMADA_NUM", dfCentral.FECHA_LLAMADA_2.cast(IntegerType()))
    dfCentral = dfCentral.withColumn("DURACION_NUM", dfCentral.DURACION.cast(IntegerType()))
    dfCentral = dfCentral.withColumn("MATCH_NUM", dfCentral.MATCH.cast(IntegerType()))
    sc.stop()
Alternatively, note that the createDataFrame function takes an RDD as its first argument. The map applied to the RDD created from the file could also be the source of your problem.
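For example, any line whose comma split does not yield exactly five fields would make createDataFrame fail against a five-column schema. One defensive option is to filter such rows out first; in Spark that would be something like rddCentralMap.filter(lambda fields: len(fields) == 5). The sketch below mirrors that logic on plain Python lists (no Spark needed) using made-up sample rows, purely for illustration:

```python
# Sketch only: mimics rddCentral.map(split) followed by a length filter,
# using invented sample lines rather than the real archivo.csv contents.
EXPECTED_FIELDS = 5  # ROWID_CDR, DURACION, FECHA_LLAMADA, FECHA_LLAMADA_2, MATCH

sample_lines = [
    "1001,60,20220930,20220930,1",  # well-formed row
    "1002,45,20220930",             # truncated row: would break createDataFrame
]

rows = [line.split(",") for line in sample_lines]
clean = [r for r in rows if len(r) == EXPECTED_FIELDS]

print(clean)  # only the well-formed row survives
```

The same guard translates directly to the RDD pipeline as rddCentralMap.filter(lambda fields: len(fields) == EXPECTED_FIELDS) before calling createDataFrame.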