Error while reading a CSV file in Spark - Scala

I am trying to read a CSV file in Spark using the CSV reader API. I am currently encountering an array index out of bounds exception.


There is no issue with the input file. All the rows have the same number of columns (column count: 65).

Below is the code that I tried.

sparkSess.read.option("header", "true").option("delimiter", "|").csv(filePath)

Expected result - dataFrame.show()

Actual error -

19/03/28 10:42:51 INFO FileScanRDD: Reading File path: file:///C:/Users/testing/workspace_xxxx/abc_Reports/src/test/java/report1.csv, range: 0-10542, partition values: [empty row]
19/03/28 10:42:51 ERROR Executor: Exception in task 0.0 in stage 6.0 (TID 6)
java.lang.ArrayIndexOutOfBoundsException: 63
    at org.apache.spark.unsafe.types.UTF8String.numBytesForFirstByte(UTF8String.java:191)
    at org.apache.spark.unsafe.types.UTF8String.numChars(UTF8String.java:206)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:253)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:109)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)

Input data:

QW|8|2344|H02|1002|              |1|2019-01-20|9999-12-31|  |EE|2014-01-20|2014-01-20|2014-01-20|CNB22345            |IN|9|1234444| |        |        |10|QQ|8|BMX10290M|EWR|   |.000000000|00|M |2027-01-20|2027-01-20| |.00|.00|.00|.00|2014-01-20|1901-01-01|3423.25|  |          |          |      |RE|WW|  |RQ|   |   |   |        |     |        |  | |1901-01-01|0|SED2233345   |2019-01-15 22:10:23|213EDSFDS |78978775|2019-03-23 07:38:34.823000000|        |

Just found the exact issue.

Of the 13 CSV files that I was trying to read, 10 were UTF-8 encoded, and those were NOT causing the issue. The other 3 files were UCS-2 encoded, and those were the ones that broke the CSV read and caused the above-mentioned error.

UTF-8 ==> Unicode Transformation Format Encoding.
UCS-2 ==> Universal Coded Character Set Encoding.
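
To find out which files in a batch are not UTF-8, one option is to look at the byte-order mark (BOM) at the start of each file. The sketch below is only an illustration: `BomSniffer` and `detectBom` are made-up names, and it assumes the UCS-2 files were saved with a BOM, which many editors do but which is not guaranteed. A file without a BOM returns `None` and would need a different check.

```scala
import java.nio.file.{Files, Path}

object BomSniffer {
  /** Guess the encoding from a leading byte-order mark, if the file has one.
    * Note: a UTF-32LE BOM also starts with FF FE and is not distinguished here.
    */
  def detectBom(path: Path): Option[String] = {
    val in = Files.newInputStream(path)
    try {
      val head = new Array[Byte](3)
      val n = in.read(head)        // number of bytes actually read, -1 if the file is empty
      val b = head.map(_ & 0xFF)   // view the bytes as unsigned values
      if (n >= 3 && b(0) == 0xEF && b(1) == 0xBB && b(2) == 0xBF) Some("UTF-8")
      else if (n >= 2 && b(0) == 0xFF && b(1) == 0xFE) Some("UTF-16LE")
      else if (n >= 2 && b(0) == 0xFE && b(1) == 0xFF) Some("UTF-16BE")
      else None
    } finally in.close()
  }
}
```

Running `BomSniffer.detectBom` over every input file before handing the directory to Spark would flag the files that start with a UTF-16 style BOM, which is how UCS-2 files are usually marked.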

From this I learned that the Databricks CSV reader handles UTF-8 input but runs into problems with UCS-2 input. So I saved the files as UTF-8 and tried reading them again. It worked like a charm.
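
The re-saving step can also be scripted instead of done by hand in an editor. The following is a minimal sketch, not what was actually used here: `CsvEncodingFix` and `convertToUtf8` are hypothetical names, and it assumes the UCS-2 files can be decoded with Java's UTF-16 charsets (true for text in the Basic Multilingual Plane) and are small enough to read into memory.

```scala
import java.nio.charset.{Charset, StandardCharsets}
import java.nio.file.{Files, Path}

object CsvEncodingFix {
  /** Re-encode a text file from the charset `from` to UTF-8. */
  def convertToUtf8(source: Path, target: Path, from: Charset): Unit = {
    // Decode the whole file with the source charset.
    val text = new String(Files.readAllBytes(source), from)
    // UTF_16 consumes the BOM itself; UTF_16LE/UTF_16BE leave it in the text, so drop it.
    val cleaned = if (text.startsWith("\uFEFF")) text.substring(1) else text
    Files.write(target, cleaned.getBytes(StandardCharsets.UTF_8))
  }
}
```

For example, `CsvEncodingFix.convertToUtf8(src, dst, StandardCharsets.UTF_16)` writes a UTF-8 copy that can then be passed to the same `sparkSess.read...csv(filePath)` call as before.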

Feel free to add more insights on this, if any.

You can use com.databricks.spark.csv to read CSV files. Please find sample code below.

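import org.apache.spark.sql.SparkSession

object SparkCSVTest extends App {

  // Assumption: a local session for testing; the original snippet was cut off here.
  val spark = SparkSession.builder()
    .appName("SparkCSVTest")
    .master("local[*]")
    .getOrCreate()

  // Placeholder path; the original post does not show the file location.
  val filePath = "path/to/file.csv"

  val df = spark.read
    .option("header", "true")       // first line holds the column names
    .option("delimiter", "|")       // pipe-separated fields
    .option("inferSchema", "false") // keep every column as a string
    .csv(filePath)                  // assumed read call; the original was truncated

  df.show()
}
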

CSV file used:

QW|8|2344|H02|1002|              |1|2019-01-20|9999-12-31|  |EE|2014-01-20|2014-01-20|2014-01-20|CNB22345            |IN|9|1234444| |        |        |10|QQ|8|BMX10290M|EWR|   |.000000000|00|M |2027-01-20|2027-01-20| |.00|.00|.00|.00|2014-01-20|1901-01-01|3423.25|  |          |          |      |RE|WW|  |RQ|   |   |   |        |     |        |  | |1901-01-01|0|SED2233345   |2019-01-15 22:10:23|213EDSFDS |78978775|2019-03-23 07:38:34.823000000|        |

With header:

|  A|  B|   C|  D|   E|             F|  G|         H|         I|  J|  K|         L|         M|         N|                   O|  P|  Q|      R|  S|       T|       U|  V|  W|  X|        Y|  Z| AA|        BB| CC| DD|        EE|        FF| GG| HH| II| JJ| KK|        LL|        MM|     NN| OO|        PP|        QQ|    RR| SS| TT| UU| VV| WW| XX| YY|      ZZ| TGHJ|      HG|EEE|ASD|  EFFDCLDT|QSAS|          WWW|             DATIME|     JOBNM|  VFDCXS|                REWE|  XCVVCX|ASDFF|
| QW|  8|2344|H02|1002|              |  1|2019-01-20|9999-12-31|   | EE|2014-01-20|2014-01-20|2014-01-20|CNB22345            | IN|  9|1234444|   |        |        | 10| QQ|  8|BMX10290M|EWR|   |.000000000| 00| M |2027-01-20|2027-01-20|   |.00|.00|.00|.00|2014-01-20|1901-01-01|3423.25|   |          |          |      | RE| WW|   | RQ|   |   |   |        |     |        |   |   |1901-01-01|   0|SED2233345   |2019-01-15 22:10:23|213EDSFDS |78978775|2019-03-23 07:38:...|        | null|

Without header:

|_c0|_c1| _c2|_c3| _c4|           _c5|_c6|       _c7|       _c8|_c9|_c10|      _c11|      _c12|      _c13|                _c14|_c15|_c16|   _c17|_c18|    _c19|    _c20|_c21|_c22|_c23|     _c24|_c25|_c26|      _c27|_c28|_c29|      _c30|      _c31|_c32|_c33|_c34|_c35|_c36|      _c37|      _c38|   _c39|_c40|      _c41|      _c42|  _c43|_c44|_c45|_c46|_c47|_c48|_c49|_c50|    _c51| _c52|    _c53|_c54|_c55|      _c56|_c57|         _c58|               _c59|      _c60|    _c61|                _c62|    _c63| _c64|
|  A|  B|   C|  D|   E|             F|  G|         H|         I|  J|   K|         L|         M|         N|                   O|   P|   Q|      R|   S|       T|       U|   V|   W|   X|        Y|   Z|  AA|        BB|  CC|  DD|        EE|        FF|  GG|  HH|  II|  JJ|  KK|        LL|        MM|     NN|  OO|        PP|        QQ|    RR|  SS|  TT|  UU|  VV|  WW|  XX|  YY|      ZZ| TGHJ|      HG| EEE| ASD|  EFFDCLDT|QSAS|          WWW|             DATIME|     JOBNM|  VFDCXS|                REWE|  XCVVCX|ASDFF|
| QW|  8|2344|H02|1002|              |  1|2019-01-20|9999-12-31|   |  EE|2014-01-20|2014-01-20|2014-01-20|CNB22345            |  IN|   9|1234444|    |        |        |  10|  QQ|   8|BMX10290M| EWR|    |.000000000|  00|  M |2027-01-20|2027-01-20|    | .00| .00| .00| .00|2014-01-20|1901-01-01|3423.25|    |          |          |      |  RE|  WW|    |  RQ|    |    |    |        |     |        |    |    |1901-01-01|   0|SED2233345   |2019-01-15 22:10:23|213EDSFDS |78978775|2019-03-23 07:38:...|        | null|


    "com.databricks" %% "spark-csv" % "1.5.0",
    "org.apache.spark" %% "spark-core" % "2.2.2",
    "org.apache.spark" %% "spark-sql" % "2.2.2"


Hope it helps!
