
Working with non-English characters in columns of Spark Scala dataframes

Here is part of a file I am trying to load into a dataframe:

alphabet|Sentence|Comment1
è|Small e|None
Ü|Capital U|None
ã|Small a|
Ç|Capital C|None

When I load this file into a dataframe, all the non-English characters get converted into boxes. I tried setting option("encoding","UTF-8"), but there is no change.

val nonEnglishDF = spark.read.format("com.databricks.spark.csv")
  .option("delimiter", "|")
  .option("header", true)
  .option("encoding", "UTF-8")
  .load(hdfs file path)

Please let me know if there is any solution for this. I need to save the file in the end with the non-English characters unchanged. Currently, when the file is saved, it contains boxes or question marks instead of the non-English characters.

Use the decode function on that column:

decode(col("column_name"), "US-ASCII")

// It should work with one of these: 'US-ASCII', 'ISO-8859-1', 'UTF-8', 'UTF-16BE', 'UTF-16LE', 'UTF-16'
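A minimal sketch of that approach in context (the column and path names here are illustrative, and the charset passed to decode must match the bytes actually stored in the file):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, decode}

val spark = SparkSession.builder().appName("decode-example").getOrCreate()

// Read the pipe-delimited sample file; "/tmp/alphabet.csv" is an illustrative path.
val df = spark.read
  .option("delimiter", "|")
  .option("header", true)
  .csv("/tmp/alphabet.csv")

// Reinterpret the column's bytes using the charset the file was really written in.
val fixed = df.withColumn("alphabet", decode(col("alphabet"), "ISO-8859-1"))
fixed.show()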

It works with option("encoding", "ISO-8859-1"). For example:

val nonEnglishDF = spark.read.format("com.databricks.spark.csv")
  .option("delimiter", "|")
  .option("header", true)
  .option("encoding", "ISO-8859-1")
  .load(hdfs file path)
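Since the question also asks about saving the file with the characters intact, here is a hedged round-trip sketch: the read-side charset must match how the input was written, and the write-side "encoding" option is honored in recent Spark versions (paths are illustrative):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("roundtrip-example").getOrCreate()

// Read with the charset that matches the input file's actual encoding.
val df = spark.read
  .option("delimiter", "|")
  .option("header", true)
  .option("encoding", "ISO-8859-1")
  .csv("/tmp/alphabet.csv")          // illustrative input path

// Write back with an explicit charset so è, Ü, ã, Ç survive the round trip.
df.write
  .option("delimiter", "|")
  .option("header", true)
  .option("encoding", "UTF-8")
  .csv("/tmp/alphabet_out")          // illustrative output path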
