
Compare two dataframes to find substring in Spark

I have three dataframes: dictionary, SourceDictionary, and MappedDictionary. The dictionary and SourceDictionary each have only one column, say words, as String. The dictionary, which has a million records, is a subset of MappedDictionary (around 10M records), and each record in MappedDictionary is a substring of dictionary. So, I need to map the dictionary with SourceDictionary to MappedDictionary. Example:

Records in dictionary : BananaFruit, AppleGreen
Records in SourceDictionary : Banana, grape, orange, lemon, Apple, ...

Records to be mapped in MappedDictionary (contains two columns):

BananaFruit Banana
AppleGreen Apple
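The mapping rule in the example above can be sketched with plain Scala collections (no Spark involved; the values are just the sample records from the question):

```scala
// Plain-Scala sketch of the intended mapping: for every dictionary word,
// pick the first SourceDictionary word it contains (null if none matches).
val dictionary = List("BananaFruit", "AppleGreen")
val sourceDictionary = List("Banana", "grape", "orange", "lemon", "Apple")

val mapped = dictionary.map(word => (word, sourceDictionary.find(word.contains).orNull))
// mapped == List(("BananaFruit", "Banana"), ("AppleGreen", "Apple"))
```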

I planned to use two nested for loops in Java and perform a substring operation, but the problem is that 1 million * 10 million = 10 trillion iterations. Also, I can't find a correct way to iterate over a dataframe like a for loop. Can someone suggest a way to iterate over a Dataframe and perform substring operations? Sorry for my poor English, I am a non-native speaker. Thanks to the stackoverflow community members in advance :-)

Though you have a million records in sourceDictionary, because it has only one column, broadcasting it to every node won't take up much memory, and it will speed up the overall performance.

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import org.apache.spark.sql.catalyst.encoders.RowEncoder

// Assuming the schema names
val sourceDictionarySchema = StructType(Seq(StructField("value", StringType, false)))
val dictionarySchema = StructType(Seq(StructField("value", StringType, false)))
val mappedDictionarySchema = StructType(Seq(
   StructField("value", StringType, false),
   StructField("key", StringType, false)
))

// Collect the single column to the driver and broadcast it to every executor
val sourceDictionaryBC = sc.broadcast(
   sourceDictionary.map(row =>
      row.getAs[String]("value")
   ).collect.toList
)

// For each dictionary entry, find the first broadcast word it contains
val MappedDictionaryN = dictionary.map { row =>
   val value = row.getAs[String]("value")
   val matchedKey = sourceDictionaryBC.value.find(value.contains)

   Row(value, matchedKey.orNull)
}(RowEncoder(mappedDictionarySchema))

After this you have all the newly mapped records. If you want to combine them with the existing MappedDictionary, just do a simple union.

MappedDictionaryN.union(MappedDictionary)

