Regular expressions - Spark Scala dataset

Question

I want to get tokens from tweets.

To achieve this I use the RegexTokenizer of Spark 2.0 with Scala. My problem is finding the pattern I need.

I have these tweets:

0) "#oscars https://w.r/123f5"
1) "#oscars! go leo!"
2) "#oscars: did it!"

And I want to get these tokens:

0) (#oscars, https://w.r/123f5)
1) (#oscars, go, leo)
2) (#oscars, did, it)

That is, if the tweet contains "#oscars.", "#oscars!" or "#oscars:", I want the token to be "#oscars". Likewise, if the tweet contains "leo!" or "it!", I want the token to be "leo" or "it".

I don't want URLs to be broken up!

I tried:

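import org.apache.spark.ml.feature.RegexTokenizer

val sentenceDataFrame = spark.createDataFrame(Seq(
  (0, "#oscars https://w.r/123f5"),
  (1, "#oscars! go leo!"),
  (2, "#oscars: he did it! ")
)).toDF("label", "sentence")

// setGaps(false): the pattern describes the tokens themselves, not the separators
val regexTokenizer = new RegexTokenizer()
  .setGaps(false)
  .setPattern("\\p{L}+")
  .setInputCol("sentence") // must match the column name given in toDF
  .setOutputCol("words")

val regexTokenized = regexTokenizer.transform(sentenceDataFrame)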

But it doesn't work well. I get:

(oscars, https, w, r, 123f5)
(oscars, go, leo)
(oscars, he, did, it)

Answer 1

Inside setPattern, use

"(?U)\\bhttps?://\\S*|#?\\b\\w+\\b"

See the regex demo.

Details: the regex matches URLs with \\bhttps?://\\S* and hashtags or words with #?\\b\\w+\\b.

(?U) - makes \\b and \\w Unicode aware
\\b - a leading word boundary
https? - http or https
:// - a literal :// character sequence
\\S* - 0+ non-whitespace symbols
| - or
#? - 1 or 0 # symbols
\\b\\w+\\b - a whole word: 1+ word chars (Unicode aware) within word boundaries.
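
To check the pattern without starting Spark, here is a minimal sketch in plain Scala that applies it to the three tweets from the question. The object name TweetTokens and the helper tokenize are made up for this illustration. Collecting all matches with findAllIn is assumed to correspond to what RegexTokenizer does with setGaps(false), apart from any lowercasing.

```scala
object TweetTokens {
  // Same pattern as above, written as an ordinary (escaped) Scala string.
  val pattern = "(?U)\\bhttps?://\\S*|#?\\b\\w+\\b".r

  // Hypothetical helper: collect every match of the pattern as a token.
  def tokenize(text: String): List[String] = pattern.findAllIn(text).toList

  def main(args: Array[String]): Unit = {
    val tweets = Seq(
      "#oscars https://w.r/123f5",
      "#oscars! go leo!",
      "#oscars: he did it! "
    )
    // Print one parenthesised token list per tweet.
    tweets.foreach(t => println(tokenize(t).mkString("(", ", ", ")")))
  }
}
```

Two caveats. Because \\S* takes everything up to the next whitespace, punctuation attached directly to a URL (for example a trailing "!") stays inside the URL token. Also, RegexTokenizer lowercases its input by default, so if the case of URLs matters, setToLowercase(false) is probably needed.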

Answer 1: accepted, 2016-12-06 22:15:48
