Regular Expression - Spark Scala Dataset

I want to extract tokens from tweets.

To achieve this I use RegexTokenizer from Spark 2.0 with Scala. My problem is coming up with the pattern I need.

I have these tweets:

0) "#oscars https://w.r/123f5"
1) "#oscars! go leo!"
2) "#oscars: did it!"

And I want to get these tokens:

0) (#oscars, https://w.r/123f5)
1) (#oscars, go, leo)
2) (#oscars, did, it)

That is, if the tweet contains "#oscars.", "#oscars!" or "#oscars:", I want the token to be "#oscars". Likewise, if the tweet contains "leo!" or "it!", I want the token to be "leo" or "it".

I don't want URLs to be split apart!

I tried:

import org.apache.spark.ml.feature.RegexTokenizer

val sentenceDataFrame = spark.createDataFrame(Seq(
  (0, "#oscars https://w.r/123f5"),
  (1, "#oscars! go leo!"),
  (2, "#oscars: he did it! ")
)).toDF("label", "sentence")

// setGaps(false): the pattern describes the tokens themselves, not the separators
val regexTokenizer = new RegexTokenizer()
  .setGaps(false)
  .setPattern("\\p{L}+")
  .setInputCol("sentence")
  .setOutputCol("words")

val regexTokenized = regexTokenizer.transform(sentenceDataFrame)

But it doesn't work well. I get:

(oscars, https, w, r, 123f5)
(oscars, go, leo)
(oscars, he, did, it)

Inside setPattern, use:

"(?U)\\bhttps?://\\S*|#?\\b\\w+\\b"

See the regex demo.
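For context, here is a minimal sketch of how that pattern might be wired into the tokenizer from the question. The object name TweetTokenizerSketch, the tokenize helper, and the local SparkSession settings are assumptions made for illustration; only the pattern and the RegexTokenizer setters come from the discussion above.

```scala
import org.apache.spark.ml.feature.RegexTokenizer
import org.apache.spark.sql.SparkSession

object TweetTokenizerSketch {
  // Pattern from the answer: try a URL first, otherwise an optional '#' plus a word.
  val tokenPattern = "(?U)\\bhttps?://\\S*|#?\\b\\w+\\b"

  // Hypothetical helper: tokenizes (label, sentence) pairs and returns the
  // token lists ordered by label.
  def tokenize(spark: SparkSession, tweets: Seq[(Int, String)]): Array[Seq[String]] = {
    val df = spark.createDataFrame(tweets).toDF("label", "sentence")

    val regexTokenizer = new RegexTokenizer()
      .setGaps(false)              // the pattern matches tokens, not separators
      .setPattern(tokenPattern)
      .setInputCol("sentence")     // must match the DataFrame column name
      .setOutputCol("words")

    regexTokenizer.transform(df)
      .orderBy("label")
      .select("words")
      .collect()
      .map(_.getSeq[String](0))
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; app name and master are assumptions.
    val spark = SparkSession.builder()
      .appName("tweet-tokenizer-sketch")
      .master("local[*]")
      .getOrCreate()

    val tweets = Seq(
      (0, "#oscars https://w.r/123f5"),
      (1, "#oscars! go leo!"),
      (2, "#oscars: he did it! ")
    )

    tokenize(spark, tweets).foreach(ts => println(ts.mkString("(", ", ", ")")))
    spark.stop()
  }
}
```

Note that RegexTokenizer lowercases its input by default, so a hashtag such as #Oscars would come out as #oscars unless setToLowercase(false) is called. Also, because the URL branch uses \\S*, punctuation attached directly to a URL would stay part of that token.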

Details: the regex matches URLs with \\bhttps?://\\S* and hashtags or words with #?\\b\\w+\\b.

  • (?U) - makes \\b and \\w Unicode-aware
  • \\b - a leading word boundary
  • https? - http or https
  • :// - a literal :// character sequence
  • \\S* - zero or more non-whitespace characters
  • | - or
  • #? - one or zero # characters
  • \\b\\w+\\b - a whole word: one or more word characters (Unicode-aware) within word boundaries.
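The breakdown above can also be checked without Spark, since the pattern is an ordinary Java regular expression. The following is a small sketch using Scala's standard Regex class; the object name TweetTokens and the tokenize helper are hypothetical, and unlike RegexTokenizer this version does not lowercase its input.

```scala
import scala.util.matching.Regex

object TweetTokens {
  // Same pattern as in the answer, written as a regular (escaped) Scala string.
  // Alternation is ordered: the URL branch is tried before the word branch.
  val tokenPattern: Regex = "(?U)\\bhttps?://\\S*|#?\\b\\w+\\b".r

  // Returns every non-overlapping match, left to right.
  def tokenize(tweet: String): List[String] =
    tokenPattern.findAllIn(tweet).toList

  def main(args: Array[String]): Unit = {
    val tweets = Seq(
      "#oscars https://w.r/123f5",
      "#oscars! go leo!",
      "#oscars: he did it! "
    )
    tweets.foreach(t => println(tokenize(t).mkString("(", ", ", ")")))
    // prints:
    // (#oscars, https://w.r/123f5)
    // (#oscars, go, leo)
    // (#oscars, he, did, it)
  }
}
```

The leading # is kept because #? sits before the word boundary: the boundary falls between # and the first letter, so the hashtag sign can be consumed while trailing punctuation such as ! or : is left out of the match.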
