Regular Expression - Spark scala DataSet
I want to get tokens from tweets.
To achieve this I use the RegexTokenizer of Spark 2.0 with Scala. My problem is getting the pattern right.
I have these tweets:
0) "#oscars https://w.r/123f5"
1) "#oscars! go leo!"
2) "#oscars: did it!"
And I want to get these tokens:
0) (#oscars, https://w.r/123f5)
1) (#oscars, go, leo)
2) (#oscars, did, it)
That is, if the tweet contains "#oscar.", "#oscar!" or "#oscar:", I want the token to be "#oscar". Likewise, if the tweet contains "leo!" or "it!", I want the token to be "leo" or "it".
I don't want URLs to be split apart!
I tried:
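import org.apache.spark.ml.feature.RegexTokenizer

val sentenceDataFrame = spark.createDataFrame(Seq(
  (0, "#oscars https://w.r/123f5"),
  (1, "#oscars! go leo!"),
  (2, "#oscars: he did it! ")
)).toDF("label", "sentence")

// gaps = false: the pattern describes the tokens themselves, not the separators
val regexTokenizer = new RegexTokenizer()
  .setGaps(false)
  .setPattern("\\p{L}+")
  .setInputCol("sentence") // must match the column name given in toDF
  .setOutputCol("words")

val regexTokenized = regexTokenizer.transform(sentenceDataFrame)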
But it doesn't work well. I get:
(oscars, https, w, r, 123f5)
(oscars, go, leo)
(oscars, he, did, it)
Inside setPattern, use
"(?U)\\bhttps?://\\S*|#?\\b\\w+\\b"
See the regex demo.
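Plugged into a RegexTokenizer, the suggested pattern might be used as in the sketch below. The object name TweetTokenizerSketch, the tokenize helper and the local SparkSession in the usage lines are assumptions made for illustration; they are not part of the original question or answer. The input column is set to "sentence" so that it matches the DataFrame column name.

```scala
import org.apache.spark.ml.feature.RegexTokenizer
import org.apache.spark.sql.SparkSession

// Hypothetical helper object, introduced only for this sketch.
object TweetTokenizerSketch {
  // URLs are tried first so they stay in one piece; otherwise an optional '#' plus a whole word.
  val Pattern = "(?U)\\bhttps?://\\S*|#?\\b\\w+\\b"

  def tokenize(spark: SparkSession, tweets: Seq[(Int, String)]): List[List[String]] = {
    val df = spark.createDataFrame(tweets).toDF("label", "sentence")

    // gaps = false: the pattern matches the tokens, not the separators between them.
    val tokenizer = new RegexTokenizer()
      .setGaps(false)
      .setPattern(Pattern)
      .setInputCol("sentence")
      .setOutputCol("words")

    tokenizer.transform(df)
      .orderBy("label")
      .select("words")
      .collect()
      .map(_.getSeq[String](0).toList)
      .toList
  }
}
```

In spark-shell the existing spark session can be passed directly; elsewhere a local session is one option:

```scala
val spark = SparkSession.builder().master("local[1]").appName("tweet-tokens").getOrCreate()
TweetTokenizerSketch.tokenize(spark, Seq((0, "#oscars https://w.r/123f5"), (1, "#oscars! go leo!")))
```

Note that RegexTokenizer lowercases its input by default, so URLs come back lowercased unless setToLowercase(false) is called.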
Details: the regex matches URLs with \\bhttps?://\\S* and, with #?\\b\\w+\\b, hashtags or words.
(?U) - makes \\b and \\w Unicode aware
\\b - a leading word boundary
https? - http or https
:// - a literal :// character sequence
\\S* - 0+ non-whitespace symbols
| - or
#? - 1 or 0 # characters
\\b\\w+\\b - a whole word, 1+ word chars (Unicode aware) within word boundaries.
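The breakdown above can be checked without Spark by running the same pattern through plain Scala regex matching. This is a minimal sketch: the object name TweetTokens and the tokenize helper are hypothetical, and the lowercasing step only imitates RegexTokenizer's default behaviour.

```scala
import scala.util.matching.Regex

// Hypothetical helper object, introduced only for this sketch.
object TweetTokens {
  // Same pattern as passed to setPattern; (?U) makes \b and \w Unicode aware.
  val TokenPattern: Regex = "(?U)\\bhttps?://\\S*|#?\\b\\w+\\b".r

  // Imitates RegexTokenizer with setGaps(false): every match becomes one token.
  // Lowercasing first mirrors the tokenizer's default toLowercase = true.
  def tokenize(text: String): List[String] =
    TokenPattern.findAllIn(text.toLowerCase).toList
}
```

For example, TweetTokens.tokenize("#oscars! go leo!") keeps the hashtag and drops the punctuation. Because \\S* consumes every non-whitespace character, punctuation attached directly to the end of a URL stays part of the URL token.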