
Using regex to count the number of Twitter mentions in Spark (Scala)

I am new to Spark. I want to output the top 2 Twitter mentions using this test.txt file:

"I love to dance @Kelsey, especially with you @Kelsey!"

"Can't believe you went to @harvard. Come on man @harvard"

"I love @harvard"

Essentially, multiple mentions of the same user in a single tweet count only once. So the output would look like:

(2, @harvard)

(1, @Kelsey)

So far, my code looks like the following:

val tweets = sc.textFile("testFile")

val myReg = """(?<=@)([\\w]+)""".r

val mentions = tweets.filter(x => (myReg.pattern.matcher(x).matches))

However, this does not work, because x is still a whole line and matches only succeeds when the entire line matches the pattern. Is there any way I can test the words in the line instead of the line itself? Also, how do I check whether a mention is repeated within the same tweet?

I adjusted your regex a little, and you might need to translate it back to Spark syntax, but this way you find all mentions and group them. The .toSet is important because it removes duplicates within a tweet; .toLowerCase would also make sense there.

  val tweets = List("I love to dance @Kelsey, especially with you @Kelsey!",
                "Can't believe you went to @harvard. Come on man @harvard",
                "I love @harvard")


  val myReg = """(@\w+)""".r

  val mentions = tweets.flatMap(x => myReg.findAllIn(x).toSet).groupBy(identity).mapValues(_.length)

  println(mentions)
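The snippet above yields a map of counts, while the question asks for the top 2 as (count, mention) pairs. A minimal sketch of that last step on plain Scala collections follows; the object name TopMentions and the helper topMentions are made up for illustration, and the tie-break on the mention name is an assumption added only to make the order deterministic. On an RDD the same steps should map onto flatMap, reduceByKey, sortBy and take, but that translation is not shown or tested here.

```scala
object TopMentions {
  // Same idea as the pattern above: '@' followed by word characters.
  private val Mention = """@\w+""".r

  /** Returns the n most-mentioned handles as (count, mention) pairs.
    * A handle repeated within one tweet is counted once for that tweet. */
  def topMentions(tweets: Seq[String], n: Int): Seq[(Int, String)] =
    tweets
      .flatMap(t => Mention.findAllIn(t).toSet) // distinct mentions per tweet
      .groupBy(identity)
      .toSeq
      .map { case (mention, hits) => (hits.size, mention) }
      // highest count first; ties broken by name (an assumption, for determinism)
      .sortBy { case (count, mention) => (-count, mention) }
      .take(n)

  def main(args: Array[String]): Unit = {
    val tweets = Seq(
      "I love to dance @Kelsey, especially with you @Kelsey!",
      "Can't believe you went to @harvard. Come on man @harvard",
      "I love @harvard")
    topMentions(tweets, 2).foreach(println)
  }
}
```

Adding .toLowerCase to each match before .toSet would merge @Harvard and @harvard, if case-insensitive counting is wanted.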

This works for me; the regex follows Twitter's handle rules more closely (1 to 15 word characters, not preceded by another @ or a word character):

val myReg = "(^|[^@\\w])@(\\w{1,15})\\b".r

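// group(2) is the handle without the character matched before '@'; toSet counts a mention once per tweet
val mentions = tweets.flatMap(x => myReg.findAllMatchIn(x).map(m => "@" + m.group(2)).toSet).map(_ -> 1).reduceByKey(_ + _)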
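To see what the stricter pattern buys, here is a small comparison on plain strings, without Spark. It is only a sketch: the object name MentionPatterns and the helpers loose and strict are hypothetical, and the sample strings are invented for illustration. The loose pattern picks up the domain part of an e-mail address and over-long handles, while the strict one skips both.

```scala
object MentionPatterns {
  // The simple pattern from the first answer.
  val Loose = """@\w+""".r
  // The stricter pattern: '@' must start the string or follow a character that is
  // neither '@' nor a word character, and the handle is 1 to 15 word characters.
  val Strict = """(^|[^@\w])@(\w{1,15})\b""".r

  def loose(text: String): List[String] =
    Loose.findAllIn(text).toList

  // group(2) holds the handle itself, without the character captured before '@'.
  def strict(text: String): List[String] =
    Strict.findAllMatchIn(text).map(m => "@" + m.group(2)).toList

  def main(args: Array[String]): Unit = {
    val text = "mail me at bob@example.com or ping @alice"
    println(loose(text))
    println(strict(text))
  }
}
```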
