Algorithm to determine quality of an article

I am working on a project that requires me to parse news articles and determine the best among them. I figured that to determine the quality of an article I would need three main parameters: the length of the article, its Facebook shares/retweets, and the time since it was posted.

The problem I am facing now is how to combine all three parameters in one mathematical function and come up with a score for each article. The score assigned to each article would let me rank the articles and show them to users.

Also, let me know if there are any other parameters I should consider in determining quality.

I'm not sure what the exact nature of your project is, but this task is very hard to do accurately. How do you account for the fact that the articles shared/liked most are often the most polarizing ones? The number of likes/shares is also clearly influenced by how popular the news site is. I would think that any kind of automated text analysis will not be accurate enough and could easily be abused. Your best bet, then, is to look for indicative proxies such as:

  • Reputability of the site, as measured by its ranking in Google search results
  • Popularity of the site, as measured by traffic
  • Number of Facebook likes/shares, as you mentioned
  • Number of places on the internet that link to the article

Since a dataset that contains article grades will be hard to come by, you probably won't be able to do any kind of statistical analysis. Instead you'll just have to make up a formula and weigh the parameters using your best judgement. To back this up a little, maybe hand-grade a few articles and see what different formulas give you.
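As a rough illustration of that last suggestion, here is a minimal Python sketch. The signal names, the weights, and the use of `log1p` damping are all assumptions made up for this example, and each signal is assumed to be a non-negative number where higher means better. `pairwise_agreement` simply counts how often a formula orders two articles the same way as your hand grades do.

```python
import math

# Hypothetical weights picked by judgement; nothing in this thread prescribes them.
WEIGHTS = {
    "site_reputation": 0.4,
    "site_traffic": 0.2,
    "shares": 0.3,
    "inbound_links": 0.1,
}


def proxy_score(signals, weights=WEIGHTS):
    """Weighted sum of log-damped proxy signals (each assumed >= 0, higher is better)."""
    return sum(w * math.log1p(signals[name]) for name, w in weights.items())


def pairwise_agreement(scores, grades):
    """Fraction of article pairs ordered the same way by the formula and by hand grades."""
    agree = total = 0
    for i in range(len(scores)):
        for j in range(i + 1, len(scores)):
            if grades[i] == grades[j]:
                continue  # tied hand grades carry no ordering information
            total += 1
            if (scores[i] - scores[j]) * (grades[i] - grades[j]) > 0:
                agree += 1
    return agree / total if total else 0.0
```

Trying a few different weight dictionaries against the same handful of hand-graded articles gives a crude way to pick between candidate formulas.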

What you want is stunningly easy to achieve. You have two kinds of data that you are interested in: increasing and decreasing data. Increasing data is considered "good" for as long as it increases. Decreasing data is considered "better" the nearer it is to zero.

It turns out that all four quantities are simple integers:

Increasing data

  • shares: non-negative integer s \in N_0 (every integer from zero upward)
  • retweets: non-negative integer r \in N_0

Decreasing data

For decreasing data, use an absolute difference as the metric:

  • Let t_0 be the timestamp (Unix or similar) of the article.
  • Let T be the current timestamp.
  • Let l_0 denote the length of an article considered "best".
  • Let L denote the actual length of the article.

Then:

  • time: |t_0 - T|, the nearer to zero the better
  • length: |l_0 - L|, the nearer to zero the better

Since the absolute values are non-negative integers, it follows that:

|l_0 - L| + |t_0 - T| is nearer to zero the nearer |t_0 - T| and |l_0 - L| are to zero.

The same is true for the increasing numbers: s + r grows as s and r grow.

So the closer an article is to the "correct" length and the newer it is, the nearer this number is to zero.
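The combined distance can be sketched in Python as follows. The function and parameter names are invented for this example. Note that the sum adds a length difference (words or characters) to a time difference (seconds), so in practice you may want to rescale one of the two terms; the sketch keeps the plain sum described above.

```python
import time


def distance_penalty(length, posted_at, ideal_length, now=None):
    """|l_0 - L| + |t_0 - T|: zero for a brand-new article of exactly the ideal length."""
    # `now` defaults to the current Unix time; pass it explicitly for reproducible results.
    now = time.time() if now is None else now
    return abs(ideal_length - length) + abs(posted_at - now)
```

For instance, an article of length 800 with an ideal length of 1000, posted 600 seconds ago, gets a penalty of 200 + 600 = 800.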

Conclusion

The quotient of an increasing number over a decreasing one is itself increasing. Think about it: the smaller the denominator, the bigger the quotient; the bigger the numerator, the bigger the quotient.

That means the quotient

(s+r) / (|l_0 - L| + |t_0 - T|)

rises as an article gets "better".
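A minimal Python sketch of this quotient is shown here. The names are made up for illustration, and the `eps` term is an addition of this sketch, not part of the formula: without it, a brand-new article of exactly the ideal length would cause a division by zero.

```python
def quotient_score(shares, retweets, length, posted_at, ideal_length, now, eps=1.0):
    """(s + r) / (|l_0 - L| + |t_0 - T|), with a small guard against a zero denominator."""
    penalty = abs(ideal_length - length) + abs(posted_at - now)
    # eps is an assumption of this sketch; it only matters when penalty is near zero.
    return (shares + retweets) / (penalty + eps)
```

More shares or retweets raise the score; a larger distance from the ideal length or a greater age lowers it.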

This is not necessarily an integer anymore.

Enhancement

You can soften the growth of shares and retweets, so that the score becomes a little more "natural", by using ln:

ln(s+r) / (|l_0 - L| + |t_0 - T|)
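To see what the logarithm does to the numerator, consider this small Python sketch. It uses `math.log1p`, i.e. ln(1 + s + r), rather than ln(s + r); that substitution is an assumption of the sketch, made because ln(0) is undefined for an article with no shares or retweets.

```python
import math


def damped_popularity(shares, retweets):
    """ln(1 + s + r): grows with s + r, but far more slowly than s + r itself."""
    # log1p is used instead of log so that zero shares and retweets yields 0.0
    # rather than a math domain error.
    return math.log1p(shares + retweets)
```

With this damping, an article with 1000 shares scores less than three times as high as one with 10 shares, instead of a hundred times as high.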

You could use exp to soften the denominator, so that the score decays smoothly as the distance grows:

ln(s+r) / exp(|l_0 - L| + |t_0 - T|)
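Putting the pieces together, a hedged Python sketch of a ranking function might look like this. It multiplies the damped share count by exp(-distance), which is the same as dividing by exp(distance), so the score shrinks as an article ages or drifts from the ideal length. The `Article` class, `ideal_length`, `length_scale`, and `time_scale` are all hypothetical; the scales are there because raw second counts inside exp would push every score to zero. `log1p` again stands in for ln so that an article with zero shares does not raise an error.

```python
import math
from dataclasses import dataclass


@dataclass
class Article:
    title: str
    shares: int
    retweets: int
    length: int       # e.g. word count
    posted_at: float  # Unix timestamp


def score(article, now, ideal_length=800, length_scale=400.0, time_scale=86_400.0):
    """Damped popularity times an exponential decay in the scaled distance."""
    # Scales are assumptions: 400 words and one day each count as one unit of distance.
    distance = (abs(ideal_length - article.length) / length_scale
                + abs(article.posted_at - now) / time_scale)
    return math.log1p(article.shares + article.retweets) * math.exp(-distance)


def rank(articles, now):
    """Return the articles sorted from highest to lowest score."""
    return sorted(articles, key=lambda a: score(a, now), reverse=True)
```

Calling `rank(articles, time.time())` then returns the list in display order; tuning the two scales changes how harshly age and length are punished relative to each other.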
