
A good machine learning technique to weed out good URLs from bad

I have an application that needs to discriminate between good HTTP GET requests and bad.

For example:

http://somesite.com?passes=dodgy+parameter                # BAD
http://anothersite.com?passes=a+good+parameter            # GOOD

My system can make a binary decision about whether a URL is good or bad - but ideally I would like it to predict whether a previously unseen URL is good or bad.

http://some-new-site.com?passes=a+really+dodgy+parameter # BAD

I feel the need for a support vector machine (SVM) ... but I need to learn machine learning. Some questions:

1) Is an SVM appropriate for this task?
2) Can I train it with the raw URLs - without explicitly specifying 'features'?
3) How many URLs will I need for it to be good at predictions?
4) What kind of SVM kernel should I use?
5) After I train it, how do I keep it up to date?
6) How do I test unseen URLs against the SVM to decide whether they're good or bad?

I think that steve and StompChicken both make excellent points:

  • Picking the best algorithm is tricky, even for machine learning experts. Using a general-purpose package like Weka will let you easily compare a bunch of different approaches to determine which works best for your data.
  • Choosing good features is often one of the most important factors in how well a learning algorithm will work.

It could also be useful to examine how other people have approached similar problems:

  • Qi, X. and Davison, B. D. 2009. Web page classification: features and algorithms. ACM Computing Surveys 41, 2 (Feb. 2009), 1-31.
  • Kan, M.-Y. and Thi, H. O. N. 2005. Fast webpage classification using URL features. In Proceedings of the 14th ACM International Conference on Information and Knowledge Management (CIKM '05), New York, NY, pp. 325-326.
  • Devi, M. I., Rajaram, R., and Selvakuberan, K. 2007. Machine learning techniques for automated web page classification using URL features. In Proceedings of the International Conference on Computational Intelligence and Multimedia Applications (ICCIMA 2007) - Volume 02 (December 13-15, 2007), Washington, DC, pp. 116-120.
  1. I don't agree with steve that an SVM is a bad choice here, although I also don't think there's much reason to think it will do any better than any other discriminative learning algorithm.

  2. You are going to need to at least think about designing features. This is one of the most important parts of making a machine learning algorithm work well on a certain problem. It's hard to know what to suggest without more idea of the problem. I guess you could start with counts of character n-grams present in the URL as features (there's a sketch of this after the list).

  3. Nobody really knows how much data you need for any specific problem. The general approach is to get some data, learn a model, see if more training data helps, and repeat until you stop getting any significant improvement.

  4. Kernels are a tricky business. Some SVM libraries have string kernels which allow you to train on strings without any feature extraction (I'm thinking of SVMsequel, there may be others). Otherwise, you need to compute numerical or binary features from your data and use the linear, polynomial or RBF kernel. There's no harm in trying them all, and it's worth spending some time finding the best settings for the kernel parameters. Your data is also obviously structured, and there's no point in letting the learning algorithm try to figure out the structure of URLs (unless you care about invalid URLs). You should at least split the URL up according to the separators '/', '?', '.', '=' (see the sketch after this list).

  5. I don't know what you mean by 'keep it up to date'. Retrain the model with whatever new data you have.

  6. This depends on the library you use; in svmlight there is a program called svm_classify that takes a model and an example and gives you a class label (good or bad). I'm sure it's going to be straightforward to do in any library (the sketch below shows the idea).
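To make points 2, 4 and 6 concrete, here is a minimal sketch. It assumes scikit-learn rather than SVMsequel or svmlight (any SVM library with similar facilities would do), and the URLs and labels are invented examples. The features are character n-grams plus tokens from splitting on the separators mentioned above, a small grid search tries linear and RBF kernels, and the last line classifies a previously unseen URL (the svm_classify step, in svmlight terms).

    # Minimal sketch, assuming scikit-learn; URLs and labels are invented examples.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import FeatureUnion, make_pipeline
    from sklearn.svm import SVC

    urls = [
        "http://somesite.com?passes=dodgy+parameter",
        "http://anothersite.com?passes=a+good+parameter",
        "http://example.com?q=ordinary+search",
        "http://badsite.com?passes=another+dodgy+parameter",
    ]
    labels = ["bad", "good", "good", "bad"]

    # Point 2: character n-grams as features, with no hand-written feature list.
    char_ngrams = CountVectorizer(analyzer="char", ngram_range=(2, 4))

    # Point 4: split the URL on '/', '?', '.', '=' (plus '&' and '+' for query strings)
    # and use the resulting pieces as token features.
    url_tokens = CountVectorizer(token_pattern=r"[^/?.=&+]+")

    features = FeatureUnion([("chars", char_ngrams), ("tokens", url_tokens)])

    # Try a couple of kernels and penalty settings; cv=2 only because the toy set is tiny.
    pipeline = make_pipeline(features, SVC())
    grid = GridSearchCV(
        pipeline,
        {"svc__kernel": ["linear", "rbf"], "svc__C": [0.1, 1, 10]},
        cv=2,
    )
    grid.fit(urls, labels)

    # Point 6: classifying a previously unseen URL.
    print(grid.predict(["http://some-new-site.com?passes=a+really+dodgy+parameter"]))

For point 5, the same pipeline can simply be re-fitted on the old data plus whatever new labelled URLs you have collected.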

If I understand correctly, you just want to learn whether a URL is good or bad.

An SVM is not appropriate; SVMs are only appropriate if the dataset is very complex and many of the data points are close to the hyperplane. You'd use an SVM to add extra dimensions to the data.

You'd ideally want a few thousand URLs to train on. The more the better; obviously you could do it with just 100, but your results may not produce good classifications.
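One way to check whether you have enough is a learning curve: train on increasing fractions of your data and see where the validation score flattens out. The sketch below assumes scikit-learn and generates synthetic labelled URLs purely as a stand-in, so only the shape of the loop is meant to carry over.

    # Sketch: does more training data still help? (scikit-learn assumed; data is synthetic.)
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.model_selection import learning_curve
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    # Synthetic stand-in for a labelled URL collection.
    urls = [f"http://site{i}.com?passes={'dodgy' if i % 2 else 'good'}+parameter" for i in range(200)]
    labels = ["bad" if i % 2 else "good" for i in range(200)]

    model = make_pipeline(CountVectorizer(analyzer="char", ngram_range=(2, 4)), MultinomialNB())

    sizes, train_scores, valid_scores = learning_curve(
        model, urls, labels, train_sizes=[0.2, 0.4, 0.6, 0.8, 1.0], cv=5
    )
    for n, score in zip(sizes, valid_scores.mean(axis=1)):
        # Once the validation score stops climbing, more data adds little.
        print(n, round(score, 3))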

I'd suggest you build your data set first and use Weka: http://www.cs.waikato.ac.nz/ml/weka/

You can measure which algorithm gives you the best results.
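As an illustration of that kind of comparison (using scikit-learn rather than Weka, with made-up data and an arbitrary choice of candidate models, so treat it as a sketch only):

    # Sketch: compare a few classifiers on the same URL features via cross-validation.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    # Synthetic stand-in for a labelled URL collection.
    urls = [f"http://site{i}.example?passes={'dodgy' if i % 3 == 0 else 'good'}+parameter" for i in range(150)]
    labels = ["bad" if i % 3 == 0 else "good" for i in range(150)]

    candidates = {
        "linear SVM": LinearSVC(),
        "naive Bayes": MultinomialNB(),
        "logistic regression": LogisticRegression(max_iter=1000),
    }
    for name, clf in candidates.items():
        model = make_pipeline(CountVectorizer(analyzer="char", ngram_range=(2, 4)), clf)
        scores = cross_val_score(model, urls, labels, cv=5)
        print(f"{name}: {scores.mean():.3f}")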

What dataset will you be using for training? If you have a good dataset, I believe an SVM will do well with a good penalty factor. If there is no dataset, I would suggest using online algorithms like kNN or even perceptrons.
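A rough sketch of what an online, perceptron-style learner could look like; SGDClassifier with partial_fit and a HashingVectorizer in scikit-learn are my assumptions here (the thread only mentions kNN and perceptrons in general), and the two-batch "stream" is just a placeholder:

    # Sketch: online learning, updating the model one batch of labelled URLs at a time.
    from sklearn.feature_extraction.text import HashingVectorizer
    from sklearn.linear_model import SGDClassifier

    # HashingVectorizer needs no fitted vocabulary, so it suits a streaming setting.
    vectorizer = HashingVectorizer(analyzer="char", ngram_range=(2, 4))
    clf = SGDClassifier(loss="perceptron")   # perceptron-style updates

    classes = ["good", "bad"]                # must be declared on the first partial_fit call

    batches = [                              # stand-in for labelled batches arriving over time
        (["http://anothersite.com?passes=a+good+parameter"], ["good"]),
        (["http://somesite.com?passes=dodgy+parameter"], ["bad"]),
    ]
    for urls, labels in batches:
        clf.partial_fit(vectorizer.transform(urls), labels, classes=classes)

    print(clf.predict(vectorizer.transform(
        ["http://some-new-site.com?passes=a+really+dodgy+parameter"])))

The hashing trick avoids having to fix a vocabulary up front, which is what makes incremental updates like this possible.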
