
How to identify a new pattern in a URL with a machine learning algorithm (Text mining)

I am trying to identify new patterns after analyzing a number of URLs. So let's say I am investigating the hypothetical website Yoohle.com, whose URLs have the following structure.

  • domain = yoohle.com
  • q= search phrase
  • lan= language used
  • pr= partner_id
  • br= browser_id

So a sample URL will look like this:

www.yoohle.com/test_folder/test_page?q=hello+world&lan=en&pr=stackoverflow&br=chrome
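
Tokenizing such a URL is straightforward with Python's standard library. A minimal sketch (the `http://` prefix is added only so that `urlsplit` can locate the query string):

```python
from urllib.parse import urlsplit, parse_qs

# "http://" added only so urlsplit can locate the query string
url = "http://www.yoohle.com/test_folder/test_page?q=hello+world&lan=en&pr=stackoverflow&br=chrome"
tokens = parse_qs(urlsplit(url).query)
print(tokens)  # {'q': ['hello world'], 'lan': ['en'], 'pr': ['stackoverflow'], 'br': ['chrome']}
```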

If I am investigating the web traffic of this website and see an abnormal increase month over month, I would like to find out what is causing it. In this example I can just parse the URL and look at the pr= value, since it will tell me if there is a new partnership (maybe stackoverflow is going to be powered by yoohle.com and that drives the increase, etc.).

The question is: how can I build something robust that can compare two (or more) months and tell me exactly what is driving the increase? I want to get something like, "we are seeing an increase and it is driven by the following pattern":

www.yoohle.com/test_folder/test_page%pr=stackoverflow%

The tricky part is that, unlike in this example, you do not know anything about what the tokens mean, since I will not know which token stands for partner_id. Another issue is that looking token by token will be misleading, because lan=en will also go up with a new partner, assuming the new users still have English as their language.

My idea is to analyze the tokens by looking at all the combinations, but that is very costly (2^4 - 1 = 15 non-empty combinations in this example, and it grows exponentially for websites with 10+ tokens). Also, analyzing the tokens alone is not going to solve the problem, since I still need to analyze the values of the tokens.
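
The number of non-empty token combinations is 2^n - 1, which a small sketch makes concrete:

```python
from itertools import combinations

tokens = ["q", "lan", "pr", "br"]
# every non-empty subset of tokens: 2**n - 1 of them
subsets = [c for r in range(1, len(tokens) + 1) for c in combinations(tokens, r)]
print(len(subsets))  # 15 for 4 tokens; 1023 for 10 tokens
```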

I tried k-means clustering and the Apriori algorithm, and did some research on URL/text mining, but could not get what I want. Any ideas on how to approach building such an algorithm would be appreciated.

Imagine that you are seeing real-time data, so we are talking about analyzing around 100K URLs in a given month.

I would approach it the following way. You can create the following table:

URL
time
time_month -- time rounded to month, for demonstration purposes
q_bol      -- boolean flag: whether the query parameter was used
q          -- query parameter value
lan_bol    -- boolean flag: whether the language parameter was used
lan        -- language parameter value
pr_bol     -- boolean flag: whether the partner parameter was used
pr         -- partner parameter value
br_bol     -- boolean flag: whether the browser parameter was used
br         -- browser parameter value
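
A minimal sketch of flattening one URL into such a row, assuming the four token names are known up front (`to_row` and `KNOWN_TOKENS` are hypothetical names, not part of the original answer):

```python
from urllib.parse import urlsplit, parse_qs

KNOWN_TOKENS = ["q", "lan", "pr", "br"]  # assumed known up front

def to_row(url, time):
    """Flatten one URL into a dict matching the table above."""
    qs = parse_qs(urlsplit(url).query)
    row = {"url": url, "time": time, "time_month": time[:7]}  # e.g. '2013-03'
    for tok in KNOWN_TOKENS:
        row[tok + "_bol"] = tok in qs          # was the token used at all?
        row[tok] = qs[tok][0] if tok in qs else None
    return row
```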

Now you can write a query like the following:

with t as (
select
  time_month,
  q_bol, lan_bol, pr_bol, br_bol,
  count(*) as cnt
from
  urldata
where
  time_month >= '2013-02-01'::date and time_month < '2013-04-01'::date -- last two months of data
group by
  time_month, q_bol, lan_bol, pr_bol, br_bol
)

, u as (
select
  q_bol, lan_bol, pr_bol, br_bol,
  coalesce(t2.cnt, 0) - coalesce(t1.cnt, 0) as abs_change, -- change in pattern MoM
  case when coalesce(t1.cnt, 0) = 0 then null
       else t2.cnt::numeric / t1.cnt end as rel_change     -- relative change
from
  (select * from t where time_month = '2013-02-01'::date) t1
  full outer join
  (select * from t where time_month = '2013-03-01'::date) t2
  using (q_bol, lan_bol, pr_bol, br_bol)
)

select * from u where abs_change > 5000 or rel_change > 3

The query above gives you the parameter patterns where there is more than a 5000 change month over month, or more than a 300% increase month over month. If your SQL system supports group by rollup, it would also give you higher-level aggregations (combinations of three parameters, two parameters, one parameter).
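
The same month-over-month comparison can be sketched outside SQL. The counts below are made-up illustrative numbers, with each pattern keyed by which tokens are present:

```python
from collections import Counter

# made-up monthly counts, keyed by which tokens are present in the URL
feb = Counter({("q", "lan", "br"): 40000, ("q", "lan", "pr", "br"): 2000})
mar = Counter({("q", "lan", "br"): 41000, ("q", "lan", "pr", "br"): 9000})

flagged = []
for pattern in sorted(set(feb) | set(mar)):
    prev, cur = feb.get(pattern, 0), mar.get(pattern, 0)
    abs_change = cur - prev
    rel_change = cur / prev if prev else None
    # same thresholds as the SQL: >5000 absolute change or >300% increase
    if abs_change > 5000 or (rel_change is not None and rel_change > 3):
        flagged.append((pattern, abs_change, rel_change))

print(flagged)  # [(('q', 'lan', 'pr', 'br'), 7000, 4.5)]
```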

You can do much the same with the values of the parameters. Because you do not know in advance which tokens will be present, you can parse each URL into the following table structure:

-- urls
id_url
url
time

-- parameters
id_url
token
value

Then you will need to rewrite the query above in some way; e.g. you can use PostgreSQL's array aggregation function array_agg().
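
A sketch of the analogous aggregation in Python: grouping the (id_url, token, value) rows into one sorted tuple per URL, much as array_agg() would, and then counting how often each full pattern occurs (the rows are made-up sample data):

```python
from collections import Counter

# made-up rows from the "parameters" table: (id_url, token, value)
rows = [
    (1, "q", "hello world"), (1, "lan", "en"), (1, "pr", "stackoverflow"),
    (2, "q", "foo"), (2, "lan", "en"),
]

by_url = {}
for id_url, token, value in rows:
    by_url.setdefault(id_url, []).append((token, value))

# one sorted (token, value) tuple per URL, like array_agg(); then count patterns
patterns = Counter(tuple(sorted(pairs)) for pairs in by_url.values())
print(patterns.most_common())
```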
