简体   繁体   中英

Database Design for hierarchical Data

I am trying to design a database to store large amount of urls. Now i want to have count of different combinations among parts of url. for example how many time mens section comes with flipkart? how many time flipkart has come? any idea how to design it effeciently?

Add 'domain' column with index

create table URLS(
    id longint primary key,
    full_url varchar(255), 
    domain varchar(100)
    page_name varchar(100));
create index on URLS (domain);
create table parameters(
    id longint foreign key referencing URLS(id),
    param_name varchar(100),
    param_value varchar(100));

select count(a.full_url)
from URLS a, parameters b
where a.id=b.id
    and (b.param_name='user' and b.param_value='Jack');

You can use a structure similar to the one posted by @rjhdby. But apart from that, you'll need some programmatic consolidation of URL paths for the so-called 'sections'. Some are not really useful depending on the site. So you need a mapping which you can grow/build and then periodically filter on to determine which parts of URLs you consider useful. That could happen automatically since common parts will repeat and be unique to that site's URLs. But you also need to take into account sites where the authentication key or some other token is included in the middle of a URL and you need to avoid those. They will always be unique and of no use in this analysis.

Assuming you're building this for a known finite set of sites, this is possible to do. However, if you're say, at an ISPs gate, you'll get an insane number of unique sites. Any mapping/filtering to determine the correct unique parts of their paths will be an enormous task.

As an example of the 'keys' that appear in a URL, take a twitter status: https://twitter.com/aneroidx/status/427684072920342528 ( link ).
Here, the first part is the domain , then username (but not really 'path'), then ' status ' which is a part you know is a section to monitor, and then an unique 'tweet id' - which you wouldn't consider a section either. So you may need to determine these correct unique sections before you put them into a database, or put them in raw and run a separate program to create the proper entries for unique sections based on filters/rules like that above.

This is an much about hierarchical data as it is about filtering URL paths and sections correctly.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM