
Databases and front end: balancing the load when analyzing data

I have a scraper that collects news articles from different sources throughout the day.

I want to display data like 'most common words in the last 30 days (in source X)' on my page. For now I save each article to my database as the timestamp it was released plus a string with its content. With a few records this works fine, but I do not understand how to balance the load so that the front end keeps as much flexibility as possible without having to count too much data.

I thought I could run a script that takes all the articles from one day and creates a new table containing each word with its count. I ran into two problems here:

1 - How do I create a table for this? Since every article has a different length and a different set of words, I would need a table with as many fields as there are words in the longest article. I could save only the first 20, but I don't really like the idea.

2 - If the script takes all the articles from one day and calculates the word_counts, I have a minimum resolution of one day, so I won't be able to break the data down any further. I chose to run the script once per day to reduce the amount of data I need to send to the front end on demand.

Don't create a table with a separate column for each of the first 20 words. Please. I beg you. Just don't.

Two possible approaches.

  1. Use a fulltext search feature in your DBMS. You didn't tell us which one you use, so it's hard to be more specific.

  2. Preprocess: Create a table with columns article_id, word_number, and word. This table will have a large number of rows, one for each word in each article. But that's OK. SQL databases are made for handling vast tables of simple rows.

The unique key on the table contains two columns: article_id and word_number. A non-unique key for searching should contain word, article_id, word_number.
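As a rough sketch of that table, here is one way it could be declared. The answer does not name a DBMS, so this assumes SQLite through Python's sqlite3 module purely for illustration; the index name words_by_word is made up, and the column types are guesses.

```python
import sqlite3

# Assumption: SQLite, in memory, only to make the sketch runnable.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE words (
    article_id  INTEGER NOT NULL,
    word_number INTEGER NOT NULL,   -- position of the word within the article
    word        TEXT    NOT NULL,
    PRIMARY KEY (article_id, word_number)   -- the unique key
);
-- Non-unique key used when searching by word (name is hypothetical).
CREATE INDEX words_by_word ON words (word, article_id, word_number);
""")

# List the indexes SQLite created for the table.
index_names = [row[1] for row in conn.execute("PRAGMA index_list('words')")]
```

Other database systems use slightly different syntax for types and indexes, but the two keys carry over unchanged.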

When you receive an incoming article, assign it an article_id number. Then break it up into words and insert each word into the table.
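The ingestion step might look roughly like this. The helper names tokenize and insert_article are hypothetical, the word-splitting rule (lowercase, then runs of letters, digits, and apostrophes) is a deliberately simple assumption, and SQLite is again used only so the sketch runs.

```python
import re
import sqlite3

def tokenize(text):
    """Lowercase the text and return its words in order (simplistic rule)."""
    return re.findall(r"[a-z0-9']+", text.lower())

def insert_article(conn, article_id, text):
    """Store one row per word, numbered by position within the article."""
    rows = [(article_id, n, w) for n, w in enumerate(tokenize(text), start=1)]
    conn.executemany(
        "INSERT INTO words (article_id, word_number, word) VALUES (?, ?, ?)",
        rows,
    )
    conn.commit()
    return len(rows)

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE words (article_id INTEGER, word_number INTEGER, word TEXT, "
    "PRIMARY KEY (article_id, word_number))"
)
inserted = insert_article(conn, 1, "The cat sat. The cat slept!")
```

A real scraper would probably also want to drop stop words or handle non-English text, which this sketch ignores.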

When you search for a word, do SELECT article_id FROM words WHERE word=?. Fast. And you can use SQL set manipulation to do more complex searches.
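One example of such set manipulation, as a sketch: finding the articles that contain both of two words with INTERSECT. The function name articles_with_all and the sample rows are invented for illustration, and SQLite is assumed only to keep the example runnable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE words (article_id INTEGER, word_number INTEGER, word TEXT, "
    "PRIMARY KEY (article_id, word_number))"
)
# Made-up sample data: three tiny articles.
conn.executemany("INSERT INTO words VALUES (?, ?, ?)", [
    (1, 1, "election"), (1, 2, "budget"),
    (2, 1, "election"), (2, 2, "weather"),
    (3, 1, "budget"),
])

def articles_with_all(conn, first, second):
    """Return ids of articles that contain both words."""
    sql = """
        SELECT article_id FROM words WHERE word = ?
        INTERSECT
        SELECT article_id FROM words WHERE word = ?
        ORDER BY article_id
    """
    return [row[0] for row in conn.execute(sql, (first, second))]

both = articles_with_all(conn, "election", "budget")
```

UNION and EXCEPT work the same way for "either word" and "one word but not the other" searches.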

When you remove an article from your archive, DELETE the rows with that article_id value.

To get frequencies, do SELECT COUNT(*) frequency, word FROM words GROUP BY word ORDER BY 1 DESC LIMIT 50.
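To get from that query to something like 'most common words in the last 30 days (in source X)', the word rows can be joined back to the articles. The sketch below assumes an articles table with source and published_at columns; that table, its column names, the function top_words, and the sample data are all assumptions, since the question only says a timestamp and the content are stored.

```python
import sqlite3
from datetime import datetime, timedelta

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical articles table; the real schema is not given in the question.
CREATE TABLE articles (article_id INTEGER PRIMARY KEY, source TEXT, published_at TEXT);
CREATE TABLE words (article_id INTEGER, word_number INTEGER, word TEXT,
                    PRIMARY KEY (article_id, word_number));
""")

now = datetime(2024, 1, 31, 12, 0, 0)          # fixed "current time" for the example
recent = (now - timedelta(days=2)).isoformat(sep=" ")
old = (now - timedelta(days=90)).isoformat(sep=" ")
conn.executemany("INSERT INTO articles VALUES (?, ?, ?)",
                 [(1, "X", recent), (2, "X", old), (3, "Y", recent)])
conn.executemany("INSERT INTO words VALUES (?, ?, ?)", [
    (1, 1, "tax"), (1, 2, "tax"), (1, 3, "vote"),
    (2, 1, "storm"), (2, 2, "storm"), (2, 3, "storm"),
    (3, 1, "vote"),
])

def top_words(conn, source, since, limit=50):
    """Most frequent words for one source, counting articles published on or after `since`."""
    sql = """
        SELECT w.word, COUNT(*) AS frequency
        FROM words AS w
        JOIN articles AS a ON a.article_id = w.article_id
        WHERE a.source = ? AND a.published_at >= ?
        GROUP BY w.word
        ORDER BY frequency DESC, w.word
        LIMIT ?
    """
    return conn.execute(sql, (source, since, limit)).fetchall()

cutoff = (now - timedelta(days=30)).isoformat(sep=" ")
result = top_words(conn, "X", cutoff)
```

Because the timestamp stays on the article rather than being baked into a daily summary, the same query works for any time window, so the resolution is not limited to one day.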

