简体繁体中英

databases and frontend: load balancing for analyzing data

原文 2022-06-22 11:23:46 0 1 database/ performance/ text/ load

I have a scrapper which gets news-articles over the day by different sources.

I want to display data like 'most common words in the last 30 days (in source X)' on my page. For now I have saved the articles to my database consisting of the timestamp the article was released and a string of the content. With a few datasets this works fine, but I do no understand how to balance the load, that the front end has most flexibility but not too much data to count.

I thought you could run a script, which takes all the articles from one day and create a new tables containing each word with its count. I came across two points here:

1 - How do I create a table for this? Since every article has different length and different sets of words I would need a table with as many fields, as the number of words in the longest article. I could say I will only save the first 20, but I don't really like the idea.

2 - If the script takes all the articles from one day and calculates the word_counts, I have a minimum resolution of 1 day. So I won't be able to differentiate any further. I chose the script to run for each day to reduce the data that I will need to send to the front on demand.

1 answers

Don't create a table with a separate column for each of the first 20 words. Please. I beg you. Just don't.

Two possible approaches.

Use a fulltext search feature in your DBMS. You didn't tell us which one you use, so it's hard to be more specific.
Preprocess: Create a table with columns article_id , word_number , and word . This table will have a large number of rows, one for each word in each article. But that's OK. SQL databases are made for handling vast tables of simple rows.

The unique key on the table contains two columns: article_id and word_number . A non-unique key for searching should contain word , article_id , word_number .

When you receive an incoming article, assign it an article_id number. Then break it up into words and insert each word into the table.

When you search for a word do SELECT article_id FROM words WHERE word=? . Fast. And you can use SQL set manipulation to do more complex searches.

When you remove an article from your archive, DELETE the rows with that article_id value.

To get frequencies do SELECT COUNT(*) frequency, word FROM words GROUP BY word ORDER BY 1 DESC LIMIT 50 .

Ecommerce frontend split databases

Memcached + batch data loading + replication + load balancing, any existing solutions?

Database cluster and load balancing

Load Balancing Multiple Django Webservers

Load-Balancing Database in Heroku

Is there a general way to load/store spatial data from/to different kinds of databases?

AWS EB: load balancing and containerized database

MongoDB load balancing in multiple AWS instances

Data analyzing in R with nycfights13 package

load all databases from the source to ADLG2(azure data lake gen2)

暂无

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

Related Question Ecommerce frontend split databases Memcached + batch data loading + replication + load balancing, any existing solutions? Database cluster and load balancing Load Balancing Multiple Django Webservers Load-Balancing Database in Heroku Is there a general way to load/store spatial data from/to different kinds of databases? AWS EB: load balancing and containerized database MongoDB load balancing in multiple AWS instances Data analyzing in R with nycfights13 package load all databases from the source to ADLG2(azure data lake gen2)

Related Tags

databases and frontend: load balancing for analyzing data

Question

1 answers

solution1 0 2022-06-22 19:52:34

solution1
0 2022-06-22 19:52:34