
Databases and frontend: load balancing for analyzing data

I have a scraper which collects news articles throughout the day from different sources.

I want to display data like 'most common words in the last 30 days (in source X)' on my page. For now I save each article to my database with the timestamp it was released and a string of its content. With a few datasets this works fine, but I do not understand how to balance the load so that the front end has maximum flexibility without too much data to count.

I thought I could run a script which takes all the articles from one day and creates a new table containing each word with its count. I ran into two issues here:

1 - How do I create a table for this? Since every article has a different length and a different set of words, I would need a table with as many fields as the number of words in the longest article. I could say I will only save the first 20, but I don't really like that idea.

2 - If the script takes all the articles from one day and calculates the word counts, I have a minimum resolution of one day, so I won't be able to differentiate any further. I chose to run the script once per day to reduce the data I need to send to the front end on demand.

Don't create a table with a separate column for each of the first 20 words. Please. I beg you. Just don't.

Two possible approaches.

  1. Use a fulltext search feature in your DBMS. You didn't tell us which one you use, so it's hard to be more specific.

  2. Preprocess: create a table with columns article_id, word_number, and word. This table will have a large number of rows, one for each word in each article. But that's OK. SQL databases are made for handling vast tables of simple rows.

The unique key on the table contains two columns: article_id and word_number. A non-unique key for searching should contain word, article_id, word_number.

When you receive an incoming article, assign it an article_id number. Then break it up into words and insert each word into the table.
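As a minimal sketch of that schema and the insertion step, here it is in Python with the stdlib sqlite3 module (SQLite chosen purely for illustration, since the question doesn't name a DBMS; the table and column names match the answer, and the regex tokenizer is a naive placeholder):

```python
import re
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE words (
        article_id  INTEGER NOT NULL,
        word_number INTEGER NOT NULL,
        word        TEXT    NOT NULL,
        PRIMARY KEY (article_id, word_number)  -- the unique key
    );
    -- non-unique key for searching by word
    CREATE INDEX idx_words_word ON words (word, article_id, word_number);
""")

def index_article(article_id, content):
    """Break an article into words and insert one row per word."""
    tokens = re.findall(r"[a-z']+", content.lower())  # naive tokenizer
    conn.executemany(
        "INSERT INTO words (article_id, word_number, word) VALUES (?, ?, ?)",
        [(article_id, n, w) for n, w in enumerate(tokens)],
    )

index_article(1, "The cat sat on the mat.")
rows = conn.execute(
    "SELECT article_id FROM words WHERE word = ?", ("cat",)
).fetchall()
print(rows)  # -> [(1,)]
```

word_number doubles as position information, so phrase searches ("word A directly before word B") stay possible later.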

When you search for a word, do SELECT article_id FROM words WHERE word=?. Fast. And you can use SQL set manipulation to do more complex searches.
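One example of that set manipulation: finding articles that contain both of two words with INTERSECT. A sketch against the same words table (SQLite again, for illustration only; the sample words are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE words (article_id INTEGER, word_number INTEGER, word TEXT)"
)
sample = [(1, 0, "rain"), (1, 1, "wind"), (2, 0, "rain"), (3, 0, "wind")]
conn.executemany("INSERT INTO words VALUES (?, ?, ?)", sample)

# Articles containing 'rain' AND 'wind': intersect the two id sets.
both = conn.execute("""
    SELECT article_id FROM words WHERE word = 'rain'
    INTERSECT
    SELECT article_id FROM words WHERE word = 'wind'
""").fetchall()
print(both)  # -> [(1,)]
```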

When you remove an article from your archive, DELETE the rows with that article_id value.

To get frequencies, do SELECT COUNT(*) frequency, word FROM words GROUP BY word ORDER BY 1 DESC LIMIT 50.
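This also answers the 30-day-window part of the question: since each article keeps its release timestamp, you can join against the articles table and filter before grouping, at any resolution you like. A sketch, assuming a hypothetical articles(article_id, released_at) table alongside words (SQLite for illustration; names and sample data are made up):

```python
import sqlite3
from datetime import datetime, timedelta

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE articles (article_id INTEGER PRIMARY KEY, released_at TEXT);
    CREATE TABLE words (article_id INTEGER, word_number INTEGER, word TEXT,
                        PRIMARY KEY (article_id, word_number));
""")

now = datetime(2024, 1, 31)
conn.execute("INSERT INTO articles VALUES (1, ?)", (now.isoformat(),))
conn.execute("INSERT INTO articles VALUES (2, ?)",
             ((now - timedelta(days=60)).isoformat(),))  # outside the window

for aid, text in [(1, "rates rise rates fall"), (2, "old old news")]:
    for n, w in enumerate(text.split()):
        conn.execute("INSERT INTO words VALUES (?, ?, ?)", (aid, n, w))

# Frequency query restricted to the last 30 days.
cutoff = (now - timedelta(days=30)).isoformat()
top = conn.execute("""
    SELECT COUNT(*) AS frequency, w.word
    FROM words w JOIN articles a USING (article_id)
    WHERE a.released_at >= ?
    GROUP BY w.word
    ORDER BY 1 DESC LIMIT 50
""", (cutoff,)).fetchall()
print(top)  # (2, 'rates') first; article 2's words are excluded
```

No per-day precomputation is needed, so there is no 1-day resolution floor; the database does the counting on demand.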
