
Storing and updating scraped data in a database

Beginner - PHP - Scraping - Databases - Design question ahead:

I built a PHP script that scrapes articles off a website (using curl). I styled the output and added HTML tags so it's presentable, and uploaded it to shared hosting through cPanel.

The scraping is done with PHP's curl functions and preg_match_all. Every scraped page has 17 articles, so if I scrape 100 pages that's 1,700 articles. I'm scraping only the article headline, URL, article summary, and publication date (not the content), so there isn't much data per article.
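
A minimal sketch of that fetch-and-extract loop, for reference (the URL and the regex below are placeholders for the real site's markup):

    <?php
    // Minimal sketch of the fetch-and-extract loop; the URL and the regex
    // are placeholders standing in for the real site's markup.
    $baseUrl = 'https://example.com/articles?page=';

    for ($page = 1; $page <= 100; $page++) {
        $ch = curl_init($baseUrl . $page);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        $html = curl_exec($ch);
        curl_close($ch);

        // One capture group each for url, headline, summary and date.
        preg_match_all(
            '#<article>.*?<a href="(.*?)">(.*?)</a>.*?<p>(.*?)</p>.*?<time>(.*?)</time>#s',
            $html,
            $matches,
            PREG_SET_ORDER
        );
        // $matches now holds up to 17 article tuples for this page.
    }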

My website presents the article headline (which links to the original source) and the article summary. I also extract the published date as a string, which I parse so I can show the articles in monthly blocks (December 2019, November 2019, and so forth).
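
The monthly grouping can be a few lines, assuming each scraped row carries a date string that PHP's DateTime can parse:

    <?php
    // Group articles into monthly blocks keyed by e.g. "December 2019".
    // Assumes $articles is an array of rows with a parseable 'date' string.
    $byMonth = [];
    foreach ($articles as $article) {
        $month = (new DateTime($article['date']))->format('F Y');
        $byMonth[$month][] = $article;
    }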

Loading time of the website is terrible. Every time someone opens the website, the script scrapes all 100 pages, which takes a lot of time. Even after decreasing the number of scraped pages to 30, it still takes a long time to load.

Now, the issue I'm dealing with as an inexperienced developer is how to design a solution for this that I can ultimately implement on my shared hosting (which differs from a VPS in the amount of control I have).

The first thought that came to my mind is that I should store the scraped data in a MySQL database and update it regularly (with a cron job?). Would that work? Is it a feasible solution that is easy to implement?
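
For a concrete idea of what that could look like, here is a sketch of a one-off setup script (the table name, column names, and credentials are all assumptions); the UNIQUE constraint on url is what makes duplicate-free updates easy later:

    <?php
    // One-off setup sketch: an articles table with a UNIQUE url column,
    // so re-scraping the same article cannot create a duplicate row.
    // Table/column names and credentials are assumptions.
    $pdo = new PDO('mysql:host=localhost;dbname=scraper', 'user', 'secret');
    $pdo->exec("
        CREATE TABLE IF NOT EXISTS articles (
            id           INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
            url          VARCHAR(255) NOT NULL UNIQUE,
            headline     VARCHAR(255) NOT NULL,
            summary      TEXT,
            published_at DATE,
            KEY (published_at)
        )
    ");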

What is the right design flow for a solution that a beginner can/should implement here?

Would it be something like:

  1. Scrape the data the first time and store it in the database.
  2. Write a script that, once a day (or more often), collects new articles from only the first page without creating duplicates. I'd probably need to compare headline strings against the latest rows in the DB until something like headlineString1 == headlineString2, and then stop inserting (though a simpler upsert-based approach is sketched after this list).
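
A sketch of step 2, assuming the articles table above with its UNIQUE url column: an INSERT ... ON DUPLICATE KEY UPDATE turns re-inserting a known article into a harmless no-op, so the headline-comparison loop isn't needed. scrapeFirstPage() is a hypothetical stand-in for the curl + preg_match_all code.

    <?php
    // Daily cron script sketch: scrape only the first page and upsert each row.
    // The UNIQUE url index makes re-inserting an existing article a no-op.
    $pdo = new PDO('mysql:host=localhost;dbname=scraper', 'user', 'secret');
    $stmt = $pdo->prepare(
        'INSERT INTO articles (url, headline, summary, published_at)
         VALUES (:url, :headline, :summary, :published_at)
         ON DUPLICATE KEY UPDATE headline = VALUES(headline), summary = VALUES(summary)'
    );

    // scrapeFirstPage() is a hypothetical helper wrapping the scraping code.
    foreach (scrapeFirstPage() as $article) {
        $stmt->execute([
            ':url'          => $article['url'],
            ':headline'     => $article['headline'],
            ':summary'      => $article['summary'],
            ':published_at' => $article['published_at'],
        ]);
    }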

Your thoughts and suggestions would be highly appreciated. BTW, I'm also currently trying to redo the website more professionally in Laravel.

Running the scraping script on a time interval and fetching already-scraped content from the database for display will obviously be faster.

But first and foremost, since you're going to use Laravel, you can easily take advantage of its task scheduling feature and separate your scraping logic from the presentation logic.

So, say, run the scraping logic every five minutes, or even every minute (if you have enough resources allocated to run cron). With the presentation logic completely separated from the scraping logic, the website will load much faster, since the presentation logic no longer needs to wait on the scraping logic to render the page.
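
Here is a sketch of what that looks like with Laravel's scheduler, assuming the scraping logic is wrapped in a hypothetical scrape:articles artisan command:

    <?php
    // app/Console/Kernel.php -- register the (hypothetical) scrape:articles
    // command with Laravel's scheduler instead of scraping on every request.

    namespace App\Console;

    use Illuminate\Console\Scheduling\Schedule;
    use Illuminate\Foundation\Console\Kernel as ConsoleKernel;

    class Kernel extends ConsoleKernel
    {
        protected function schedule(Schedule $schedule)
        {
            $schedule->command('scrape:articles')->everyFiveMinutes();
        }
    }

The scheduler itself only needs a single cron entry on the host (* * * * * cd /path/to/project && php artisan schedule:run >> /dev/null 2>&1), which is the one thing the shared hosting has to allow.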

Checking the headline for duplicate content might work, or, if the source URLs don't change often, checking the URL for duplicates will work too.
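
In Laravel terms, URL-based deduplication is a one-liner with Eloquent's updateOrCreate(), assuming an Article model (an assumed name) backed by an articles table with a unique url column:

    <?php
    // Look the row up by url; insert it only if missing, refresh it otherwise.
    // Article is an assumed Eloquent model backed by the articles table.
    use App\Models\Article;

    Article::updateOrCreate(
        ['url' => $article['url']],
        [
            'headline'     => $article['headline'],
            'summary'      => $article['summary'],
            'published_at' => $article['published_at'],
        ]
    );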

I don't have much knowledge of hosting, so I can't give any advice on how much of this can be implemented on your shared hosting.
