I have built a simple crawler for one of our clients. I am facing issues with duplicate entries in the database.
Basically, I scrape a website that lists a lot of houses for sale and pull the address, postcode, town, price and status for each one.
Later, when inserting into the database, I also generate a creation_date.
The reason is that a house CAN be a duplicate if the existing entry was INSERTED at least 2 years ago. So one house can appear twice in the database, as long as the creation dates are at least 2 years apart.
<?php
//Comparison to current houses
$query = mysql_query("SELECT street, postcode, town, price, status, creation_time, print_status FROM house"); // Selecting the table
if (!$query) {
    die('Invalid query: ' . mysql_error()); // checking for errors
}
while ($row = mysql_fetch_array($query)) {
    // $row['street'];
    // $row['postcode'];
    // $row['town'];
    // $row['price'];
    // $row['status'];
    $creation_time = $row['creation_time'];
    $print_status = $row['print_status'];
    $c = 0;
    foreach ($houses as $house) {
        $creation_time_u = strtotime($creation_time); // Makes creation time into Unix
        $life_time = strtotime('+2 years', $creation_time_u); // Calculates +2 years from creation time
        if (($row['street'] == $house[0]) && ($row['postcode'] == $house[1]) && ($row['town'] == $house[2]) && ($life_time >= $now)) {
            unset($houses[$c]); // maybe use implode? When i do unset its leaving the array but the values are gone, so we get an empty row
        }
    }
    $c++;
    $houses = array_values($houses); // FIXES BROKEN INDEX AFTER USING UNSET
}
?>
After this has been completed, I insert the new $houses array into the database and then print, which is the next step but kind of irrelevant in this case.
So, I don't know exactly what is going wrong. If I run it twice in a row, it doesn't insert duplicate entries, but if I run it again the next day or so, it inserts the same entry a second time. Here is an example of what I found in the database:
screenshot
So yeah, I have spent too much time looking at this code and I can't figure out why my filter is not working. I expect it has to do with how I am managing time, but I am not completely sure.
Please advise!
Instead of calculating the time interval in PHP, you should select the relevant houses in your SQL query (see DATE_ADD):
SELECT
    a.street, a.postcode, a.town, a.price, a.status, a.creation_time, a.print_status
FROM house AS a
JOIN house AS b
    ON a.street = b.street
    AND a.postcode = b.postcode
    AND a.town = b.town
WHERE
    a.creation_time > b.creation_time -- skip a row matching itself
    AND a.creation_time <= DATE_ADD(b.creation_time, INTERVAL 2 YEAR) -- select duplicates
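If you would rather keep the filtering in PHP, note that the loop in the question increments $c outside the inner foreach, so unset($houses[$c]) always targets index 0 rather than the house that matched. Below is a minimal sketch of the same filter using the foreach key instead. The function name filter_new_houses and its parameters are hypothetical (they are not from the question), and it assumes each scraped house is an array of [street, postcode, town, ...] and each database row carries street, postcode, town and creation_time.

```php
<?php
// Hypothetical helper, not from the original post: drops scraped houses that
// already exist in the database and were inserted less than two years ago.
// Assumes $houses entries look like [street, postcode, town, ...] and
// $existing rows have 'street', 'postcode', 'town' and 'creation_time' keys.
function filter_new_houses(array $houses, array $existing, $now)
{
    foreach ($existing as $row) {
        // End of the two-year window that starts at the row's creation time.
        $life_time = strtotime('+2 years', strtotime($row['creation_time']));
        if ($life_time < $now) {
            continue; // old enough: a new entry for the same house is allowed
        }
        foreach ($houses as $key => $house) {
            if ($row['street'] == $house[0]
                && $row['postcode'] == $house[1]
                && $row['town'] == $house[2]) {
                unset($houses[$key]); // remove the entry that actually matched
            }
        }
    }
    return array_values($houses); // reindex once, after all removals
}
```

You would call it as $houses = filter_new_houses($houses, $rows, time()); with $rows holding the rows fetched from the house table. Passing $now in explicitly also makes sure it is actually defined, which the snippet in the question does not show.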