How to remove duplicates while inserting new records into MongoDB using PyMongo in a Scrapy project

In my Scrapy project I'm storing the scraped data in MongoDB using PyMongo. Crawling the site page by page produces duplicate records, and I want to drop records that share the same name at the time they are inserted into the database. Please suggest the best solution. Here is my code in "pipelines.py". Please show me how to remove duplicates in the "process_item" method. I found a few queries online that remove duplicates from the database, but I want a solution in Python.

from pymongo import MongoClient
from scrapy.conf import settings
class MongoDBPipeline(object):

    def __init__(self):
        connection = MongoClient(
            settings['MONGODB_SERVER'],
            settings['MONGODB_PORT'])
        db = connection[settings['MONGODB_DB']]
        self.collection = db[settings['MONGODB_COLLECTION']]

    def process_item(self, item, spider):
        self.collection.insert(dict(item))
        return item

It depends slightly on what's in the item, but I would use update_one with upsert, like this:

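def process_item(self, item, spider):
    # Filter on the field that identifies a duplicate. The question
    # deduplicates by name, so 'name' is assumed here; adjust as needed.
    name = item.get('name')
    if name:
        # With upsert=True, update_one inserts the document when no
        # match exists and otherwise updates the matching one.
        # The update document must use an operator such as $set.
        self.collection.update_one(
            {'name': name},
            {'$set': dict(item)},
            upsert=True,
        )
    return item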

You could also play around with the filter. Basically, you wouldn't even have to remove dupes. Applied properly, it works like an if-else condition: if the object doesn't exist, create it; otherwise, update the given keys with the given values, like in a dictionary. In the worst case it rewrites the same values. So it's faster than inserting, querying, and then deleting the duplicates you find.
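To make the if-else analogy concrete, here is a minimal sketch that mimics upsert behaviour on a plain Python dict instead of a real collection. The helper `upsert_by_name` and its `key_field` parameter are hypothetical names made up for illustration; nothing here talks to MongoDB, it only models the "create if missing, otherwise update" logic.

```python
def upsert_by_name(store, item, key_field="name"):
    """Model of an upsert on a dict keyed by key_field (illustration only)."""
    key = item.get(key_field)
    if key is None:
        return False  # nothing to match on, so skip the item
    if key in store:
        store[key].update(item)  # record exists: update its fields
    else:
        store[key] = dict(item)  # record missing: create it
    return True


store = {}
scraped_items = [
    {"name": "abc", "price": 1},
    {"name": "abc", "price": 2},  # same name, scraped from a later page
    {"name": "xyz", "price": 3},
]
for scraped in scraped_items:
    upsert_by_name(store, scraped)

print(len(store))             # 2
print(store["abc"]["price"])  # 2
```

The second "abc" item overwrites the first rather than adding a second record, which is the effect the upsert has in the pipeline.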

docs

There's no literal if-else in MongoDB, and @tanaydin's advice to drop dupes automatically also works from Python. It could be better than my advice, depending on what you really need.

If you really want to remove documents given some criteria, then there's delete_one and delete_many in pymongo.
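As a rough sketch of that cleanup route, the function below scans a collection, keeps the first document seen for each name, and removes the rest with a single delete_many call. The function name `remove_duplicates` and the `key_field` parameter are assumptions for illustration; `collection` is expected to behave like a PyMongo collection (only `find` and `delete_many` are used), and the approach assumes the key values fit in memory.

```python
def remove_duplicates(collection, key_field="name"):
    """Delete all but the first document seen for each key_field value.

    Returns the number of documents deleted.
    """
    seen = set()
    duplicate_ids = []
    # Project only the key field; _id is included by default.
    for doc in collection.find({}, {key_field: 1}):
        value = doc.get(key_field)
        if value is None:
            continue  # leave documents without the key untouched
        if value in seen:
            duplicate_ids.append(doc["_id"])
        else:
            seen.add(value)
    if not duplicate_ids:
        return 0
    result = collection.delete_many({"_id": {"$in": duplicate_ids}})
    return result.deleted_count
```

In the pipeline from the question this could be called as remove_duplicates(self.collection), for example once the crawl has finished, although upserting at insert time avoids the extra pass.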

docs
