I'm trying to retrieve specific documents from my collection. I want documents that contain a substring in one field of my database (display_url) and that also contain at least one of a set of keywords in another field (edge_media_to_caption.edges.node.text). The first field is a URL, so I need a wildcard; the only thing that seems to work is this pattern: .*
However, I'm having problems with the second part of my match, where I use $in; I think it is not working. This second field is a string field containing text.
So I need documents that match a regular expression that I supply (I tested this part on its own and it works) and that also contain at least one of the words ['. corona. ','. virus. ','. vírus. ','. covid. ','. pandemia. ','. pândemia. '] in the text.
from pymongo import MongoClient

client = MongoClient('localhost', 27017)
db = client.basededados
collection = getattr(db, pdados)  # pdados holds the collection name
pipeline = [
    {'$project': {'_id': True,
                  'legenda': '$edge_media_to_caption.edges.node.text',
                  'data': '$taken_at_timestamp',
                  'hash': '$tags',
                  'id': '$display_url'}},
    {'$match': {'$and': [
        {'id': {'$regex': '/%s/' % nitem[0]}},
        {'legenda': {'$in': ['.*corona.*', '.*virus.*', '.*vírus.*', '.*covid.*', '.*pandemia.*', '.*pândemia.*']}},
    ]}},
]
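To wildcard-match a string, use a regex. With plain strings, $in tests for exact equality, so a value such as '.*corona.*' is compared literally rather than treated as a pattern. In pure Mongo: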
{$in: [/\.corona\./, ...]}
In pymongo, you can use native Python regexen:
import re
...
{'$in': [re.compile(r'\.corona\.'), ...]}
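Putting this together with the pipeline from the question, a minimal sketch might look like the following. The helper name build_pipeline, the KEYWORDS tuple, and the sample URL pattern are hypothetical and only for illustration. The sketch assumes the keywords should match anywhere in the caption, so each one is compiled as an unanchored pattern (no leading or trailing .* is needed), and it assumes the URL pattern is passed as a plain regex string.

```python
import re

# Hypothetical keyword list, taken from the words in the question.
KEYWORDS = ("corona", "virus", "vírus", "covid", "pandemia", "pândemia")


def build_pipeline(url_pattern, keywords=KEYWORDS):
    """Build an aggregation pipeline that keeps documents whose URL matches
    url_pattern and whose caption contains at least one keyword."""
    # Compiled patterns are sent to MongoDB as regexes, so $in performs a
    # pattern match instead of an exact string comparison.
    keyword_patterns = [re.compile(re.escape(word)) for word in keywords]
    return [
        {"$project": {
            "_id": True,
            "legenda": "$edge_media_to_caption.edges.node.text",
            "data": "$taken_at_timestamp",
            "hash": "$tags",
            "id": "$display_url",
        }},
        # Several keys in one $match document are combined with an implicit AND.
        {"$match": {
            "id": {"$regex": url_pattern},
            "legenda": {"$in": keyword_patterns},
        }},
    ]


pipeline = build_pipeline("example")  # "example" is a placeholder pattern
# With a live connection: results = list(collection.aggregate(pipeline))
```

Escaping each keyword with re.escape is a design choice here: it keeps the words literal, which is unnecessary for these particular words but avoids surprises if a keyword ever contains a regex metacharacter.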