
Django/PostgreSQL Full Text Search - Different search results when using SearchVector versus SearchVectorField on AWS RDS PostgreSQL

I'm trying to use the Django SearchVectorField to support full text search. However, I'm getting different search results when I use the SearchVectorField on my model vs. instantiating a SearchVector class in my view. The problem is isolated to an AWS RDS PostgreSQL instance. Both perform the same on my laptop.

Let me try to explain it with some code:

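# models.py

from django.contrib.postgres.indexes import GinIndex
from django.contrib.postgres.search import SearchVectorField
from django.db import models

class Tweet(models.Model):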
    def __str__(self):
        return self.tweet_id

    tweet_id = models.CharField(max_length=25, unique=True)
    text = models.CharField(max_length=1000)
    text_search_vector = SearchVectorField(null=True, editable=False)

    class Meta:
        indexes = [GinIndex(fields=['text_search_vector'])]

I've populated all rows with a search vector and have established a trigger on the database to keep the field up to date.

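# views.py

from django.contrib.postgres.search import SearchQuery, SearchRank, SearchVector
from django.db.models import F

from .models import Tweet

query = SearchQuery('chance')
vector = SearchVector('text')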

on_the_fly = Tweet.objects.annotate(
    rank=SearchRank(vector, query)
).filter(
    rank__gte=0.001
)

from_field = Tweet.objects.annotate(
    rank=SearchRank(F('text_search_vector'), query)
).filter(
    rank__gte=0.001
)

# len(on_the_fly) == 32
# len(from_field) == 0

The on_the_fly queryset, which uses a SearchVector instance, returns 32 results. The from_field queryset, which uses the SearchVectorField, returns 0 results.

The empty result prompted me to drop into the shell to debug. Here's some output from the command line in my python manage.py shell environment:

>>> qs = Tweet.objects.filter(
...     tweet_id__in=[949763170863865857, 961432484620787712]
... ).annotate(
...     vector=SearchVector('text')
... )
>>> 
>>> for tweet in qs:
...     print(f'Doc text: {tweet.text}')
...     print(f'From db:  {tweet.text_search_vector}')
...     print(f'From qs:  {tweet.vector}\n')
... 
Doc text: @Espngreeny Run your 3rd and long play and  compete for a chance on third down.
From db:  '3rd':4 'chanc':12 'compet':9 'espngreeni':1 'long':6 'play':7 'run':2 'third':14
From qs:  '3rd':4 'a':11 'and':5,8 'chance':12 'compete':9 'down':15 'espngreeny':1 'for':10 'long':6 'on':13 'play':7 'run':2 'third':14 'your':3

Doc text: No chance. It was me complaining about Girl Scout cookies. <url-removed-for-stack-overflow>
From db:  '/aggcqwddbh':13 'chanc':2 'complain':6 'cooki':10 'girl':8 'scout':9 't.co':12 't.co/aggcqwddbh':11
From qs:  '/aggcqwddbh':13 'about':7 'chance':2 'complaining':6 'cookies':10 'girl':8 'it':3 'me':5 'no':1 'scout':9 't.co':12 't.co/aggcqwddbh':11 'was':4

You can see that the search vector looks very different when comparing the value from the database to the value that's generated via Django.
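
The two vectors look like the output of two different text search configurations: one that keeps every word as written, and one that removes stop words and stems the rest. Below is a toy imitation in plain Python. It is not PostgreSQL's parser or stemmer. The stop-word set and stem table are copied by hand from the shell output above, and the function names are hypothetical.

```python
# Toy imitation of two text search configurations. This is NOT PostgreSQL's
# parser or stemmer: STOP_WORDS and STEMS are hand-copied from the shell
# output above, and all names here are hypothetical.
import re

STOP_WORDS = {"a", "and", "for", "on", "your", "down"}
STEMS = {"chance": "chanc", "compete": "compet", "espngreeny": "espngreeni"}


def simple_lexemes(text):
    """Lowercase and split into words, keeping every token."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def english_like_lexemes(text):
    """Drop the listed stop words and apply the hand-copied stem table."""
    return {STEMS.get(w, w) for w in simple_lexemes(text) if w not in STOP_WORDS}


doc = "@Espngreeny Run your 3rd and long play and  compete for a chance on third down."
stored = english_like_lexemes(doc)  # stands in for the value read from the column
on_the_fly = simple_lexemes(doc)    # stands in for the annotated SearchVector

# An unstemmed query term matches only the unstemmed vector.
print("chance" in on_the_fly)  # True
print("chance" in stored)      # False
# The same term, stemmed the way the stored vector was, matches it.
print(STEMS["chance"] in stored)  # True
```

If the query term is processed the same way as the on-the-fly vector, it can only match that vector, which would explain the 32 versus 0 results.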

Does anyone have any ideas as to why this would happen? Thanks!

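SearchQuery translates the terms the user provides into a search query object that the database compares to a search vector. By default, the terms are passed through a stemming algorithm, and the database then looks for matches for all of the resulting terms. When no config is given, the database's default text search configuration is used, and that default can differ between your laptop and the RDS instance. The stored vector contains stems such as 'chanc', so it appears to have been built with the english configuration, and an unstemmed query for 'chance' cannot match it.

Two things need to be fixed. First, tell SearchQuery which language configuration to use:

query = SearchQuery('chance', config='english')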

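Second, try replacing this line

rank=SearchRank(F('text_search_vector'), query)

with the field name passed as a plain string (SearchRank wraps a plain string in SearchVector, so inspect str(queryset.query) to confirm the SQL is what you expect):

rank=SearchRank('text_search_vector', query)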

As for the words missing from text_search_vector: removing common words, known as stop words, is standard behaviour for a language configuration such as english, alongside stemming.
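
To make the configuration explicit on both sides, here is a minimal sketch. It assumes the Tweet model from the question and that the database trigger builds text_search_vector with the english configuration (the trigger itself is not shown in the question). rank_tweets and default_search_config are hypothetical helper names, and the sketch keeps the F() reference used in the question.

```python
# Sketch only: assumes the Tweet model from the question and that the
# database trigger builds text_search_vector with the 'english' configuration.
# rank_tweets and default_search_config are hypothetical helper names.


def rank_tweets(Tweet, term, config="english"):
    """Return (from_field, on_the_fly) querysets that share one configuration."""
    # Imported lazily so this module can be loaded outside a Django project.
    from django.contrib.postgres.search import SearchQuery, SearchRank, SearchVector
    from django.db.models import F

    query = SearchQuery(term, config=config)
    # Rank against the tsvector column maintained by the trigger.
    from_field = Tweet.objects.annotate(
        rank=SearchRank(F("text_search_vector"), query)
    ).filter(rank__gte=0.001)
    # Rank against a vector computed per row with the same configuration.
    on_the_fly = Tweet.objects.annotate(
        rank=SearchRank(SearchVector("text", config=config), query)
    ).filter(rank__gte=0.001)
    return from_field, on_the_fly


def default_search_config():
    """Ask PostgreSQL which configuration it uses when none is given."""
    from django.db import connection

    with connection.cursor() as cursor:
        cursor.execute("SHOW default_text_search_config")
        return cursor.fetchone()[0]
```

Running default_search_config() in the shell on the laptop and against the RDS instance should show whether the two databases default to different configurations.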
