Postgres Full Text Search

July 12, 2022

While working on search functionality on Famnom, I investigated three options:

Standard textual queries allows for selection of objects with matching operations. This works for smaller datasets (like kitchen search on Famnom) with case-insensitive match and unaccented comparisons.

>>> UserIngredient.objects.filter(name__unaccent__icontains='Milk')
[<UserIngredient: 3.25% fat milk>, <UserIngredient: MILK CHOCOLATE>]

Textual queries are less useful when searching over large datasets, with large text fields. Django Haystack is an excellent plugin that provides modular search with the ability to plugin various search backends such as Solr, Elasticsearch, etc behind the same API. However, there are additional backends to be understood, and deployed, with custom schema formats.

Postgres full text search is a surprisingly robust alternative for document-based search if the data is already in Postgres. Search indexing and query lookups are fast, without additional backend overhead.

1. Materialize the search index (GinIndex) as a SearchVector field for performant lookups.

indexes = [
  GinIndex(fields=["search_vector"]),
  models.Index(name="search_result_gtinupc_idx", fields=["gtin_upc"]),
]

def update_search_vector() -> int:
    """Updates the search results model search vector field, and returns the number of rows matched."""
    return update(search_vector=get_search_vector())

2. SearchQuery and SearchRank allow for multiple tokenization and ranking functions for matches between the vector and the query. In addition, further filtering is possible through trigram similarity.

search_query = SearchQuery(query, config="english", search_type="websearch")
search_vectors: SearchVector = (
        SearchVector("name", config="english")
        + SearchVector("brand_name", config="english")
        + SearchVector("brand_owner", config="english")
        + SearchVector("subbrand_name", config="english")
        + SearchVector("gtin_upc", config="english")
)

search_rank: SearchRank = SearchRank(search_vectors, search_query, cover_density=True)
trigram_similarity: TrigramSimilarity = TrigramSimilarity("name", query)
qs: QuerySet[search_result.SearchResult] = (
	search_result.SearchResult.objects.filter(search_vector=search_query)
        .annotate(
            rank0=search_rank,
            rank1=Case(
                When(
                    Q(source_type=constants.DBFoodSourceType.USDA)
                    & Q(source_sub_type=constants.DBFoodSourceSubType.USDA_FOUNDATION_FOOD),
                    then=16,
                ),
                When(source_type=constants.DBFoodSourceType.USER, then=8),
                When(
                    Q(source_type=constants.DBFoodSourceType.USDA)
                    & Q(source_sub_type=constants.DBFoodSourceSubType.USDA_SR_LEGACY_FOOD),
                    then=4,
                ),
                When(
                    Q(source_type=constants.DBFoodSourceType.USDA)
                    & Q(source_sub_type=constants.DBFoodSourceSubType.USDA_SURVEY_FNDDS_FOOD),
                    then=2,
                ),
                When(
                    Q(source_type=constants.DBFoodSourceType.USDA)
                    & Q(source_sub_type=constants.DBFoodSourceSubType.USDA_BRANDED_FOOD),
                    then=1,
                ),
                default=0,
            ),
            similarity=trigram_similarity,
        )
        .order_by("-rank1", "-rank0", "-similarity")
    )