How to integrate ElasticSearch in ASP.NET Core

Let’s see how to add full-text search to our ASP.NET Core applications with ElasticSearch

Wednesday, January 22, 2020

I’d bet you’ve been asked to add advanced search features to a Web application, often a full-text, Google-like search.
While developing an e-commerce site for technology products, we were asked to let users run advanced searches on the products, so that they could find what they were looking for quickly and completely.

We first tried custom searches that looked for a given string in all the fields of an object. To speed things up, we added a cache layer between the service and the DB to avoid putting too much load on the DB, but we were not satisfied with the results. We then looked at third-party products that could fit our needs and, after an in-depth analysis, we chose ElasticSearch: a distributed, easily adaptable search and analytics engine that exposes its features through a REST API, which makes it easy to extract and transform data.
Specifically, it is an open source full-text search engine based on Apache Lucene, which it uses to index and search documents. Let’s go through the basic concepts.

ElasticSearch stores data in one or more indexes. An ES index is roughly comparable to a database in the SQL world, because we use it to store and read documents.
The document is the main entity of the ElasticSearch world. It consists of a set of named fields, each with one or more values. Each document may have its own set of fields, and no schema or predefined structure is required. It’s just a JSON object.
All documents are analyzed before being stored. This process, called analysis, filters the content (for instance, removing HTML tags) and tokenizes it, so that documents are split into tokens. How each field is analyzed and stored is defined by the mapping.
Each document in ElasticSearch has a type, which makes it possible to store several document types in the same index, each with its own mapping. Note that mapping types were deprecated in ElasticSearch 7.0, where an index is meant to hold a single type.
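
To make the analysis step more concrete, here is a minimal sketch of what "filter, then tokenize" means. ToyAnalyzer is a hypothetical helper written for this article; it is not part of ElasticSearch or NEST, and real analyzers are considerably richer.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper: it only imitates the idea of an analyzer
// (character filter, tokenizer, lowercase token filter).
public static class ToyAnalyzer
{
    public static string[] Analyze(string text)
    {
        // Character filter: replace HTML tags with a space.
        var withoutTags = Regex.Replace(text ?? string.Empty, "<[^>]+>", " ");

        // Tokenizer: split on anything that is not a letter or a digit.
        // Token filter: lowercase every token.
        return Regex.Split(withoutTags, @"[^\p{L}\p{Nd}]+")
            .Where(token => token.Length > 0)
            .Select(token => token.ToLowerInvariant())
            .ToArray();
    }
}
```

Calling ToyAnalyzer.Analyze on a product description returns the lowercase tokens that a search would then be matched against.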

A single instance of the ElasticSearch server is called a Node. A single node is enough for many use cases, but sometimes you need fault tolerance, or you have too much data for a single node to manage. In that case, you can use a multi-node Cluster: a set of nodes working together to handle a load that a single instance cannot. You can configure a cluster so that search and management features remain available even if some nodes are down.

To make the cluster work properly, ElasticSearch spreads data over several physical Apache Lucene indexes. These indexes are called Shards, and the spreading process is called sharding. ElasticSearch manages sharding automatically, so to the end user it looks like one big index.
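
The idea behind sharding can be sketched as "hash the routing value, then take it modulo the number of primary shards". The snippet below is only an illustration under that assumption: ToyRouter is a hypothetical class, and the FNV-1a hash is chosen for brevity, it is not the hash function ElasticSearch uses internally.

```csharp
using System;
using System.Text;

// Hypothetical illustration of routing a document to a shard.
public static class ToyRouter
{
    // FNV-1a is used here only because it is short and deterministic;
    // it is an assumption of this sketch.
    private static uint Fnv1a(string value)
    {
        unchecked
        {
            uint hash = 2166136261;
            foreach (var b in Encoding.UTF8.GetBytes(value))
            {
                hash ^= b;
                hash *= 16777619;
            }
            return hash;
        }
    }

    // The same document id always lands on the same shard,
    // as long as the number of primary shards does not change.
    public static int ShardFor(string documentId, int primaryShards)
    {
        if (documentId == null) throw new ArgumentNullException(nameof(documentId));
        if (primaryShards <= 0) throw new ArgumentOutOfRangeException(nameof(primaryShards));
        return (int)(Fnv1a(documentId) % (uint)primaryShards);
    }
}
```

This also suggests why the number of primary shards is normally fixed when an index is created: changing it would send existing ids to different shards.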

A Replica is a copy of a shard that can be queried in the same way as the original shard.

Replicas relieve the load on a node that cannot handle all the requests by itself, and they provide greater data safety: if you lose data from the original shard, you can recover it from the replica.
ElasticSearch also collects information about the cluster state and index settings, and stores it in the gateway.

Architecturally, ElasticSearch is based on a few simple key concepts:

Sensible defaults: the default configuration is enough to start using ElasticSearch immediately;
It works in a distributed way. At startup, a node tries to join a cluster and automatically becomes part of it;
P2P architecture without a SPOF (single point of failure). Nodes connect automatically to the other machines of the cluster to exchange data and monitor each other;
It scales easily, both in capacity and in data volume, simply by adding new nodes to the cluster;
No restrictions on how data is organized in the index. This allows users to change the data model without any impact on search;
NRT (Near Real Time) search and versioning. Because of its distributed nature, delays and differences between data located on different nodes cannot be avoided. For this reason, ElasticSearch provides versioning mechanisms.

When an ElasticSearch node starts, it looks for the other nodes of the same cluster and connects to them. Older versions used multicast by default (or unicast, if configured); recent versions contact a configured list of seed hosts.

In a cluster, one node is elected as the master node. This node is responsible for managing the cluster state and for assigning shards to nodes. The master node reads the cluster state and, if needed, starts a recovery process that determines which shards are available and assigns one of them as primary. In this way, the cluster appears to work correctly even if it doesn’t have all its resources available. Then, the master node looks for duplicated shards and handles them as replicas.

During normal operation, the master node checks that all available nodes are working correctly. If one of them is unreachable for a configured period of time, it is considered broken and the fault tolerance process starts. Its main activities are rebalancing the cluster, since the shards of the broken node are no longer available, and assigning those shards to a new node. Then, for each lost primary shard, a new primary shard is chosen among the available replicas.

As mentioned, ElasticSearch provides REST APIs that can be used by any system able to send HTTP requests and receive HTTP responses (all browsers and libraries for most development frameworks).
ElasticSearch requests are sent to well-defined URLs, optionally with a JSON body. Responses are JSON documents too.
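
As a rough illustration of "a URL plus an optional JSON body", the sketch below builds a raw search request with HttpClient types only, without NEST. RawSearchRequest is a hypothetical helper, and the products index name and localhost address are the values used later in this article.

```csharp
using System;
using System.Net.Http;
using System.Text;

// Hypothetical helper that builds (but does not send) a search request.
public static class RawSearchRequest
{
    public static HttpRequestMessage Build(Uri baseAddress, string index, string jsonBody)
    {
        // The search endpoint of an index is <base address>/<index>/_search.
        var request = new HttpRequestMessage(
            HttpMethod.Post,
            new Uri(baseAddress, $"{index}/_search"));

        // The body is a JSON document written in the Query DSL.
        request.Content = new StringContent(jsonBody, Encoding.UTF8, "application/json");
        return request;
    }
}
```

The request could then be sent with an HttpClient instance, for example with SendAsync, and the JSON response read from the response content.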

ElasticSearch provides four ways to index data.

Index API: it sends a single document to a given index;
Bulk API: it sends multiple documents in a single HTTP request;
UDP bulk API: it sends multiple documents over UDP (faster but less reliable; available only in old versions of ElasticSearch);
Plugins: executed on the node, they fetch data from external systems.

It’s important to remember that indexing always happens on the primary shard first, and only then on its replicas. If an indexing request reaches a node that doesn’t hold the primary shard, or holds only a replica, the request is forwarded to the node with the primary shard.

Search is performed through the Query API. Using the Query DSL (a JSON-based language for building complex queries), it’s possible to:

use various types of queries, including simple, phrase, range, boolean, spatial, and other queries;
build complex queries by combining simple ones;
filter documents, excluding those that do not match the selected criteria without influencing their score;
find documents similar to a given document;
find suggestions or corrections for a given phrase;
find queries that match a given document.
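
To give an idea of what a combined query looks like, the sketch below serializes a bool query that mixes a scored match clause with a non-scoring range filter. QueryDslSample is a hypothetical helper, and the field names name and releaseDate are assumptions based on the Product class shown later and on NEST's default camelCase naming.

```csharp
using System.Text.Json;

// Hypothetical helper that produces a Query DSL body as a JSON string.
public static class QueryDslSample
{
    public static string BuildQuery(string text, string releasedFrom)
    {
        var body = new
        {
            query = new
            {
                // "bool" is a C# keyword, hence the @ prefix; it serializes as "bool".
                @bool = new
                {
                    // must: the clause contributes to the relevance score.
                    must = new object[] { new { match = new { name = text } } },
                    // filter: documents are included or excluded without affecting the score.
                    filter = new object[] { new { range = new { releaseDate = new { gte = releasedFrom } } } }
                }
            }
        };

        return JsonSerializer.Serialize(body);
    }
}
```

The resulting string can be used as the body of a request to the _search endpoint of an index.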

Search is not a simple single-stage process. It can usually be divided into two phases: scatter, in which all the relevant shards of the index are queried, and gather, in which the partial results are collected, processed and sorted.
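
The gather phase can be pictured as merging the best hits returned by each shard into one ranked list. The following is a simplified sketch of that idea only; ToyGather is a hypothetical class and the real implementation handles much more (sorting options, fetching the documents, and so on).

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical illustration of the gather phase.
public static class ToyGather
{
    // Each inner sequence holds the hits returned by one shard (scatter phase).
    public static List<(string Id, double Score)> Merge(
        IEnumerable<IEnumerable<(string Id, double Score)>> perShardHits,
        int size)
    {
        return perShardHits
            .SelectMany(hits => hits)              // collect the partial results
            .OrderByDescending(hit => hit.Score)   // sort them by relevance
            .Take(size)                            // keep only the requested page size
            .ToList();
    }
}
```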

Get your hands dirty!

ES can be used in several ways, both in the cloud and locally. If you want to install it on a Windows machine, you need an up-to-date Java Virtual Machine (https://www.elastic.co/support/matrix#matrix_jvm); then you can download a zip file from the ElasticSearch download page (https://www.elastic.co/downloads/elasticsearch) and extract it to a folder on disk, for instance C:\Elasticsearch.

To execute it, you can run C:\Elasticsearch\bin\elasticsearch.bat.

If you want to run ElasticSearch as a service, so that you can start and stop it with the Windows tools, you need to add a line to the file C:\Elasticsearch\config\jvm.options.
On 32-bit systems add -Xss320k, on 64-bit systems -Xss1m.

After changing this setting, open a command prompt or PowerShell and run C:\Elasticsearch\bin\elasticsearch-service.bat. The available commands are install, remove, start, stop and manager.
To create the service, we type: C:\Elasticsearch\bin\elasticsearch-service.bat install

To manage the service, we type: C:\Elasticsearch\bin\elasticsearch-service.bat manager, which opens Elastic Service Manager, a GUI that lets us customize the service settings and manage its state.

The default cluster.name and node.name are elasticsearch and your hostname, respectively. If you plan to keep using this cluster or to add more nodes, it is a good idea to change these defaults to unique names in the elasticsearch.yml file.

We can verify that ElasticSearch is running correctly by browsing to http://localhost:9200/. If everything is fine, we get a JSON response with the node name, the cluster name and version information.

To implement our solution, based on .NET Core, we used the NEST package, which we can install through the command:

dotnet add package NEST

NEST allows us to natively use all the ElasticSearch features, both in indexing and searching for documents, and in the administration of nodes and shards.

To configure NEST, we create the ElasticsearchExtensions class:

public static class ElasticsearchExtensions
{
    public static void AddElasticsearch(this IServiceCollection services, IConfiguration configuration)
    {
        var url = configuration["elasticsearch:url"];
        var defaultIndex = configuration["elasticsearch:index"];
 
        var settings = new ConnectionSettings(new Uri(url))
            .DefaultIndex(defaultIndex);
 
        AddDefaultMappings(settings);
 
        var client = new ElasticClient(settings);
 
        services.AddSingleton(client);
 
        CreateIndex(client, defaultIndex);
    }
 
    private static void AddDefaultMappings(ConnectionSettings settings)
    {
        // Exclude these properties from the documents sent to ElasticSearch
        settings
            .DefaultMappingFor<Product>(m => m
                .Ignore(p => p.Price)
                .Ignore(p => p.Quantity)
                .Ignore(p => p.Rating)
            );
    }
 
    private static void CreateIndex(IElasticClient client, string indexName)
    {
        // Create the index with a mapping inferred from the Product class
        var createIndexResponse = client.Indices.Create(indexName,
            index => index.Map<Product>(x => x.AutoMap())
        );
    }
}

This class holds the configuration and the mappings of our object, in our case the Product class. For it, we decided to ignore price, quantity and rating, so they are not stored at indexing time.

The AddElasticsearch extension method is called in Startup.cs:

public void ConfigureServices(IServiceCollection services)
{
    // ...
    services.AddElasticsearch(Configuration);
}

This lets us load all the settings at startup and change them in the appsettings.json file, where we add the following elasticsearch section:

"elasticsearch": {
        "index": "products",
        "url": "http://localhost:9200/"
}

Here index is the default index used to store our documents, and url is the address of our ElasticSearch instance.

Our Product object is defined as follows:

public class Product
{
    public int Id { get; set; }
    public string Ean { get; set; }
    public string Name { get; set; }
    public string Description { get; set; }
    public string Brand { get; set; }
    public string Category { get; set; }
    public string Price { get; set; }
    public int Quantity { get; set; }
    public float Rating { get; set; }
    public DateTime ReleaseDate { get; set; }
}

The products can be indexed, as mentioned before, both individually and in lists.
In our product service we implemented both ways:

public async Task SaveSingleAsync(Product product)
{
    if (_cache.Any(p => p.Id == product.Id))
    {
        await _elasticClient.UpdateAsync<Product>(product, u => u.Doc(product));
    }
    else
    {
        _cache.Add(product);
        await _elasticClient.IndexDocumentAsync(product);
    }
}
 
public async Task SaveManyAsync(Product[] products)
{
    _cache.AddRange(products);
    var result = await _elasticClient.IndexManyAsync(products);
    if (result.Errors)
    {
        // the response can be inspected for errors
        foreach (var itemWithError in result.ItemsWithErrors)
        {
            _logger.LogError("Failed to index document {0}: {1}",
                itemWithError.Id, itemWithError.Error);
        }
    }
}
 
public async Task SaveBulkAsync(Product[] products)
{
    _cache.AddRange(products);
    var result = await _elasticClient.BulkAsync(b => b.Index("products").IndexMany(products));
    if (result.Errors)
    {
        // the response can be inspected for errors
        foreach (var itemWithError in result.ItemsWithErrors)
        {
            _logger.LogError("Failed to index document {0}: {1}",
                itemWithError.Id, itemWithError.Error);
        }
    }
}

where _cache is a list we use as a further in-memory cache of the products.
For multiple documents we also implemented an explicit bulk version, which lets us index a large number of documents in a single request (IndexManyAsync is itself a convenience wrapper over the Bulk API), and in both cases we log any insertion errors.
Note that the SaveSingleAsync method handles both the insertion and the modification of a document through a check on our cache.

For document deletion, we have implemented a DeleteAsync method:

public async Task DeleteAsync(Product product)
{
    await _elasticClient.DeleteAsync<Product>(product);
 
    if (_cache.Contains(product))
    {
        _cache.Remove(product);
    }
}

To search documents, we implemented the Find action, which runs a query-string query against the index and pages the results:

[Route("/search")]
public async Task<IActionResult> Find(string query, int page = 1, int pageSize = 5)
{
    var response = await _elasticClient.SearchAsync<Product>(
        s => s.Query(q => q.QueryString(d => d.Query(query)))
            .From((page - 1) * pageSize)
            .Size(pageSize));

    if (!response.IsValid)
    {
        // We could handle errors here by checking response.OriginalException
        // or response.ServerError properties
        _logger.LogError("Failed to search documents");
        return View("Results", new Product[] { });
    }

    if (page > 1)
    {
        ViewData["prev"] = GetSearchUrl(query, page - 1, pageSize);
    }

    if (response.IsValid && response.Total > page * pageSize)
    {
        ViewData["next"] = GetSearchUrl(query, page + 1, pageSize);
    }

    return View("Results", response.Documents);
}

private static string GetSearchUrl(string query, int page, int pageSize)
{
    // No trailing slash, otherwise it would become part of the pagesize value
    return $"/search?query={Uri.EscapeDataString(query ?? "")}&page={page}&pagesize={pageSize}";
}

The GetSearchUrl method allows us to get the URL to manage pagination.
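
The paging arithmetic used by the search action can be isolated in a few lines. The sketch below is a hypothetical Paging helper, not part of the project, that mirrors the same rules: the offset passed to From, and the conditions under which the previous and next links are shown.

```csharp
using System;

// Hypothetical helper that mirrors the paging rules of the search action.
public static class Paging
{
    // Number of documents to skip for a 1-based page number.
    public static int Offset(int page, int pageSize)
    {
        if (page < 1) throw new ArgumentOutOfRangeException(nameof(page));
        if (pageSize < 1) throw new ArgumentOutOfRangeException(nameof(pageSize));
        return (page - 1) * pageSize;
    }

    // A previous page exists for every page after the first one.
    public static bool HasPrevious(int page) => page > 1;

    // A next page exists when the total exceeds what has been shown so far.
    public static bool HasNext(long total, int page, int pageSize) =>
        total > (long)page * pageSize;
}
```

For example, with a page size of 5, page 3 starts at offset 10, and a next link makes sense only if the index holds more than 15 matching documents.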

For development purposes, we have implemented the ReIndex method, which deletes all the documents in the index and imports them again one by one. It can be useful for importing lists of existing documents that have not been indexed yet.

//Only for development purpose
[HttpGet("/search/reindex")]
public async Task<IActionResult> ReIndex()
{
    await _elasticClient.DeleteByQueryAsync<Product>(q => q.MatchAll());
 
    var allProducts = (await _productService.GetProducts(int.MaxValue)).ToArray();
 
    foreach (var product in allProducts)
    {
        await _elasticClient.IndexDocumentAsync(product);
    }
 
    return Ok($"{allProducts.Length} product(s) reindexed");
}
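
Importing documents one by one means one request per product. If the catalog grows, a possible alternative is to split the list into batches and index each batch with a single call. The sketch below shows only the batching part; Batching is a hypothetical helper and the batch size is an arbitrary choice to tune.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper that splits a sequence into fixed-size batches.
public static class Batching
{
    public static IEnumerable<List<T>> InBatches<T>(IEnumerable<T> source, int batchSize)
    {
        // Note: this is an iterator, so the check runs when enumeration starts.
        if (batchSize <= 0) throw new ArgumentOutOfRangeException(nameof(batchSize));

        var batch = new List<T>(batchSize);
        foreach (var item in source)
        {
            batch.Add(item);
            if (batch.Count == batchSize)
            {
                yield return batch;
                batch = new List<T>(batchSize);
            }
        }

        // Return the last, possibly smaller, batch.
        if (batch.Count > 0)
        {
            yield return batch;
        }
    }
}
```

Each batch could then be passed to a method such as the SaveManyAsync shown earlier, instead of calling IndexDocumentAsync for every product.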

As an example, we created an interface that lets us add N dynamically generated products, through the Bogus library, and perform CRUD operations on the products.

After running the project, we get the following screen:

If we add, for instance, 10 products to our index by typing 10 in the text box and clicking the Import Documents button, we can view the results using the search box, but also directly from the browser at http://localhost:9200/products/_search, where we get a result like this:

The code used in this article is available here.

See you next time!

Written by

Enrico Bencivenga
