What should be the value of max_gram and min_gram in Elasticsearch?

Ashish Tiwari
4 min read · Oct 8, 2018

The Requirement

I was working on Elasticsearch, and the requirement was to implement a like query, “%text%” (like MySQL’s %LIKE%). We could use wildcard, regexp, or query_string queries, but those are slow. Hence I decided to use the ngram token filter for the like query. It was quickly implemented locally and worked exactly as I wanted.

The Problem

To understand the actual behavior, I implemented the same thing on a staging server, and I found some problems once we started indexing there.

1. Storage size increased by roughly 8x, which was too risky. In my previous index the field type was “keyword”, and it took approximately 43 gb to store the data. The new schema for the “like query”, with the ngram filter, needed the storage shown below for the same data:
curl -XGET "http://localhost:9200/_cat/indices?v"

index       docs.count   pri.store.size
ngram-test  459483245    329.5gb

2. Sometimes the like query was not behaving properly and did not return the exact output.

Schema

curl -XPUT "localhost:9200/ngram-test?pretty" -H 'Content-Type: application/json' -d'
{
"settings":{
"index":{
"number_of_shards":5,
"number_of_replicas":0,
"codec": "best_compression"
},
"analysis":{
"filter":{
"like_filter":{
"type":"ngram",
"min_gram":3,
"max_gram":10,
"token_chars":[
"letter",
"digit",
"symbol"
]
}
},
"analyzer":{
"like_analyzer":{
"type":"custom",
"tokenizer":"keyword",
"filter":[
"lowercase",
"like_filter"
]
}
}
}
},
"mappings":{
"logs":{
"properties":{
"email":{
"type":"keyword",
"fields":{
"text":{
"analyzer":"like_analyzer",
"search_analyzer":"like_analyzer",
"type":"text"
}
}
}
}
}
}
}'
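A quick way to verify what this analyzer will emit, before indexing real data, is the _analyze API (a small sketch; the index and analyzer names are the ones defined in the schema above):

curl -XGET "localhost:9200/ngram-test/_analyze?pretty" -H 'Content-Type: application/json' -d'
{
  "analyzer": "like_analyzer",
  "text": "foo@bar.com"
}'

The response lists every token the like_analyzer produces for the given text, which makes it easy to see how min_gram and max_gram affect the inverted index before committing to a schema.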

Analyzing the behavior of the ngram filter

We created a test index and started monitoring storage by inserting docs one by one.

Storage size problem:

1. min_gram: 1, max_gram: 40
curl -X POST "localhost:9200/ngram-test/logs/" -H 'Content-Type: application/json' -d'
{
"email" : "foo@bar.com"
}'

It produced the following terms for the inverted index:
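With min_gram 1 and max_gram 40, the filter emits every contiguous substring of the lowercased keyword token, so for "foo@bar.com" (11 characters) that works out to 66 grams, roughly along these lines:

f   fo   foo   foo@   foo@b   ...   foo@bar.co   foo@bar.com
o   oo   oo@   oo@b   ...           oo@bar.com
o   o@   o@b   ...                  o@bar.com
...
.   .c   .co   .com
c   co   com
o   om
m

Duplicate grams collapse in the inverted index, but every unique substring still becomes a term.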

In the same way, we inserted three more docs and got the following storage readings:

value             docs.count   pri.store.size
foo@bar.com       1            7kb
foo@bar.com       2            9kb
bar@foo.com       3            12.9kb
user@example.com  4            18kb

If we look closely at the 3rd doc (bar@foo.com), it did not produce many new terms, because some terms such as ‘foo’, ‘bar’, ‘.com’ etc. had already been created. So it did not add much to the storage size.

When we inserted the 4th doc (user@example.com), the email address was completely different except for “.com” and “@”. It had to produce mostly new terms, which caused a larger jump in storage size.

2. min_gram: 3, max_gram: 10

We created the same schema with different values of min_gram and max_gram.

It produced the following terms for “foo@bar.com”:
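Again every contiguous substring of the lowercased token becomes a gram, but now only those between 3 and 10 characters long, 44 grams in total for "foo@bar.com", roughly:

foo   foo@   foo@b   ...   foo@bar.c   foo@bar.co
oo@   oo@b   ...           oo@bar.co   oo@bar.com
o@b   o@ba   ...           o@bar.com
...
.co   .com
com

Note that the full 11-character string "foo@bar.com" never appears as a term, because it is longer than max_gram. This is the root of the second problem described below.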

We inserted the same docs again, in the same order, and got the following storage readings:

value             docs.count   pri.store.size
foo@bar.com       1            4.8kb
foo@bar.com       2            8.6kb
bar@foo.com       3            11.4kb
user@example.com  4            15.8kb

It decreased the storage size by roughly 2 kb for these four docs. On staging, with our test data, it dropped our storage from 330 gb to 250 gb.

Like query problem “%search%”

  • With point 1 (min_gram 1, max_gram 40) you can search with any term; it will give you output quickly and accurately.
  • But if we go to point 2 (min_gram 3, max_gram 10), it has not produced the term “foo@bar.com”, because we set max_gram to 10, so it stopped after the last term “foo@bar.co”. If we search “foo@bar.com” directly as a term, it won’t give any result (see the sketch after this list).
  • Similarly, take the email address “user@example.com”: if I search “user@exampl” (11 characters, longer than max_gram), it won’t return any result with this schema.
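As a rough sketch of the failing lookup (using the index and field from the schema above), a term query for the full 11-character address returns no hits, because no gram of length 11 was ever indexed:

curl -XGET "localhost:9200/ngram-test/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "term": {
      "email.text": "foo@bar.com"
    }
  }
}'

The same query with a pattern of 10 characters or fewer, such as “foo@bar.co” or “bar.com”, does match, because those exact grams exist in the index.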

Your ngram filter should produce the exact term that will arrive as the like pattern (i.e. “%text%”, where “text” is the term) in your search query.

In our case, we are OK with min_gram 3 and max_gram 10, because our users are not going to search with fewer than 3 characters or more than 10 characters.

If users try to search with more than 10 characters, we simply fall back to a full-text search query instead of a term lookup. This is one way we tackled it; you can find your own way according to your use case.
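For illustration only (the exact fallback query is not shown here), one way to route longer inputs is a match query on the analyzed field with operator "and": the input itself gets broken into 3-to-10-character grams at search time, and a document matches only if it contains all of them, which approximates %like% behavior for long patterns.

curl -XGET "localhost:9200/ngram-test/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "match": {
      "email.text": {
        "query": "user@example.com",
        "operator": "and"
      }
    }
  }
}'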

Decision

We analyzed our search queries: what type of like query comes in frequently, what the maximum and minimum lengths of the search phrases are, and whether they are case sensitive. We also looked at which fields hold similar data, because if the data is similar it will not take much more storage.

By analyzing our own data like this, we decided on min_gram 3 and max_gram 10 for this specific field. You can assign different min and max gram values to different fields by adding more custom analyzers, as sketched below.
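For example, the analysis block can define one filter and analyzer pair per field; the field names and gram sizes below are made up purely to illustrate the idea:

"analysis": {
  "filter": {
    "email_like_filter":   { "type": "ngram", "min_gram": 3, "max_gram": 10 },
    "subject_like_filter": { "type": "ngram", "min_gram": 2, "max_gram": 15 }
  },
  "analyzer": {
    "email_like_analyzer":   { "type": "custom", "tokenizer": "keyword", "filter": [ "lowercase", "email_like_filter" ] },
    "subject_like_analyzer": { "type": "custom", "tokenizer": "keyword", "filter": [ "lowercase", "subject_like_filter" ] }
  }
}

Each field’s mapping then points at its own analyzer, the same way email.text points at like_analyzer in the schema above.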

Conclusion

You need to analyze your data and the relationships within it. Analyze your query behavior and know your search queries. Once you have all this information, you can make a better decision, or you may find an even better way to solve the problem.

The above is just an example at a very small scale, but the same choices create a large impact on large data. It is all about your use case.

In the above example it would not have helped to keep min_gram 1 and max_gram 40: it gives correct output, but it inflates the inverted index by producing terms nobody searches for, whereas the same output (for our query lengths) can be achieved with the second approach at much lower storage cost.
