
Cost Optimized Vector Database: Introduction to Amazon OpenSearch Service quantization techniques


The rise of generative AI applications has heightened the need for semantic search and natural language search. These advanced search capabilities help find and retrieve conceptually related documents from enterprise content repositories to serve as prompts for generative AI models. Raw data in various source repositories, in the form of text, images, audio, video, and so on, is converted with the help of embedding models into a standard numerical representation called a vector, which powers semantic and natural language search. As organizations adopt more sophisticated large language and foundation models to power their generative AI applications, the accompanying embedding models are also evolving to produce large, high-dimension vector embeddings. As the vector volume expands, memory usage and computational requirements grow proportionally, resulting in higher operational costs. To mitigate this, various compression techniques can be used to optimize memory usage and computational efficiency.

Quantization is a lossy data compression technique aimed at lowering computation and memory usage, leading to lower costs, especially for high-volume data workloads. There are numerous ways to compress data depending on its type and volume. The usual approach is to map infinite values (or a relatively large set of finite values) to a smaller, more discrete set. Vector compression can be achieved by two primary techniques: product quantization and scalar quantization. In product quantization, the original vector dimension array is broken into several sub-vectors, and each sub-vector is encoded into a fixed number of bits that represent the original vector. You then store and search across only the encoded sub-vectors instead of the original vector. In scalar quantization, each dimension of the input vector is mapped from a 32-bit floating-point representation to a smaller data type.

Amazon OpenSearch Service, as a vector database, supports scalar and product quantization techniques to optimize memory usage and reduce operational costs.

OpenSearch as a vector database

OpenSearch is a distributed search and analytics service. The OpenSearch k-nearest neighbor (k-NN) plugin allows you to index, store, and search vectors. Vectors are stored in OpenSearch as a 32-bit float array of type knn_vector, which supports up to 16,000 dimensions per vector.

OpenSearch uses approximate nearest neighbor search to provide scalable vector search. The approximate k-NN algorithm retrieves results based on an estimation of the nearest vectors to a given query vector. Two primary methods for performing approximate k-NN are the graph-based Hierarchical Navigable Small World (HNSW) and the cluster-based Inverted File (IVF). These data structures are built and loaded into memory during the initial vector search operation. As vector volume grows, both the data structures and the associated memory requirements for search operations scale proportionally.

For example, each HNSW graph with 32-bit float data takes approximately 1.1 * (4 * d + 8 * m) * num_vectors bytes of memory. Here, num_vectors is the total number of vectors to be indexed, d is the number of dimensions determined by the embedding model you use to generate the vectors, and m is the number of edges in the HNSW graph, an index parameter that can be tuned for performance. Using this formula, memory requirements for vector storage for a configuration of 384 dimensions and an m value of 16 would be:

  • 1 million vectors: 1.83 GB (1.1 * (4 * 384 + 8 * 16) * 1,000,000 bytes)
  • 1 billion vectors: 1,830 GB (1.1 * (4 * 384 + 8 * 16) * 1,000,000,000 bytes)
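As a quick sanity check, the following minimal Python sketch (plain arithmetic only, using the parameter values from the example above; the helper name is ours) reproduces these estimates:

def hnsw_memory_bytes(num_vectors: int, d: int, m: int) -> float:
    # Approximate native memory for an HNSW graph storing 32-bit float vectors.
    return 1.1 * (4 * d + 8 * m) * num_vectors

for count in (1_000_000, 1_000_000_000):
    gb = hnsw_memory_bytes(count, d=384, m=16) / 1e9
    print(f"{count:>13,} vectors: ~{gb:,.2f} GB")
# prints ~1.83 GB for 1 million vectors and ~1,830.40 GB for 1 billion vectors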

Although approximate nearest neighbor search can be optimized to handle massive datasets with billions of vectors efficiently, the memory required to load 32-bit full-precision vectors during the search process can become prohibitively expensive. To mitigate this, OpenSearch Service supports the following four quantization techniques.

  • Binary quantization
  • Byte quantization
  • FP16 quantization
  • Product quantization

These techniques fall within the broader categories of scalar and product quantization discussed earlier. In this post, you’ll learn quantization techniques for optimizing vector workloads on OpenSearch Service, focusing on memory reduction and cost efficiency. It also introduces the new disk-based vector search approach, which enables efficient querying of vectors stored on disk without loading them into memory. That approach integrates seamlessly with quantization techniques, featuring key configurations such as the on_disk mode and the compression_level parameter. These settings provide built-in, out-of-the-box scalar quantization at indexing time.

Binary quantization (up to 32x compression)

Binary quantization (BQ) is a type of scalar quantization. OpenSearch uses the FAISS engine’s binary quantization, enabling up to 32x compression during indexing. This technique reduces each vector dimension from the default 32-bit float to a 1-bit binary value by compressing the vectors into 0s and 1s. OpenSearch supports indexing, storing, and searching binary vectors. You can also choose to encode each vector dimension using 1, 2, or 4 bits, depending on the desired compression factor, as shown in the example below. The compression factor is adjusted using the bits setting: a value of 2 yields 16x compression, while 4 results in 8x compression. The default setting is 1. In binary quantization, training is handled natively at indexing time, so no additional preprocessing step is required.

To implement binary quantization, define the vector type as knn_vector and specify the encoder name as binary with the desired number of encoding bits. Note that the encoder parameter refers to the method used to compress vector data before storing it in the index. You can tune performance using the space_type, m, and ef_construction parameters. See the OpenSearch documentation for details about the underlying configuration of approximate k-NN.

PUT my-vector-index
{
  "settings": {
    "index.knn": true
  },
  "mappings": {
    "properties": {
      "my_vector_field": {
        "kind": "knn_vector",
        "dimension": 8,
        "methodology": {
          "identify": "hnsw",
          "engine": "faiss",
          "space_type": "l2",
          "parameters": {
            "m": 16,
            "ef_construction": 512,
            "encoder": {
              "identify": "binary",
              "parameters": {
                "bits": 1
              }
            }
          }
        }
      }
    }
  }
}

Memory requirements for implementing binary quantization with FAISS-HNSW:

1.1 * (bits * (d / 8) + 8 * m) * num_vectors bytes.

Compression   Encoding bits   Memory required for 1 billion vectors with d=384 and m=16 (in GB)
32x           1               193.6
16x           2               246.4
8x            4               352.0
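These values follow directly from the formula above; the short Python sketch below (assuming d=384, m=16, and 1 billion vectors, as in the table; the helper name is ours) recomputes them:

def bq_memory_gb(bits: int, d: int = 384, m: int = 16, num_vectors: int = 1_000_000_000) -> float:
    # Approximate native memory (GB) for FAISS-HNSW with binary quantization.
    return 1.1 * (bits * d / 8 + 8 * m) * num_vectors / 1e9

for compression, bits in (("32x", 1), ("16x", 2), ("8x", 4)):
    print(f"{compression}: {bq_memory_gb(bits):.1f} GB")
# 32x: 193.6 GB, 16x: 246.4 GB, 8x: 352.0 GB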

For detailed implementation steps for binary quantization, see the OpenSearch documentation.

Byte quantization (4x compression)

Byte quantization compresses 32-bit floating-point dimensions to 8-bit integers, ranging from –128 to +127, reducing memory usage by 75%. OpenSearch supports indexing, storing, and searching byte vectors, which must be converted to 8-bit format prior to ingestion. To implement byte vectors, specify the k-NN vector field data_type as byte in the index mapping. This feature is compatible with both the Lucene and FAISS engines. An example of creating an index for byte-quantized vectors follows.

PUT /my-vector-index
{
  "settings": {
    "index": {
      "knn": true,
      "knn.algo_param.ef_search": 100
    }
  },
  "mappings": {
    "properties": {
      "my_vector": {
        "kind": "knn_vector",
        "dimension": 3,
        "data_type": "byte",
        "space_type": "l2",
        "methodology": {
          "identify": "hnsw",
          "engine": "faiss",
          "parameters": {
            "ef_construction": 100,
            "m": 16
          }
        }
      }
    }
  }
}

This method requires ingesting a byte-quantized vector into OpenSearch for direct storage in the k-NN vector field (of byte type). However, the recently launched disk-based vector search feature eliminates the need for external vector quantization. This feature is discussed in detail later in this post.
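For illustration, the following Python sketch ingests one pre-quantized byte vector into the index defined above through the standard document API, using the requests library. The endpoint, credentials, and document ID are placeholders for your own cluster:

import requests

endpoint = "https://localhost:9200"   # placeholder; replace with your OpenSearch endpoint
auth = ("admin", "admin")             # placeholder credentials

# Each dimension must already be an 8-bit integer in [-128, 127] and match the mapped dimension (3).
doc = {"my_vector": [12, -37, 101]}

response = requests.put(
    f"{endpoint}/my-vector-index/_doc/1",
    json=doc,
    auth=auth,
    verify=False,  # only for local test clusters with self-signed certificates
)
print(response.status_code, response.json())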

Memory requirements for implementing byte quantization with FAISS-HNSW:

1.1 * (1 * d + 8 * m) * num_vectors bytes.

For detailed implementation steps, see the OpenSearch documentation. For performance metrics regarding accuracy, throughput, and latency, see Byte-quantized vectors in OpenSearch.

FAISS FP16 quantization (2x compression)

FP16 quantization is a technique that uses a 16-bit floating-point scalar representation, reducing memory usage by 50%. Each vector dimension is converted from a 32-bit to a 16-bit floating-point value, effectively halving the memory requirements. The compressed vector dimensions must be within the range [–65504.0, 65504.0]. To implement FP16 quantization, create the index with the k-NN vector field and configure the following:

  • Set the k-NN vector field method and engine to HNSW and FAISS, respectively.
  • Define the encoder parameter and set its name to sq and its type to fp16.

Upon importing 32-bit floating-point vectors into OpenSearch, scalar quantization FP16 (SQfp16) automatically quantizes them to 16-bit floating-point vectors during ingestion and stores them in the vector field. The following example demonstrates creating an index that quantizes and stores FP16-quantized vectors.

PUT /my-vector-index
{
  "settings": {
    "index": {
      "knn": true,
      "knn.algo_param.ef_search": 100
    }
  },
  "mappings": {
    "properties": {
      "my_vector1": {
        "kind": "knn_vector",
        "dimension": 3,
        "space_type": "l2",
        "methodology": {
          "identify": "hnsw",
          "engine": "faiss",
          "parameters": {
            "encoder": {
              "identify": "sq",
              "parameters": {
                "kind": "fp16",
                "clip": true
              }
            },
            "ef_construction": 256,
            "m": 8
          }
        }
      }
    }
  }
}

Memory requirements for implementing FP16 quantization with FAISS-HNSW:

1.1 * (2 * d + 8 * m) * num_vectors bytes.

The preceding FP16 example introduces an optional Boolean parameter called clip, which defaults to false. When false, vectors with out-of-range values (values not between –65504.0 and +65504.0) are rejected. Setting clip to true enables rounding of out-of-range vector values to fit within the supported range. For detailed implementation steps, see the OpenSearch documentation. For performance metrics regarding accuracy, throughput, and latency, see Optimizing OpenSearch with Faiss FP16 scalar quantization: Enhancing memory efficiency and cost-effectiveness.
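Once the index is populated, searches use the ordinary k-NN query with full-precision float values, and the FP16 quantization stays transparent to the caller. A minimal Python sketch follows (placeholder endpoint and credentials, querying the three-dimensional my_vector1 field from the example above):

import requests

endpoint = "https://localhost:9200"   # placeholder; replace with your OpenSearch endpoint
auth = ("admin", "admin")             # placeholder credentials

query = {
    "size": 2,
    "query": {
        "knn": {
            "my_vector1": {
                "vector": [1.5, 5.5, 4.5],   # 32-bit floats; quantized to FP16 internally
                "k": 2
            }
        }
    }
}

response = requests.post(f"{endpoint}/my-vector-index/_search", json=query, auth=auth, verify=False)
print(response.json()["hits"]["hits"])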

Product quantization

Product quantization (PQ) is an advanced dimension-reduction technique that offers significantly higher levels of compression. While conventional scalar quantization methods typically achieve up to 32x compression, PQ can provide compression levels of up to 64x, making it a more efficient solution for optimizing storage and cost. OpenSearch supports PQ with both the IVF and HNSW methods from the FAISS engine. Product quantization partitions vectors into m sub-vectors, each encoded with a bit count determined by the code size. The resulting vector’s memory footprint is m * code_size bits.

FAISS product quantization involves three key steps:

  1. Create and populate a training index to build the PQ model, optimizing for accuracy.
  2. Run the _train API on the training index to generate the quantizer model.
  3. Construct the vector index, configuring the k-NN field to use the trained quantizer model.

The following example demonstrates the three steps to set up product quantization.

Step 1: Create the training index. Populate the training index with an appropriate dataset, making sure its dimensions align with the train-index specification. Note that the training index requires a minimum of 256 documents.

PUT /train-index
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 2
  },
  "mappings": {
    "properties": {
      "train-field": {
        "kind": "knn_vector",
        "dimension": 4
      }
    }
  }
}

Step 2: Create a quantizer model called my-model by running the _train API on the training index you just created. Note that the encoder with name defined as pq enables native vector quantization. Other encoder parameters include code_size and m. FAISS-HNSW requires a code_size of 8 and a training dataset of at least 256 (2^code_size) documents. For detailed parameter specifications, see the PQ parameter reference.

POST /_plugins/_knn/models/my-model/_train
{
  "training_index": "train-index",
  "training_field": "train-field",
  "dimension": 4,
  "description": "My test model description",
  "method": {
    "name": "hnsw",
    "engine": "faiss",
    "parameters": {
      "encoder": {
        "name": "pq",
         "parameters": {
           "code_size":8,
           "m":2
         }
      },
      "ef_construction": 256,
      "m": 8
    }
  }
}

Step 3: Map the quantizer model to your vector index.

PUT /my-vector-index
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 2,
    "index.knn": true
  },
  "mappings": {
    "properties": {
      "target-field": {
        "kind": "knn_vector",
        "model_id": "my-model"
      }
    }
  }
}

Ingest the complete dataset into the newly created index, my-vector-index. The encoder automatically processes the incoming vectors, applying encoding and quantization based on the compression parameters (code_size and m) specified in the quantizer model configuration.
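As an illustrative sketch (placeholder endpoint, credentials, and sample vectors), you could confirm that the quantizer model has finished training and then ingest documents, letting the encoder quantize them at write time:

import requests

endpoint = "https://localhost:9200"   # placeholder; replace with your OpenSearch endpoint
auth = ("admin", "admin")             # placeholder credentials

# The model should reach the "created" state before my-vector-index can accept vectors.
model = requests.get(f"{endpoint}/_plugins/_knn/models/my-model", auth=auth, verify=False).json()
print(model.get("state"))

# Four-dimensional vectors, matching the trained model; PQ encoding happens during indexing.
for doc_id, vector in enumerate([[1.5, 2.5, 3.5, 4.5], [2.5, 3.5, 4.5, 5.5]], start=1):
    r = requests.put(
        f"{endpoint}/my-vector-index/_doc/{doc_id}",
        json={"target-field": vector},
        auth=auth,
        verify=False,
    )
    print(doc_id, r.status_code)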

Memory requirements for implementing product quantization with FAISS-HNSW:

1.1 * ((code_size / 8) * pq_m + 24 + 8 * m) * num_vectors bytes. Here, code_size and pq_m are the encoder parameters (pq_m is the encoder’s m value), m is the HNSW graph parameter, and num_vectors is the total number of vectors.

During quantization, each training vector is broken down into multiple sub-vectors (sub-spaces), defined by the configurable encoder value pq_m. The number of bits used to encode each sub-vector is controlled by the code_size parameter. Each sub-vector is then compressed, or quantized, separately by running k-means clustering with k defined as 2^code_size. In this manner, the vector is compressed to roughly pq_m * code_size bits.
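To make the formula concrete, the following sketch (plain arithmetic; pq_m=16, code_size=8, m=16, and 1 billion vectors, matching the comparison table later in this post; the helper name is ours) estimates the PQ memory footprint:

def pq_hnsw_memory_gb(pq_m: int, code_size: int, m: int, num_vectors: int) -> float:
    # Approximate native memory (GB) for FAISS-HNSW with product quantization.
    bytes_per_vector = (code_size / 8) * pq_m + 24 + 8 * m
    return 1.1 * bytes_per_vector * num_vectors / 1e9

# 16 sub-vectors with 8-bit codes and HNSW m=16 for 1 billion vectors: ~184.8 GB
print(f"{pq_hnsw_memory_gb(pq_m=16, code_size=8, m=16, num_vectors=1_000_000_000):.1f} GB")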

For detailed implementation guidelines and an explanation of the configurable product quantization parameters, see the OpenSearch documentation. For performance metrics regarding accuracy, throughput, and latency using FAISS IVF for PQ, see Choose the k-NN algorithm for your billion-scale use case with OpenSearch.

Disk-based vector search

Disk-based vector search optimizes query efficiency by using compressed vectors in memory while maintaining full-precision vectors on disk. This approach enables OpenSearch to perform searches across large vector datasets without loading entire vectors into memory, improving scalability and resource utilization. Implementation is achieved through two new configurations at index creation: mode and compression level. As of OpenSearch 2.17, the mode parameter can be set to either in_memory or on_disk during indexing. The previously discussed methods default to in_memory mode. In this configuration, the vector index is built using either a graph (HNSW) or bucket (IVF) structure, which is then loaded into native memory during search operations. While offering excellent recall, this approach can strain memory utilization and scalability for high-volume vector workloads.

The on_disk mode optimizes vector search efficiency by storing full-precision vectors on disk while applying real-time, native quantization during indexing. Coupled with adjustable compression levels, this approach allows only compressed vectors to be loaded into memory, thereby improving memory utilization, resource utilization, and search performance. The following compression levels correspond to the scalar quantization methods discussed earlier.

  • 32x: Binary quantization (1-bit dimensions)
  • 4x: Byte and integer quantization (8-bit dimensions)
  • 2x: FP16 quantization (16-bit dimensions)

This method also supports additional compression levels, such as 16x and 8x, that aren’t available with the in_memory mode. To enable disk-based vector search, create the index with mode set to on_disk as shown in the following example.

PUT /my-vector-index
{
  "settings" : {
    "index": {
      "knn": true
    }
  },
  "mappings": {
    "properties": {
      "my_vector_field": {
        "kind": "knn_vector",
        "dimension": 8,
        "space_type": "innerproduct",
        "data_type": "float",
        "mode": "on_disk"
      }
    }
  }
}

Configuring just the mode as on_disk uses the default configuration, which employs the FAISS engine and the HNSW method with a 32x compression level (1-bit binary quantization). The ef_construction parameter, used to tune index-time latency, defaults to 100. For more granular fine-tuning, you can override these k-NN parameters as shown in the example that follows.

PUT /my-vector-index
{
  "settings" : {
    "index": {
      "knn": true
    }
  },
  "mappings": {
    "properties": {
      "my_vector_field": {
        "kind": "knn_vector",
        "dimension": 8,
        "space_type": "innerproduct",
        "data_type": "float",
        "mode": "on_disk",
        "compression_level": "16x",
        "methodology": {
          "identify": "hnsw",
          "engine": "faiss",
          "parameters": {
            "ef_construction": 512
          }
        }
      }
    }
  }
}

Because quantization is a lossy compression technique, higher compression levels typically result in lower recall. To improve recall during quantization, you can configure disk-based vector search to run in two phases using the search-time configuration parameters ef_search and oversample_factor, as shown in the following example.

GET my-vector-index/_search
{
  "question": {
    "knn": {
      "my_vector_field": {
        "vector": [1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5],
        "ok": 5,
        "method_parameters": {
            "ef_search": 512
        },
        "rescore": {
            "oversample_factor": 10.0
        }
      }
    }
  }
}

In the first phase, oversample_factor * k results are retrieved from the quantized vectors in memory and the scores are approximated. In the second phase, the full-precision vectors of those oversample_factor * k results are loaded into memory from disk, and their scores are recomputed against the full-precision query vector. The results are then reduced to the top k.

The oversample_factor for rescoring is determined by the configured dimension and compression level at indexing time. For dimensions below 1,000, the factor is fixed at 5. For dimensions exceeding 1,000, the default factor varies based on the compression level, as shown in the following table.

Compression level   Default oversample_factor for rescoring
32x (default)       3
16x                 2
8x                  2
4x                  No default rescoring
2x                  No default rescoring
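If it helps to reason about these defaults in code, they can be summarized as a small helper (a sketch of the documented defaults described above, not an OpenSearch API):

def default_oversample_factor(dimension: int, compression_level: str):
    # Returns the default rescoring oversample_factor; None means no default rescoring.
    if dimension < 1000:
        return 5.0
    return {"32x": 3.0, "16x": 2.0, "8x": 2.0, "4x": None, "2x": None}[compression_level]

print(default_oversample_factor(384, "32x"))    # 5.0: dimensions below 1,000 always use 5
print(default_oversample_factor(1536, "16x"))   # 2.0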

As previously discussed, the oversample_factor can be adjusted dynamically at search time. This value presents a critical trade-off between accuracy and search efficiency: a higher factor improves accuracy but proportionally increases memory usage and reduces search throughput. See the OpenSearch documentation to learn more about disk-based vector search and the specific usage of oversample_factor.

Performance analysis of quantization methods: Reviewing memory, recall, and query latency

The OpenSearch documentation on approximate k-NN search provides a starting point for implementing vector similarity search. Additionally, Choose the k-NN algorithm for your billion-scale use case with OpenSearch offers valuable insights into designing efficient vector workloads for handling billions of vectors in production environments. It introduces product quantization techniques as a potential solution to reduce memory requirements and associated costs by shrinking the memory footprint.

The following table illustrates the memory requirements for storing and searching through 1 billion vectors using various quantization techniques. The table compares the default memory consumption of full-precision vectors using the HNSW method against the memory consumed by quantized vectors. The model used in this analysis is sentence-transformers/all-MiniLM-L12-v2, which produces 384-dimension vectors. The raw metadata is assumed to be no more than 100 GB.

                                    Without         Product         FP16 vectors   Byte vectors   Binary vectors
                                    quantization    quantization
m value                             16              16              16             16             16
pq_m, code_size                     –               16, 8           –              –              –
Native memory consumption (GB)      1830.4          184.8           985.6          563.2          193.6
Total storage (100 GB + vectors)    1930.4          284.8           1085.6         663.2          293.6

All values are in GB; the FP16, byte, and binary vector columns are the scalar quantization variants.

Reviewing the preceding table reveals that for a dataset comprising 1 billion vectors, the HNSW graph with 32-bit full-precision vectors requires approximately 1,830 GB of memory. Compression techniques such as product quantization can reduce this to 184.8 GB, while scalar quantization offers varying levels of compression. The next table summarizes the correlation between compression techniques and their impact on key performance indicators, including cost savings, recall, and query latency. This analysis builds on the preceding review of memory usage to help you select a compression technique that meets your requirements.

The following table presents two key search metrics: search latency at the 90th percentile (p90) and recall at 100.

  • Search latency @p90 indicates that 90% of search queries complete within that latency.
  • recall@100 is the fraction of the top 100 ground-truth neighbors found in the 100 results returned (a short sketch for computing it follows the next table).
Technique                              Preconditions/Datasets                        Preprocessing required?      Rescoring   Recall@100   p90 query latency   Cost (baseline $X)            Sample cost for 1 billion vectors
Without quantization                   Applicable to all datasets                    No                           No          >= 0.99      < 50 ms             $X                            $20,923.14
Product quantization                   Recall depends on the nature of the           Yes, preprocessing/          No          > 0.7        < 50 ms             $0.1*X (up to 90% savings)    $2,092.31
                                       training data                                 training is required
FP16 quantization [mode=in_memory]     Works for dimension values in the             No                           No          >= 0.95      < 50 ms             $0.5*X (up to 50% savings)    $10,461.57
                                       range [-65504, 65504]
Byte quantization [mode=in_memory]     Works for dimension values in the             No                           No          >= 0.95      < 50 ms             $0.25*X (up to 75% savings)   $5,230.79
                                       range [-128, 127]
Binary quantization [mode=on_disk]     Works well for larger dimensions (>= 768)     No                           Yes         >= 0.90      < 200 ms            $0.15*X (up to 85% savings)   $3,138.47
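For reference, the recall@100 figures in the preceding table can be computed with a few lines of Python (a generic sketch, independent of OpenSearch; the helper name is ours):

def recall_at_k(ground_truth_ids, retrieved_ids, k: int = 100) -> float:
    # Fraction of the top-k ground-truth neighbors present in the top-k returned results.
    relevant = set(ground_truth_ids[:k])
    found = sum(1 for doc_id in retrieved_ids[:k] if doc_id in relevant)
    return found / len(relevant)

# Example: if 90 of the true top-100 neighbors appear in the 100 returned results, recall@100 = 0.90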

The sample cost estimate for a billion vectors is based on a configuration optimized for cost. Note that actual savings may vary based on your specific workload requirements and chosen configuration parameters. Notably, product quantization offers up to 90% cost reduction compared to the baseline HNSW graph-based vector search cost ($X). Scalar quantization similarly yields proportional cost savings, ranging from 50% to 85%, in line with the reduced memory footprint. The choice of compression technique involves balancing cost-effectiveness, accuracy, and performance, because it affects precision and latency.

Conclusion

By leveraging OpenSearch’s quantization techniques, organizations can make informed tradeoffs between cost efficiency, performance, and recall, empowering them to fine-tune their vector database operations for optimal results. These quantization techniques significantly reduce memory requirements, improve query efficiency, and offer built-in encoders for seamless compression. Whether you’re dealing with large-scale text embeddings, image features, or any other high-dimensional data, OpenSearch’s quantization techniques offer efficient solutions for vector search requirements, enabling the development of cost-effective, scalable, and high-performance systems.

As you move forward with your vector database initiatives, we encourage you to:

  1. Explore OpenSearch’s compression techniques in depth
  2. Evaluate the applicability of each technique to your specific use case
  3. Determine the appropriate compression levels based on your requirements for recall and search latency
  4. Measure and compare cost savings based on accuracy, throughput, and latency

Stay informed about the latest developments in this rapidly evolving field, and don’t hesitate to experiment with different quantization techniques to find the optimal balance between cost, performance, and accuracy for your applications.


About the Authors

Aruna Govindaraju is an Amazon OpenSearch Specialist Solutions Architect and has worked with many commercial and open-source search engines. She is passionate about search, relevancy, and user experience. Her expertise in correlating end-user signals with search engine behavior has helped many customers improve their search experience.

Vamshi Vijay Nakkirtha is a software engineering manager working on the OpenSearch Project and Amazon OpenSearch Service. His primary interests include distributed systems. He is an active contributor to various OpenSearch projects such as k-NN, Geospatial, and dashboard-maps.
