CREATE_PREFERENCE
Use the DBMS_VECTOR_CHAIN.CREATE_PREFERENCE preference helper procedure to create a vectorizer preference, to be used when creating or updating hybrid vector indexes.
Purpose
To create a vectorizer preference. This allows you to customize vector search parameters of a hybrid vector indexing pipeline. The goal of a vectorizer preference is to provide you with a straightforward way to configure how to chunk or embed your documents, without requiring a deep understanding of various chunking or embedding strategies.
A vectorizer preference is a JSON object that collectively holds user-specified values related to the following chunking, embedding, or vector index creation parameters:
- Chunking (UTL_TO_CHUNKS or VECTOR_CHUNKS)
- Embedding (UTL_TO_EMBEDDING, UTL_TO_EMBEDDINGS, or VECTOR_EMBEDDING)
- Vector index creation (distance, accuracy, and vector_idxtype)
After creating a vectorizer preference, you can use the VECTORIZER parameter to pass this preference name in the paramstring of the PARAMETERS clause for CREATE_HYBRID_VECTOR_INDEX and ALTER_INDEX SQL statements.
Usage Notes
- Creating a preference is optional. If you do not specify a preference, then the index is created with system defaults.
- All vector index preferences follow the same JSON syntax as defined for their corresponding DBMS_VECTOR and DBMS_VECTOR_CHAIN APIs.
Syntax
DBMS_VECTOR_CHAIN.CREATE_PREFERENCE (
PREF_NAME IN VARCHAR2,
PREF_TYPE IN VARCHAR2,
PARAMS IN JSON default NULL
);

PREF_NAME
Specify the name of the vectorizer preference to create.
PREF_TYPE
Type of preference. The only supported preference type is:
DBMS_VECTOR_CHAIN.VECTORIZER
PARAMS
Specify vector search-specific parameters in JSON format.
- Embedding Parameter:

{ "model" : <embedding_model_for_vector_generation> }

For example:

{ "model" : "MY_INDB_MODEL" }

model specifies the name under which your ONNX embedding model is stored in the database. If you do not have an in-database embedding model in ONNX format, then perform the steps listed in Oracle Database AI Vector Search User's Guide.
- Chunking Parameters:

{ "by" : mode, "max" : max, "overlap" : overlap, "split" : split_condition, "custom_list" : [ split_chars1, ... ], "vocabulary" : vocabulary_name, "language" : nls_language, "normalize" : normalize_mode, "norm_options" : [ normalize_option1, ... ], "extended" : boolean }

For example:

JSON('{ "by" : "vocabulary", "vocabulary" : "myvocab", "max" : "100", "overlap" : "0", "split" : "custom", "custom_list" : [ "<p>", "<s>" ], "language" : "american", "normalize" : "options", "norm_options" : [ "whitespace" ] }')

by

Specify a mode for splitting your data, that is, whether to split by counting the number of characters, words, or vocabulary tokens.
Valid values:
- characters (or chars): Splits by counting the number of characters.
- words: Splits by counting the number of words.
  Words are defined as sequences of alphabetic characters, sequences of digits, individual punctuation marks, or symbols. For segmented languages without whitespace word boundaries (such as Chinese, Japanese, or Thai), each native character is considered a word (that is, a unigram).
- vocabulary: Splits by counting the number of vocabulary tokens.
  Vocabulary tokens are words or word pieces recognized by the vocabulary of the tokenizer that your embedding model uses. You can load your vocabulary file using the chunker helper API DBMS_VECTOR_CHAIN.CREATE_VOCABULARY.
  Note: For accurate results, ensure that the chosen model matches the vocabulary file used for chunking. If you are not using a vocabulary file, then ensure that the input length is defined within the token limits of your model.
Default value: words

max

Specify a limit on the maximum size of each chunk. This setting splits the input text at a fixed point where the maximum limit occurs in the larger text. The units of max correspond to the by mode, that is, data is split when it reaches the maximum size limit of a certain number of characters, words, numbers, punctuation marks, or vocabulary tokens.

Valid values:
- by characters: 50 to 4000 characters
- by words: 10 to 1000 words
- by vocabulary: 10 to 1000 tokens

Default value: 100

split

Specify where to split the input text when it reaches the maximum size limit. This helps to keep related data together by defining appropriate boundaries for chunks.
Valid values:
- none: Splits at the max limit of characters, words, or vocabulary tokens.
- newline, blankline, and space: These are single-split character conditions that split at the last split character before the max value.
  Use newline to split at the end of a line of text. Use blankline to split at the end of a blank line (a sequence of characters, such as two newlines). Use space to split at the end of a blank space.
- recursively: This is a multiple-split character condition that breaks the input text using an ordered list of characters (or sequences).
  recursively is predefined as blankline, newline, space, none, in this order:
  1. If the input text is more than the max value, then split by the first split character.
  2. If that fails, then split by the second split character.
  3. And so on.
  4. If no split characters exist, then split by max wherever it appears in the text.
- sentence: This is an end-of-sentence split condition that breaks the input text at a sentence boundary.
  This condition automatically determines sentence boundaries by using knowledge of the input language's sentence punctuation and contextual rules. This language-specific condition relies mostly on end-of-sentence (EOS) punctuation and common abbreviations.
  Contextual rules are based on word information, so this condition is only valid when splitting the text by words or vocabulary (not by characters).
  Note: This condition obeys the by words and max settings, and thus may not determine accurate sentence boundaries in some cases. For example, when a sentence is larger than the max value, it splits the sentence at max. Similarly, it includes multiple sentences in the text only when they fit within the max limit.
- custom: Splits based on a custom list of split characters. You can provide up to 16 split character strings, each with a maximum length of 10 characters.
  Specify an array of valid text literals using the custom_list parameter:
  { "split" : "custom", "custom_list" : [ "split_chars1", ... ] }
  For example:
  { "split" : "custom", "custom_list" : [ "<p>", "<s>" ] }
  Note: You can omit sequences only for tab (\t), newline (\n), and linefeed (\r).

Default value: recursively

overlap

Specify the amount (as a positive integer literal or zero) of the preceding text that the chunk should contain, if any. This helps in logically splitting up related text (such as a sentence) by including some amount of the preceding chunk text.
The amount of overlap depends on how the maximum size of the chunk is measured (in characters, words, or vocabulary tokens). The overlap begins at the specified split condition (for example, at newline).

Valid value: 5% to 20% of max

Default value: 0

language

Specify the language of your input data.
This clause is important, especially when your text contains certain characters (for example, punctuations or abbreviations) that may be interpreted differently in another language.
Valid values:
- NLS-supported language name or its abbreviation, as listed in Oracle Database Globalization Support Guide.
- Custom language name or its abbreviation, as listed in Supported Languages and Data File Locations. Use the DBMS_VECTOR_CHAIN.CREATE_LANG_DATA chunker helper API to load language-specific data (abbreviation tokens) into the database for your specified language.

Default value: NLS_LANGUAGE from session

normalize

Automatically pre-processes or post-processes issues (such as multiple consecutive spaces and smart quotes) that may arise when documents are converted into text. Oracle recommends that you use a normalization mode to extract high-quality chunks.
Valid values:
- none: Applies no normalization.
- all: Normalizes common multi-byte (Unicode) punctuation to standard single-byte.
- options: Specify an array of normalization options using the norm_options parameter:
  { "normalize" : "options", "norm_options" : [ "normalize_option1", ... ] }
  - punctuation: Includes smart quotes, smart hyphens, and other multi-byte equivalents of simple single-byte punctuation.
    For example:
    - U+2018 maps to 0027
    - U+2019 maps to 0027
    - U+201B maps to 0027
  - whitespace: Minimizes whitespace by eliminating unnecessary characters.
    For example, retain blank lines, but remove any extra newlines and interspersed spaces or tabs:
    " \n \n " => "\n\n"
  - widechar: Normalizes wide, multi-byte digits and (a-z) letters to single-byte.
    These are multi-byte equivalents for 0-9 and a-z A-Z, which can show up in ZH/JA formatted text.
  For example:
  { "normalize" : "options", "norm_options" : [ "whitespace" ] }
Default value: None
extended

Increases the output limit of a VARCHAR2 string to 32767 bytes, without requiring you to set the max_string_size parameter to extended.

Default value: 4000 or 32767 (when max_string_size=extended)
- Vector Index Parameters:

{ "distance" : <vector_distance>, "accuracy" : <vector_accuracy>, "vector_idxtype" : <vector_idxtype> }

For example:

{ "distance" : "COSINE", "accuracy" : 95, "vector_idxtype" : "HNSW" }

distance

Distance computation metric: COSINE, MANHATTAN, DOT, EUCLIDEAN, L2_SQUARED, or EUCLIDEAN_SQUARED.
Note: Currently, the HAMMING and JACCARD vector distance metrics are not supported with hybrid vector indexes.
For detailed information on each of these metrics, see Vector Distance Functions and Operators.

Default value: COSINE

accuracy

Target accuracy at which the approximate search should be performed when running an approximate search query using vector indexes. You can specify non-default target accuracy values either as a percentage value or through index-specific internal parameters, depending on the index type you are using.
- For an IVF approximate search: Specify a target accuracy percentage value to influence the number of partitions used to probe the search. Instead of specifying a target accuracy percentage value, you can specify the NEIGHBOR PARTITION PROBES parameter to impose a certain maximum number of partitions to be probed by the search. See Understand Inverted File Flat Vector Indexes.
- For an HNSW approximate search: Specify a target accuracy percentage value to influence the number of candidates considered while probing the index. Instead of specifying a target accuracy percentage value, you can specify the EFSEARCH parameter to impose a certain maximum number of candidates to be considered while probing the index. See Understand Hierarchical Navigable Small World Indexes.

Valid range for both IVF and HNSW vector indexes: ACCURACY > 0 and <= 100

Default value: None
vector_idxtype

Type of vector index to create:

- IVF for the Inverted File Flat (IVF) vector index
- HNSW for the Hierarchical Navigable Small World (HNSW) vector index

For detailed information on each of these indexes, see Manage the Different Categories of Vector Indexes.

Default value: IVF
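Putting these parameter groups together, a single PARAMS document can combine embedding, chunking, and vector index settings. The following sketch is illustrative only: the preference name my_chunk_vec_spec and the model name my_doc_model are assumptions (the model must already be loaded in your database in ONNX format), and each value is drawn from the ranges described above.

```sql
-- Illustrative sketch: preference and model names are assumed, not prescribed.
-- Each key comes from the Embedding, Chunking, and Vector Index parameter
-- descriptions above (overlap of 30 is 10% of max, within the 5%-20% range).
begin
  DBMS_VECTOR_CHAIN.CREATE_PREFERENCE(
    'my_chunk_vec_spec',
    DBMS_VECTOR_CHAIN.VECTORIZER,
    json('{ "model"          : "my_doc_model",
            "by"             : "words",
            "max"            : 300,
            "overlap"        : 30,
            "split"          : "sentence",
            "language"       : "american",
            "normalize"      : "options",
            "norm_options"   : [ "whitespace", "punctuation" ],
            "distance"       : "COSINE",
            "accuracy"       : 90,
            "vector_idxtype" : "HNSW" }'));
end;
/
```

Omitted keys (for example, vocabulary or extended) fall back to the system defaults listed above.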
Example
begin
  DBMS_VECTOR_CHAIN.CREATE_PREFERENCE(
    'my_vec_spec',
    DBMS_VECTOR_CHAIN.VECTORIZER,
    json('{ "vector_idxtype" : "hnsw",
            "model"          : "my_doc_model",
            "by"             : "words",
            "max"            : 100,
            "overlap"        : 10,
            "split"          : "recursively" }'));
end;
/
CREATE HYBRID VECTOR INDEX my_hybrid_idx on
  doc_table(text_column)
  parameters('VECTORIZER my_vec_spec');

Related Topics
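The Purpose section also notes that a preference name can be passed in the paramstring of the PARAMETERS clause of ALTER_INDEX. A hedged sketch, assuming the same 'VECTORIZER <preference>' paramstring form shown above is accepted on rebuild (index and preference names are illustrative; verify the exact clause against the ALTER_INDEX documentation for your release):

```sql
-- Hedged sketch: re-pointing an existing hybrid vector index at a
-- different vectorizer preference. Names and the REBUILD clause are
-- assumptions, not confirmed syntax from this reference.
ALTER INDEX my_hybrid_idx REBUILD
  PARAMETERS('VECTORIZER my_vec_spec');
```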
Parent topic: DBMS_VECTOR_CHAIN