134 lines
5.1 KiB
Plaintext
134 lines
5.1 KiB
Plaintext
|
|
1. FTS2 Tokenizers
|
|
|
|
When creating a new full-text table, FTS2 allows the user to select
|
|
the text tokenizer implementation to be used when indexing text
|
|
by specifying a "tokenizer" clause as part of the CREATE VIRTUAL TABLE
|
|
statement:
|
|
|
|
CREATE VIRTUAL TABLE <table-name> USING fts2(
|
|
<columns ...> [, tokenizer <tokenizer-name> [<tokenizer-args>]]
|
|
);
|
|
|
|
The built-in tokenizers (valid values to pass as <tokenizer name>) are
|
|
"simple" and "porter".
|
|
|
|
<tokenizer-args> should consist of zero or more white-space separated
|
|
arguments to pass to the selected tokenizer implementation. The
|
|
interpretation of the arguments, if any, depends on the individual
|
|
tokenizer.
|
|
|
|
2. Custom Tokenizers
|
|
|
|
FTS2 allows users to provide custom tokenizer implementations. The
|
|
interface used to create a new tokenizer is defined and described in
|
|
the fts2_tokenizer.h source file.
|
|
|
|
Registering a new FTS2 tokenizer is similar to registering a new
|
|
virtual table module with SQLite. The user passes a pointer to a
|
|
structure containing pointers to various callback functions that
|
|
make up the implementation of the new tokenizer type. For tokenizers,
|
|
the structure (defined in fts2_tokenizer.h) is called
|
|
"sqlite3_tokenizer_module".
|
|
|
|
FTS2 does not expose a C-function that users call to register new
|
|
tokenizer types with a database handle. Instead, the pointer must
|
|
be encoded as an SQL blob value and passed to FTS2 through the SQL
|
|
engine by evaluating a special scalar function, "fts2_tokenizer()".
|
|
The fts2_tokenizer() function may be called with one or two arguments,
|
|
as follows:
|
|
|
|
SELECT fts2_tokenizer(<tokenizer-name>);
|
|
SELECT fts2_tokenizer(<tokenizer-name>, <sqlite3_tokenizer_module ptr>);
|
|
|
|
Where <tokenizer-name> is a string identifying the tokenizer and
|
|
<sqlite3_tokenizer_module ptr> is a pointer to an sqlite3_tokenizer_module
|
|
structure encoded as an SQL blob. If the second argument is present,
|
|
it is registered as tokenizer <tokenizer-name> and a copy of it
|
|
returned. If only one argument is passed, a pointer to the tokenizer
|
|
implementation currently registered as <tokenizer-name> is returned,
|
|
encoded as a blob. Or, if no such tokenizer exists, an SQL exception
|
|
(error) is raised.
|
|
|
|
SECURITY: If the fts2 extension is used in an environment where potentially
|
|
malicious users may execute arbitrary SQL (i.e. gears), they should be
|
|
prevented from invoking the fts2_tokenizer() function, possibly using the
|
|
authorisation callback.
|
|
|
|
See "Sample code" below for an example of calling the fts2_tokenizer()
|
|
function from C code.
|
|
|
|
3. ICU Library Tokenizers
|
|
|
|
If this extension is compiled with the SQLITE_ENABLE_ICU pre-processor
|
|
symbol defined, then there exists a built-in tokenizer named "icu"
|
|
implemented using the ICU library. The first argument passed to the
|
|
xCreate() method (see fts2_tokenizer.h) of this tokenizer may be
|
|
an ICU locale identifier. For example "tr_TR" for Turkish as used
|
|
in Turkey, or "en_AU" for English as used in Australia. For example:
|
|
|
|
"CREATE VIRTUAL TABLE thai_text USING fts2(text, tokenizer icu th_TH)"
|
|
|
|
The ICU tokenizer implementation is very simple. It splits the input
|
|
text according to the ICU rules for finding word boundaries and discards
|
|
any tokens that consist entirely of white-space. This may be suitable
|
|
for some applications in some locales, but not all. If more complex
|
|
processing is required, for example to implement stemming or
|
|
discard punctuation, this can be done by creating a tokenizer
|
|
implementation that uses the ICU tokenizer as part of its implementation.
|
|
|
|
When using the ICU tokenizer this way, it is safe to overwrite the
|
|
contents of the strings returned by the xNext() method (see
|
|
fts2_tokenizer.h).
|
|
|
|
4. Sample code.
|
|
|
|
The following two code samples illustrate the way C code should invoke
|
|
the fts2_tokenizer() scalar function:
|
|
|
|
int registerTokenizer(
|
|
sqlite3 *db,
|
|
char *zName,
|
|
const sqlite3_tokenizer_module *p
|
|
){
|
|
int rc;
|
|
sqlite3_stmt *pStmt;
|
|
const char zSql[] = "SELECT fts2_tokenizer(?, ?)";
|
|
|
|
rc = sqlite3_prepare_v2(db, zSql, -1, &pStmt, 0);
|
|
if( rc!=SQLITE_OK ){
|
|
return rc;
|
|
}
|
|
|
|
sqlite3_bind_text(pStmt, 1, zName, -1, SQLITE_STATIC);
|
|
sqlite3_bind_blob(pStmt, 2, &p, sizeof(p), SQLITE_STATIC);
|
|
sqlite3_step(pStmt);
|
|
|
|
return sqlite3_finalize(pStmt);
|
|
}
|
|
|
|
int queryTokenizer(
|
|
sqlite3 *db,
|
|
char *zName,
|
|
const sqlite3_tokenizer_module **pp
|
|
){
|
|
int rc;
|
|
sqlite3_stmt *pStmt;
|
|
const char zSql[] = "SELECT fts2_tokenizer(?)";
|
|
|
|
*pp = 0;
|
|
rc = sqlite3_prepare_v2(db, zSql, -1, &pStmt, 0);
|
|
if( rc!=SQLITE_OK ){
|
|
return rc;
|
|
}
|
|
|
|
sqlite3_bind_text(pStmt, 1, zName, -1, SQLITE_STATIC);
|
|
if( SQLITE_ROW==sqlite3_step(pStmt) ){
|
|
if( sqlite3_column_type(pStmt, 0)==SQLITE_BLOB ){
|
|
memcpy(pp, sqlite3_column_blob(pStmt, 0), sizeof(*pp));
|
|
}
|
|
}
|
|
|
|
return sqlite3_finalize(pStmt);
|
|
}
|