Omnidex Text

$RETRIEVE_FILE

 

External Documents

Some databases are used to catalog a collection of external files. In these situations, the database contains columns such as title, author and filename, but it does not contain the data in the files themselves; it only catalogs the filenames. An application retrieves a catalog entry from the database and then independently retrieves the contents of the file.

Omnidex can enhance these applications by allowing the content of the external documents to be dynamically referenced as a column in the catalog table. This allows the content to be indexed using Omnidex, as though it were part of the table, and, as a result, to be searched and retrieved using SQL statements.

It is important to note that the data is not physically transferred into the database; instead it is retrieved from the file as needed for indexing and retrieval.

The EXTRACT_TEXT option (see $RETRIEVE_FILE - Options) extracts text from a formatted or binary file. This option is useful for obtaining the textual content of a Microsoft Word or Adobe Acrobat (PDF) file. In the case of XML or HTML, the EXTRACT_TEXT option can be useful for obtaining all of the text without any of the tags.

When working with HTML and XML documents, it may be desirable to prevent the content of certain tags from being indexed or retrieved. This is common with formatting tags such as font declarations, headers and footers. Administrators can maintain lists of these tags and apply the lists on a per-column basis.
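The idea of extracting text while excluding listed tags can be illustrated with a short Python sketch. This is not how Omnidex implements EXTRACT_TEXT or its tag lists; the names TextExtractor, extract_text and excluded_tags are hypothetical and exist only for this example. The sketch assumes the excluded tags are paired elements (an opening and a closing tag).

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text from HTML, skipping anything inside excluded tags.

    Illustration only: assumes every excluded tag has a matching end tag.
    """

    def __init__(self, excluded_tags):
        super().__init__()
        self.excluded = {t.lower() for t in excluded_tags}
        self.skip_depth = 0  # greater than 0 while inside an excluded element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.excluded:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.excluded and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text only when we are outside all excluded elements.
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html, excluded_tags=()):
    """Return the text of an HTML string without tags or excluded sections."""
    parser = TextExtractor(excluded_tags)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


page = (
    "<html><header>Site banner</header>"
    "<p>Call me Ishmael.</p>"
    "<footer>Page 1</footer></html>"
)
print(extract_text(page, excluded_tags=["header", "footer"]))
```

With the header and footer tags in the exclusion list, only the paragraph text remains; with an empty list, the banner and page number would be returned as well.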

 

Table

The database table can contain any columns, just like any other table. One of these columns should contain the name of the external file to be indexed. The following BOOKS table shows a simple example of what the database table might look like.

BOOK_ID INTEGER
TITLE VARCHAR(100)
AUTHOR VARCHAR(50)
CATEGORY VARCHAR(20)
PUB_DATE DATETIME
FILE_NAME VARCHAR(255)

Any or all of these columns can be indexed with Omnidex.

Notice that there is not a column for the book contents. That is because the book is stored external to the database. The FILE_NAME column will contain the name of the file and possibly the path. The file must reside on the same machine as the database and Omnidex must have read access to it.

 

Environment Source

The environment source file entry for this table, assuming it is part of the STORYBOOKS database in SQL Server, might look like the following:

TABLE        BOOKS
TYPE         RELATIONAL
PHYSICAL    "STORYBOOKS.dbo.BOOKS"
PRIMARY KEY "BOOK_ID"

COLUMN "BOOK_ID"   PHYSICAL "book_id"   DATATYPE INTEGER
        USAGE ROWID
COLUMN "TITLE"     PHYSICAL "title"     DATATYPE VARCHAR(100)
COLUMN "AUTHOR"    PHYSICAL "author"    DATATYPE VARCHAR(50)
COLUMN "CATEGORY"  PHYSICAL "category"  DATATYPE VARCHAR(20)
COLUMN "PUB_DATE"  PHYSICAL "pub_date"  DATATYPE ODBC DATETIME
COLUMN "FILE_NAME" PHYSICAL "file_name" DATATYPE VARCHAR(255)
COLUMN "CONTENT"                        DATATYPE CLOB(16MB)
        as "$retrieve_file(FILE_NAME)"

The last column, "CONTENT", is a pseudocolumn, meaning it does not exist in the database. The $RETRIEVE_FILE function in the AS clause of the "CONTENT" column opens the file named in the FILE_NAME column and reads in its content for indexing. Once Omnidex indexes are installed on this column, the contents of the file can be searched using any of the text search capabilities provided with Omnidex. As far as Omnidex is concerned, the "CONTENT" column is just another column indexed in this database.
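The pseudocolumn concept, a column whose value is read from a file at query time rather than stored in the table, can be sketched outside Omnidex with Python's built-in sqlite3 module. This is only an analogy under stated assumptions: the retrieve_file function here is a hypothetical Python stand-in registered with SQLite, not the Omnidex $RETRIEVE_FILE function, and the trimmed-down books table, open_catalog and search_content are invented for the example.

```python
import os
import sqlite3
import tempfile


def retrieve_file(path):
    """Hypothetical stand-in: return a file's text, or None if unreadable."""
    try:
        with open(path, encoding="utf-8") as f:
            return f.read()
    except OSError:
        return None


def open_catalog():
    """Create an in-memory catalog that stores file names, not file contents."""
    conn = sqlite3.connect(":memory:")
    # Make retrieve_file(...) callable from SQL on this connection.
    conn.create_function("retrieve_file", 1, retrieve_file)
    conn.execute(
        "CREATE TABLE books ("
        "book_id INTEGER PRIMARY KEY, title TEXT, file_name TEXT)"
    )
    return conn


def search_content(conn, word):
    """Return titles whose external file contains the given word."""
    rows = conn.execute(
        "SELECT title FROM books "
        "WHERE retrieve_file(file_name) LIKE '%' || ? || '%' "
        "ORDER BY book_id",
        (word,),
    ).fetchall()
    return [title for (title,) in rows]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "moby_dick.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("Call me Ishmael. The white whale swims on.")

    conn = open_catalog()
    conn.execute("INSERT INTO books VALUES (1, 'Moby Dick', ?)", (path,))
    print(search_content(conn, "whale"))
    conn.close()
```

Unlike this sketch, which rereads every file on each query, an indexed approach reads the files when the indexes are built and answers searches from the index.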

Additionally, all of the other columns in the table can be indexed. With Omnidex installed on this table, you can search for books by title, author, category, publication date, the name of the file the book is stored in, or even the contents of the book.

 

Index Installation

A sample index installation on the BOOKS table might look like the following:

table: BOOKS
column: TITLE;KW;PX
column: AUTHOR;KW;PX
column: CATEGORY;KW
column: PUB_DATE
column: FILE_NAME;KW
column: CONTENT;KW;PX

 
