OMNIDEX
Omnidex Text
External Documents

Some databases are used to catalog a collection of external files. In these situations, the database contains a series of columns such as title, authorship and filename. The database itself does not contain the data in the files; it is only used to catalog the filenames. An application retrieves a catalog entry from the database and then independently retrieves the contents of the file.

Omnidex can enhance these applications by allowing the content of the external documents to be dynamically referenced as a column in the catalog table. This allows the content to be indexed using Omnidex as though it were part of the table and, as a result, to be searched and retrieved using SQL statements. Note that the data is not physically transferred into the database; instead, it is read from the file as needed for indexing and retrieval.

The EXTRACT_TEXT option (see $RETRIEVE_FILE - Options) extracts text from a formatted or binary file. This option is useful for obtaining the textual content of a Microsoft Word or Adobe Acrobat (PDF) file. For XML or HTML files, the EXTRACT_TEXT option can be used to obtain all of the text without any of the tags.

When working with HTML and XML documents, it may be desirable to prevent the content of certain tags from being indexed or retrieved. This is common with formatting tags such as font declarations, headers and footers. Administrators can maintain lists of these tags and then apply the lists on a per-column basis.
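The idea of extracting text while excluding a list of tags can be illustrated outside of Omnidex. The following is a minimal Python sketch using only the standard library; it is not how Omnidex implements EXTRACT_TEXT or its tag lists. The names `TextExtractor`, `extract_text` and `excluded_tags` are hypothetical, and the sketch assumes every excluded tag is a paired element with a closing tag.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping anything nested inside excluded tags."""

    def __init__(self, excluded_tags=()):
        super().__init__(convert_charrefs=True)
        # Hypothetical per-column tag list; assumed to hold paired tags only.
        self.excluded = {t.lower() for t in excluded_tags}
        self.skip_depth = 0  # greater than 0 while inside an excluded element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.excluded:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.excluded and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep non-blank text only when no excluded element is open.
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(markup, excluded_tags=()):
    """Return the text of an HTML/XML string with all tags removed."""
    parser = TextExtractor(excluded_tags)
    parser.feed(markup)
    parser.close()
    return " ".join(parser.chunks)


if __name__ == "__main__":
    page = ("<html><body><header>Site banner</header>"
            "<p>Call me <b>Ishmael</b> today</p>"
            "<footer>Page 1</footer></body></html>")
    print(extract_text(page))
    print(extract_text(page, excluded_tags=["header", "footer"]))
```

Passing a different `excluded_tags` list for each document set mirrors the per-column application of tag lists described above.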
Table

The database table can have any variety of columns, just like any other table. One of these columns should contain the name of the external file to be indexed. The following BOOKS table shows a simple example of what the database table might look like.

    BOOK_ID INTEGER

Any or all of these columns can be indexed with Omnidex. Notice that there is no column for the book contents, because the book is stored outside the database. The FILE_NAME column contains the name of the file and possibly the path. The file must reside on the same machine as the database, and Omnidex must have read access to it.
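The catalog pattern described here, where the table stores only a file name and the application reads the file itself, can be sketched in Python as follows. This is an illustration under stated assumptions, not Omnidex code: it uses an in-memory SQLite database in place of the real one, the `load_book` helper is hypothetical, and the TITLE column is assumed for the example since only BOOK_ID and FILE_NAME are named above.

```python
import os
import sqlite3
import tempfile


def load_book(conn, book_id):
    """Fetch the catalog row, then read the external file it points to."""
    row = conn.execute(
        "SELECT title, file_name FROM books WHERE book_id = ?", (book_id,)
    ).fetchone()
    if row is None:
        return None
    title, file_name = row
    # The content lives outside the database; the process needs read access.
    with open(file_name, encoding="utf-8") as fh:
        return title, fh.read()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "moby_dick.txt")
        with open(path, "w", encoding="utf-8") as fh:
            fh.write("Call me Ishmael.")

        conn = sqlite3.connect(":memory:")
        # Assumed layout: TITLE is illustrative; FILE_NAME holds the full path.
        conn.execute(
            "CREATE TABLE books ("
            "book_id INTEGER PRIMARY KEY, title TEXT, file_name TEXT)"
        )
        conn.execute("INSERT INTO books VALUES (1, 'Moby Dick', ?)", (path,))
        print(load_book(conn, 1))
        conn.close()
```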
Environment Source

The environment source file entry for this table, assuming it is part of the STORYBOOKS database in SQL Server, might look like the following:

    TABLE BOOKS
    COLUMN "BOOK_ID" PHYSICAL "book_id"
        DATATYPE INTEGER

The last column, "CONTENT", is a pseudocolumn, meaning it does not exist in the database. The $RETRIEVE_FILE function in the AS clause of the "CONTENT" column opens the file and reads its content for indexing. Once Omnidex is installed on this column, the contents of the file can be indexed and searched using any of the text search capabilities provided with Omnidex. As far as Omnidex is concerned, the "CONTENT" column is just another column indexed in this database.

All of the other columns in the table can be indexed as well. With Omnidex installed on this table, you can search for books by title, author, category, publication date, the name of the file the book is stored in, or even the contents of the book.
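To make the pseudocolumn idea concrete, the sketch below builds a small keyword index in Python by reading each file's content at index time, without ever storing that content in the catalog rows. It is a conceptual illustration only and says nothing about how Omnidex indexes are actually built or stored; `build_content_index`, `search` and the `read_file` callback are hypothetical names.

```python
import re
from collections import defaultdict


def build_content_index(rows, read_file):
    """Map each lowercase word to the set of BOOK_IDs whose file contains it.

    rows: iterable of (book_id, file_name) pairs from the catalog table.
    read_file: callable returning the text of a file. Content is read here,
    at index time, and is never written back to the table.
    """
    index = defaultdict(set)
    for book_id, file_name in rows:
        text = read_file(file_name).lower()
        for word in re.findall(r"[a-z0-9]+", text):
            index[word].add(book_id)
    return index


def search(index, *words):
    """Return the sorted BOOK_IDs whose content contains every given word."""
    hits = [index.get(w.lower(), set()) for w in words]
    return sorted(set.intersection(*hits)) if hits else []


if __name__ == "__main__":
    # Stand-in for the file system so the example needs no real files.
    files = {"a.txt": "Call me Ishmael", "b.txt": "The Call of the Wild"}
    idx = build_content_index([(1, "a.txt"), (2, "b.txt")], files.__getitem__)
    print(search(idx, "call"))
    print(search(idx, "call", "wild"))
```

The search returns catalog keys rather than content, so the application can look up the matching rows and open the files only when it needs them.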
Index Installation

A sample index installation on the BOOKS table might look like the following:

    table: BOOKS