Integration: Raw Data Files

Delimited Files

Environments

The Omnidex Environment File is the key to setting up Omnidex on raw data files. Delimited files do not contain metadata about columns and datatypes. The Omnidex Environment File fulfills this role.

Declaring Databases for Raw Data Files

Databases in Omnidex are of a particular type, such as Oracle, SQL Server, MySQL or ODBC. A database that accesses a raw data file has its own type: FILE. All tables within this database must be raw data files, though they can be a mixture of the types of raw data files supported by Omnidex.

create database       "SIMPLE"
 type                 FILE
 index_directory      "idx"
 in                   "simple.xml";

Declaring Tables for Delimited Files

Tables will point to one or more delimited files, and the files must match the record layout specified in the CREATE TABLE statement. In the CREATE TABLE statement, the OPTIONS clause is used to identify this as a delimited file and indicate which delimiters to use. The DELIMITED option indicates that this is a delimited file. The remaining options govern the specific delimiters and behavior to use when parsing the file.

In the CREATE TABLE statement, the PHYSICAL clause is used to point to the physical location of the raw data file. This PHYSICAL clause can contain a relative path name (relative to the location of the Omnidex Environment File) such as “dat/states.csv”, or an absolute path name such as “d:/class/lab1/dat/states.csv”. This PHYSICAL clause can also contain wildcards as described below, and can also reference environment variables using special syntax.

create table          "STATES"
 options              "DELIMITED COLUMN=',' RECORD='\r\n' QUOTES"
 physical             "dat/states.csv"
 ...

In the example above, a single file was used as the source of data for a table. It is also possible for multiple files to be used as the source of data for a table. As long as the files all share the same record layout, they will all be processed when reading the table. Filesets can be declared as a comma-separated list of files, or they can be declared using wildcards in the filename. In this example, three files are declared individually in the PHYSICAL clause:

create table          "STATES"
 options              "DELIMITED COLUMN=',' RECORD='\r\n' QUOTES"
 physical             "c:/class/lab1/dat/states1.csv, 
                       c:/class/lab1/dat/states2.csv, 
                       d:/class/lab1/dat/states3.csv"
 ...

In this second example, wildcards are used to declare these files:

create table          "STATES"
 options              "DELIMITED COLUMN=',' RECORD='\r\n' QUOTES"
 physical             "c:/class/lab1/dat/states*.csv, 
                       d:/class/lab1/dat/states*.csv"
 ...

Files can reside in multiple directories and in multiple drives or volumes to allow a table to span multiple storage devices. Files may also reside on a Storage Area Network (SAN); in fact, this is one of the most common implementations of Omnidex. Files may also reside on Microsoft Windows mapped drives; however, users should expect performance degradation to occur in this scenario.

Wildcards provide greater flexibility since they allow more data to be added to a table by simply moving another file into place. Care should be taken with this approach, though, since the table must be reindexed as soon as new data is in place.

Additional Resources

See also:

 
Back to top
integration/rawdata/delimited/environments.txt · Last modified: 2016/06/28 22:38 (external edit)
 
 
chimeric.de = chi`s home Valid CSS Driven by DokuWiki do yourself a favour and use a real browser - get firefox!! Recent changes RSS feed Valid XHTML 1.0