Integration: Raw Data Files

File Types

Omnidex supports several types of raw data files. Fixed-length files are simple binary flat files that use a consistent number of bytes per row. Delimited files store data in character format and use delimiters to separate columns and rows. Omnidex Standalone Tables (OSTs) are proprietary files that store data in compressed form and can be easily moved between environments. Support for Hadoop Distributed File System (HDFS) files is planned for the future.

In general, Omnidex data files must maintain a consistent structure, meaning that the data has consistent rows and columns. For example, relational database systems allow data to be exported into data files, and these files are ideal for indexing with Omnidex. Similarly, companies often receive data from vendors or suppliers in this same form, and these files can be indexed directly without having to load the data into a relational database.

Fixed-length Files

Fixed-length files always use the same number of bytes for each column and each row, regardless of the content of the data. No delimiters are used; instead, each column and each row is located by its offset within the file. Binary data such as integer, floating-point, and date datatypes is stored in its native binary format.

In the example below, each row consumes 44 bytes of the file (2 + 32 + 2 + 2 + 2 + 4), with the first row starting at the beginning of the file, the second row beginning at offset 44, the third row beginning at offset 88, and so forth. Note that the STRING datatype stores one more byte than the number of characters allowed, which is storage for the terminating null character. Also note that the FLOAT datatype requires 4 bytes regardless of the number of digits displayed, since a binary FLOAT value always occupies 4 bytes of storage.

Column            STATE    DESCRIPTION  STATE_CODE  REGION   COUNTRY  TAX_RATE
Datatype          CHAR(2)  STRING(31)   CHAR(2)     CHAR(2)  CHAR(2)  FLOAT
Bytes of Storage  2        32           2           2        2        4
Offset 0          AK       Alaska       02          PC       US       0.000
Offset 44         AL       Alabama      01          ES       US       4.000
Offset 88         AR       Arkansas     05          WS       US       4.625
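
The offset arithmetic above can be illustrated with a minimal Python sketch. It is not Omnidex code: it uses Python's standard struct module to build and read records with the same 44-byte layout. The little-endian byte order, the absence of padding between columns, and the helper names pack_row and read_row are assumptions made for illustration only.

```python
import struct

# Assumed layout (not taken from Omnidex itself): little-endian, no padding.
# 2s = STATE, 32s = DESCRIPTION (31 characters + terminating null),
# 2s = STATE_CODE, 2s = REGION, 2s = COUNTRY, f = TAX_RATE (4-byte float)
ROW = struct.Struct("<2s32s2s2s2sf")

def pack_row(state, description, state_code, region, country, tax_rate):
    """Encode one row; struct pads short byte strings with null bytes."""
    return ROW.pack(
        state.encode("ascii"),
        description.encode("ascii"),
        state_code.encode("ascii"),
        region.encode("ascii"),
        country.encode("ascii"),
        tax_rate,
    )

def read_row(data, row_number):
    """Locate a row purely by its offset: row_number * bytes per row."""
    offset = row_number * ROW.size
    state, desc, code, region, country, tax = ROW.unpack_from(data, offset)
    return (
        state.decode("ascii"),
        desc.split(b"\0", 1)[0].decode("ascii"),  # drop the null padding
        code.decode("ascii"),
        region.decode("ascii"),
        country.decode("ascii"),
        tax,
    )

rows = [
    ("AK", "Alaska", "02", "PC", "US", 0.0),
    ("AL", "Alabama", "01", "ES", "US", 4.0),
    ("AR", "Arkansas", "05", "WS", "US", 4.625),
]
data = b"".join(pack_row(*r) for r in rows)

print(ROW.size)           # bytes per row
print(read_row(data, 2))  # third row, read from offset 88
```

Because every row has the same size, any row can be read without scanning the rows before it.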

Delimited Files

Delimited files separate the content of columns and rows using special delimiter characters. These are commonly used by relational databases and other data-related tools as a standard, portable data format. Tabs and commas are the most common delimiters between columns, and linefeeds are the most common delimiters between rows. Delimited files always store data in character format, though their columns can be mapped to binary datatypes in a table definition.

Omnidex supports a wide variety of delimited files, including the standard tab-delimited and comma-delimited formats as well as any combination of one or two characters as a delimiter. This allows great flexibility when receiving data from other sources, and also allows administrators to create delimited files based on their specific needs. Omnidex also provides configuration options that specify how quotation marks are handled, how escape characters are used, and whether header rows are present.

In the example below, each row consumes a variable amount of space, based on the actual length of the data. Columns are separated by commas and rows are separated by linefeeds, so Omnidex will parse the file to read it as a table. All data is stored in character format, including numeric columns.

AK,Alaska,02,PC,US,0.000000                  
AL,Alabama,01,ES,US,4.000000                 
AR,Arkansas,05,WS,US,4.625000                
AZ,Arizona,04,MT,US,5.000000                 
CA,California,06,PC,US,6.000000              
CO,Colorado,08,MT,US,3.000000                
CT,Connecticut,09,NE,US,6.000000             
...
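
As a rough illustration of what parsing such a file involves, the following Python sketch reads the first rows of the example with the standard csv module and converts the character data for the numeric column. It does not show how Omnidex itself parses files. The column names are borrowed from the fixed-length example, and the function name read_delimited is hypothetical.

```python
import csv
import io

# First three rows of the example file, held in memory for the sketch.
SAMPLE = (
    "AK,Alaska,02,PC,US,0.000000\n"
    "AL,Alabama,01,ES,US,4.000000\n"
    "AR,Arkansas,05,WS,US,4.625000\n"
)

# Assumed column names, borrowed from the fixed-length example.
COLUMNS = ["STATE", "DESCRIPTION", "STATE_CODE", "REGION", "COUNTRY", "TAX_RATE"]

def read_delimited(text, delimiter=","):
    """Split rows on linefeeds and columns on the delimiter, then type TAX_RATE."""
    rows = []
    for fields in csv.reader(io.StringIO(text, newline=""), delimiter=delimiter):
        if not fields:  # skip blank lines
            continue
        row = dict(zip(COLUMNS, (f.strip() for f in fields)))
        # Everything arrives as text; only TAX_RATE is converted to a number.
        row["TAX_RATE"] = float(row["TAX_RATE"])
        rows.append(row)
    return rows

table = read_delimited(SAMPLE)
for row in table:
    print(row["STATE"], row["STATE_CODE"], row["TAX_RATE"])
```

STATE_CODE is deliberately left as text in this sketch so that leading zeros such as "02" are preserved.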

Omnidex Standalone Tables (OSTs)

Omnidex Standalone Tables (OSTs) are a proprietary storage format that stores data, indexes, and metadata for a table all in one file. Data is compressed for faster access. The primary purpose of an OST is to allow data to be easily moved around and dynamically attached to different environments; however, OSTs can be directly referenced in Omnidex Environment Files as well.

Omnidex Standalone Tables are a good solution for tables that contain large amounts of binary data as well as large amounts of textual data. OSTs provide excellent compression and performance in these situations.

integration/rawdata/types.txt · Last modified: 2016/06/28 22:38 (external edit)