Omnidex Grids provide high performance by allowing a large database to be partitioned into a set of smaller databases called grid nodes. One distinct advantage of this is that some queries can be resolved to a smaller subset of nodes. When Omnidex optimizes a query, it may be able to determine that the query can be resolved by accessing the indexes and data on only a few of the nodes. This can produce significant improvements in performance and greatly reduce the load on the servers. For example, if a database is partitioned into ten nodes representing ten geographic regions, then queries which request rows from only one region can be limited to the single node containing that region. This means that only 10% of the data and indexes must be evaluated, and 90% of the data and indexes can be avoided.
The partitioning scheme used in an Omnidex Grid determines how the database will be distributed into the grid nodes. There are several questions that should be answered to determine the partitioning scheme.
Generally, the best place to start is to examine the queries. If queries are logged over a period of time and analyzed, there are generally patterns to the criteria. In many databases, most queries begin with criteria such as geography, client IDs, product IDs, or dates. If one set of criteria is common to most queries, then that becomes a logical candidate for the partition qualifier. For example, many databases are partitioned by a geographic region code, a state code or a zip code. Queries that provide criteria against these geographic codes can then be directed solely to those grid nodes that contain those rows.
There can be other reasons for choosing a partition qualifier. Some databases are segmented using a client ID or a product ID, and all queries will be narrowed to that particular client or product. Partitioning the database by these IDs not only speeds the queries, but it also makes it possible to perform maintenance on individual clients or products because they are isolated into a specific node. Other databases provide a revolving view of recent data, such as the last 12 months or the last 30 days. In these cases, partitioning by the month or the day allows old data to be archived and new data to be incorporated without causing maintenance on the entire database.
The application is unaware of the partitioning scheme. It simply sends queries and expects the results to come back quickly. This means that it is possible to alter the partitioning scheme without changing the application. Repartitioning a database can still be a large endeavor, but it can be scheduled by the database administrators without having to schedule development resources.
Generally, grid nodes contain between 10 million and 50 million rows per table. This is not a rule; instead, it is representative of the performance and operational needs for most databases. Many Omnidex applications want to obtain counts from the database for web applications, usually with a target of 1-2 second performance. These targets can usually be reached with nodes containing 10-50 million rows; however, this depends greatly on the quality of the hardware, the number of concurrent processes and the nature of the queries.
Generally, Omnidex Grids contain somewhere between 4 and 25 nodes. Again, this is not a rule, but is representative of most databases. Having a large number of nodes often means that the individual nodes are smaller, and queries that are restricted to individual nodes are quite fast. However, not all queries can be restricted to an individual node. Some queries may have to be processed against the entire database, and therefore on all nodes. These queries will still benefit from concurrently processing on parallel processors; however, the results still must be transferred to the grid controller and reassembled into the final result set. If there are too many nodes, the overhead of these types of queries can become excessive. A quantity of 4-25 nodes provides the opportunity for great performance gains without too much work reassembling the results on the controller.
Relational databases usually contain parent-child relationships between the tables, enforced through primary and foreign key constraints. When a parent table is partitioned, it is important that the corresponding child tables (and further descendant tables) be partitioned to maintain referential integrity. For example, if a customer table is partitioned by geography, it is important that the corresponding child tables, such as orders or mailings, be partitioned by the corresponding foreign key constraint so that each customer’s entire parent-child family is located on the same node.
Some queries request distinct counts of a particular column, or aggregate data based on one or more columns. These queries warrant some additional attention to insure optimal performance. If these queries require processing on more than one node, the grid controller may have to do a significant amount of work to reassemble and re-aggregate the results. What matters most is whether the columns involved share values across multiple nodes of the grid.
For example, if a customer table is partitioned by geography, it is likely that the values of some columns will be unique to a node, such as the primary key, the household ID and perhaps the state or zip code fields. It is likely that the values of other columns will not be unique to a node, such as name or city columns. If a query requests a count of the distinct households, Omnidex can recognize that the values for this column are not shared between nodes. This means that the grid controller can simply request a count of distinct households from each node and then add them together.
The same strategy will not work for the name or city column since names and cities are shared across multiple nodes. If the grid controller summed the count of distinct names or cities from each node, it would have an inflated count that reflects duplicate values between the nodes. To address this, the grid controller must request the actual distinct values from each node and then re-aggregate the data to obtain the proper counts.
This example illustrates that the performance is affected by the number of distinct values or aggregations that are returned from each node. A count of the distinct names will take longer than a count of the distinct gender codes. Each node may have to return thousands of names to the grid controller, but each node would only have to send two gender codes. With this in mind, the database should be partitioned to avoid allowing high cardinality columns used in distinct counts or aggregations to have duplicate values across multiple nodes.
DISC has a team of consultants that are available to work with you on the best design for your Omnidex Grid. Please contact us if we can be of any service.
Now that you have determined your partitioning scheme, it is time to determine your distribution plan .