A cube represents a multidimensional focus which helps users analyze the content of data according to their own mental model, rather than being limited by the database tables' model. It allows them to connect values linked by business concepts, rather than their tables' properties. The relationship between source data and the user's analytical focus is defined during the cube creation process, using the concepts outlined below.
- A cube is a group of data cells arranged according to their analytical dimensions. For example, a spreadsheet is a type of two-dimensional matrix with data cells organized into rows and columns. In this case, each row and column is a dimension. A three-dimensional matrix can be visualized as a cube, with each dimension including all slices parallel to that side of the cube. Higher dimensions have no physical representation; rather, data is organized according to user or company mental models. Typical dimensions for a company include time, products, geographical regions, sales channels, etc.
- A dimension is a structural axis of a cube, containing a list of values. These values are of a similar type, according to the user's perception of the data. For example, all months, quarters, years, etc. constitute a dimension of Time. Similarly, cities, regions, countries, etc. are considered geographic dimensions. A dimension acts as an index to identify values within a multidimensional matrix (cube). If a member of the dimension has been selected, the remaining dimensions, where a series of members (or all members) are selected, are considered a 'sub-cube'. This is a concise and intuitive way to organize and select data for collection, exploration, and analysis.
- When the dimension corresponds to a quantitative type of value, it is called a measure. This means that sales volume or billing amounts are possible cube measures. When they intersect with other dimensions such as product or city, we can use them to obtain that measure's total values for those dimensions.
- When a set of values in a dimension is grouped together to form an analytical concept, this creates a hierarchy. Let's take the dimension product, which may contain values such as product codes, product families, or product categories, as an example. We can consider each of these attributes a hierarchy for analysis. When we select this hierarchy and cross it with the measure sales, sales totals are obtained for each.
- In certain cases, levels may be defined within a hierarchy, and parent-child relationships will exist between these levels. For example, for a Time hierarchy within a Time dimension, the following levels may be defined: half-year, quarter, week. Each time value (dd/mm/yyyy) has a level, and some levels contain others.
In our tool, the construction of an analytical cube is based on the Mondrian XML schema format. A Mondrian schema defines the various elements that make up a cube.
These concepts and terms comprise the basis of cube creation.
- Schema: development tool used to create, modify, and publish a Mondrian schema.
- Cube: data structure designed for rapid analysis according to the dimensions which connect to a concrete business problem.
- Table: fact table structure.
- Dimension: specifies the dimensions of the XML schema.
- Hierarchy: hierarchical structure of values within a dimension.
- Level: levels of abstraction defined by conceptual hierarchies.
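To show how these terms nest inside one another, here is a minimal skeleton. It is only a sketch: the schema, table, and column names (Example, sales_fact, product, and so on) are assumptions chosen for illustration, not objects that exist in your dictionary.

```xml
<?xml version="1.0"?>
<!-- Illustrative skeleton only: every table and column name is an assumption -->
<Schema name="Example">
  <Cube name="Sales">
    <!-- Table: the fact table of the cube -->
    <Table name="sales_fact"/>
    <!-- Dimension, then Hierarchy, then Level -->
    <Dimension name="Product" foreignKey="product_id">
      <Hierarchy hasAll="true" primaryKey="product_id">
        <Table name="product"/>
        <Level name="Product Family" column="product_family" uniqueMembers="true"/>
        <Level name="Product Name" column="product_name" uniqueMembers="true"/>
      </Hierarchy>
    </Dimension>
    <!-- Measure: a quantitative column of the fact table plus an aggregator -->
    <Measure name="Unit Sales" column="unit_sales" aggregator="sum"/>
  </Cube>
</Schema>
```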
A schema specification contains all the XML code defining a particular OLAP schema. We use the XML tag <Schema>. For example:
<?xml version="1.0"?> <Schema name="FoodMart"> ... </Schema>
We can define an unlimited number of OLAP schemas in every dictionary. When users execute a query on a database, they are always focused on a particular OLAP schema. Therefore, definitions of Dimensions, Measures, etc. cannot be shared between multiple schemas.
However, two different schemas can use the same data by sharing the same <Table> or <Join> definitions (see next section for details).
There are two kinds of tables: fact tables and dimension tables. Both use the same <Table> tag.
The fact table is the central table of a dimensional schema. It holds the measures (usually business indicators) together with the keys that link each row to the dimension tables surrounding it. In other words, the fact table records the relationship between dimensions and measures.
<Cube name="Store"> <Table name="store"/> ... </Cube>
A dimension table is declared inside a dimension definition, together with the hierarchy levels that are drawn from it. It also uses the <Table> tag.
<Dimension name="Store Size in SQFT"> <Hierarchy hasAll="true" primaryKey="store_id"> <Table name="store"/> <Level name="Store Sqft" column="store_sqft" type="Numeric" uniqueMembers="true"/> </Hierarchy> </Dimension>
For instance, in the example above, the values of the dimension come from the "store_sqft" column in the "store" table.
To base a dimension on several tables, use the <Join> tag. This avoids having to create a new table.
<Dimension name="Pay Type" foreignKey="employee_id"> <Hierarchy hasAll="true" primaryKey="employee_id" primaryKeyTable="employee"> <Join leftKey="position_id" rightKey="position_id"> <Table name="employee"/> <Table name="position"/> </Join> <Level name="Pay Type" table="position" column="pay_type" uniqueMembers="true"/> </Hierarchy> </Dimension>
Simple schemas can be written with an ordinary <Join> clause. For more complex schemas (those that combine more than two tables), it is easier to build the schema on a view table. A view is the result set of a stored query on the data, and it is defined in the physical dictionary/catalog wic_table_object. For further information, see the section Example: schema defined by a view table.
A Dimension definition consists of various nested XML tags (Dimension, Hierarchy, Level) and a Table or Join XML tag. When the dimension belongs to a cube, it is joined to the fact table through a foreignKey attribute that matches the name of a column in the fact table. The dimension is identified with the <Dimension> tag.
The dimension is structured in levels as a way to organize data. To define the relationship between the levels, use the hierarchy element, with the <Hierarchy> tag. The hierarchy element can have a primaryKey attribute.
<Dimension name="Customers" foreignKey="customer_id"> <Hierarchy hasAll="true" primaryKey="customer_id"> <Table name="customer"/> <Level name="Country" column="country" uniqueMembers="true"/> <Level name="City" column="city" uniqueMembers="false"/> ... </Hierarchy> </Dimension>
By default, the hierarchy contains a single top level called All, whose only member is the parent of all other members. The allMemberName attribute overrides the default name of that member, and the allLevelName attribute does the same for the default name of the level.
<Hierarchy hasAll="true" allMemberName="All Customers" primaryKey="customer_id">
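As a minimal sketch, both attributes can be set on the same hierarchy. The label values used here are illustrative assumptions, not names required by the tool:

```xml
<!-- allMemberName renames the top member; allLevelName renames the level that holds it -->
<Hierarchy hasAll="true" allMemberName="All Customers" allLevelName="(All Customers)" primaryKey="customer_id">
  <Table name="customer"/>
  <Level name="Country" column="country" uniqueMembers="true"/>
</Hierarchy>
```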
The <Level> element groups the values of a dimension that share similar characteristics. It is declared with the <Level> tag inside the hierarchy. Each level is defined by a name and a column; the column attribute is the key of the level. Set the uniqueMembers attribute to "true" when the values in the level's key column are unique across all members of that level; otherwise, set it to "false".
<Level name="Country" column="country" uniqueMembers="true"/>
The key of a level can also be an SQL expression instead of a plain column; in that case, use the <KeyExpression> tag inside the level. For example, the following definition always retrieves "gender" as uppercase text:
<Level name="Gender" column="gender" uniqueMembers="true"> <KeyExpression> <SQL dialect="generic">UPPER(customer.gender)</SQL> </KeyExpression> </Level>
The levelType attribute determines the level type. For example:
- levelType="TimeYears": indicates that a level refers to years.
- levelType="TimeQuarters": indicates that a level refers to quarters.
- levelType="TimeMonths": indicates that a level refers to months.
- levelType="TimeDays": indicates that a level refers to days.
- levelType="Regular": indicates that the level is not related to time.
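Put together, a time dimension that uses these level types might look like the sketch below. The table and column names follow the FoodMart sample (time_by_day, the_year, and so on) and are assumptions here; type="TimeDimension" is the value that standard Mondrian expects on a time dimension.

```xml
<!-- Table and column names follow the FoodMart sample and are assumptions -->
<Dimension name="Time" type="TimeDimension" foreignKey="time_id">
  <Hierarchy hasAll="false" primaryKey="time_id">
    <Table name="time_by_day"/>
    <Level name="Year" column="the_year" type="Numeric" uniqueMembers="true" levelType="TimeYears"/>
    <Level name="Quarter" column="quarter" uniqueMembers="false" levelType="TimeQuarters"/>
    <Level name="Month" column="month_of_year" type="Numeric" uniqueMembers="false" levelType="TimeMonths"/>
  </Hierarchy>
</Dimension>
```

Quarter and Month use uniqueMembers="false" because the same quarter or month value repeats under every year.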
The measures are defined by the <Measure> tag. This definition includes a name, a column in the fact table and an aggregator. The aggregator is typically sum, but count, min, max, avg and distinct-count are also permitted.
An optional attribute, formatString, may be used to specify how the value will be displayed. For instance, users may not want unit sales to include decimals, or may want only two decimal places to be shown. For further information see Examples: Section 4 Numerical formats.
<Measure name="Unit Sales" column="unit_sales" aggregator="sum" formatString="##,###"/> <Measure name="Store Cost" column="store_cost" aggregator="sum" formatString="#,###.00"/>
The optional datatype attribute of the measure element may have the following values: String, Integer, Numeric, Boolean, Date, Time, and Timestamp. The default value is Numeric.
<Measure name="Unit Sales" column="unit_sales" aggregator="sum" datatype="Integer" formatString="##,###"/>
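The other aggregators are used in the same way. As a sketch, the following measures count fact rows and distinct customers; the column names product_id and customer_id are assumed to exist in the fact table:

```xml
<!-- count: number of fact rows with a non-null value in the column -->
<Measure name="Sales Count" column="product_id" aggregator="count" formatString="#,###"/>
<!-- distinct-count: number of different customers appearing in the fact rows -->
<Measure name="Customer Count" column="customer_id" aggregator="distinct-count" formatString="#,###"/>
```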
The column attribute of the measure can be replaced by an expression; in this case it is necessary to use the <MeasureExpression> tag. In the example below, sales are included only if they correspond to a promotion sale.
<Measure name="Promotion Sales" aggregator="sum" formatString="#,###.00"> <MeasureExpression> <SQL dialect="generic"> (case when sales_fact_1997.promotion_id = 0 then 0 else sales_fact_1997.store_sales end)</SQL> </MeasureExpression> </Measure>
7 Calculated Measures
Calculated measures are specified with the <CalculatedMember> tag. They allow users to build analytical values in several steps, by combining multiple source measures through basic mathematical operations.
For example, the following definition creates a new measure, "Profit", by subtracting "Store Cost" from "Store Sales":
<CalculatedMember name="Profit" dimension="Measures"> <Formula>[Measures].[Store Sales] - [Measures].[Store Cost]</Formula> <CalculatedMemberProperty name="FORMAT_STRING" value="$#,##0.00"/> </CalculatedMember>
Another useful example retrieves the Unit Sales value of the previous time period (year, month, etc.), so that it can be compared with the current one:
<CalculatedMember name="Unit Sales last Period" dimension="Measures" formula="COALESCEEMPTY(SEARCH([Measures].[Unit Sales], Time.PREVMEMBER), [Measures].[Unit Sales])" visible="true"> <CalculatedMemberProperty name="FORMAT_STRING" value="##,###"/> </CalculatedMember>
The result will look like this to the end user:
Calculated measures can be hidden with the attribute visible='false' (use visible='true' to display them). This attribute is commonly used to hide intermediate calculations that are reused in a later calculated measure.
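A minimal sketch of this pattern, reusing the "Store Sales" and "Store Cost" measures from the earlier example: the intermediate member is hidden, and only the final ratio is shown. The member names "Profit (hidden)" and "Profit Margin" are hypothetical.

```xml
<!-- Intermediate calculation, hidden from end users -->
<CalculatedMember name="Profit (hidden)" dimension="Measures" visible="false">
  <Formula>[Measures].[Store Sales] - [Measures].[Store Cost]</Formula>
</CalculatedMember>
<!-- Visible measure that reuses the hidden one -->
<CalculatedMember name="Profit Margin" dimension="Measures" visible="true">
  <Formula>[Measures].[Profit (hidden)] / [Measures].[Store Sales]</Formula>
  <CalculatedMemberProperty name="FORMAT_STRING" value="0.0%"/>
</CalculatedMember>
```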
Certain special functions are also supported, such as:
- PARALLELPERIOD: A member from a certain time period is specified, and the equivalent member from a previous period is located in the same relative position.
- YTD: Returns the members from the start of the year up to and including the selected member.
- QTD: Returns the members from the start of the quarter up to and including the selected member.
- MTD: Returns the members from the start of the month up to and including the selected member.
- WTD: Returns the members from the start of the week up to and including the selected member.
- PeriodsToDate: Returns a set of members on the same level as the selected member, starting with the first member on the same level and ending with the member in question. This is done according to the level restriction specified in the Time dimension.
To use ParallelPeriod as an example case:
<CalculatedMember name="Sales prev year value" dimension="Measures" formula="VALUE(PARALLELPERIOD([Time].[Years], 1, Time.CURRENTMEMBER),[Measures].[Sales])" visible="true"> <CalculatedMemberProperty name="FORMAT_STRING" value="$#,##0.00" /> </CalculatedMember>
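A year-to-date total can be sketched along the same lines. The formula below is written in standard MDX (Sum over the set returned by YTD); whether the tool expects this form or a wrapper such as the VALUE function used above is an assumption to verify. It also assumes that the levels of the Time dimension declare their levelType.

```xml
<!-- Assumes a Time dimension whose levels declare levelType (TimeYears, TimeMonths, ...) -->
<CalculatedMember name="Unit Sales YTD" dimension="Measures" visible="true">
  <Formula>Sum(YTD([Time].CurrentMember), [Measures].[Unit Sales])</Formula>
  <CalculatedMemberProperty name="FORMAT_STRING" value="##,###"/>
</CalculatedMember>
```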
A cube consists of the grouping of multiple dimensions and measures, as well as a Table or Join element.
It is identified with the <Cube> XML tag. Let's take the following cube as an example:
<Cube name="Store"> <Table name="store"/> <Dimension name="Store Type"> <Hierarchy hasAll="true"> <Level name="Store Type" column="store_type" uniqueMembers="true"/> </Hierarchy> </Dimension> <Dimension name="Position" foreignKey="employee_id"> <Hierarchy hasAll="true" allMemberName="All Position" primaryKey="employee_id"> <Table name="employee"/> <Level name="Management Role" uniqueMembers="true" column="management_role"/> <Level name="Position Title" uniqueMembers="false" column="position_title" ordinalColumn="position_id"/> </Hierarchy> </Dimension> <DimensionUsage name="Store" source="Store" /> <Measure name="Store Sqft" column="store_sqft" aggregator="sum" formatString="#,###.0000" /> </Cube>
Below is a description of the various elements which comprise the cube:
- Table: this tag is used in two places: directly under the cube, where it declares the fact table, and inside a hierarchy, where it declares a dimension table.
- Dimension: these tags set the definition of a dimension (with its elements including Hierarchy, Level, etc.).
- DimensionUsage: this tag lets the cube use a dimension that was previously defined outside of the cube (a shared dimension).
- Measure: this tag defines the measures (facts) used in the cube.
8.1 Use of Dimensions
The use of a Dimension tag within the definition of a Cube corresponds to two possible cases: cube dimensions and shared dimensions.
- Cube dimensions are defined only within the context of a specific cube element. These dimensions cannot be used outside the cube where they are defined. For further information, see previous section 5 Dimensions.
- Shared dimensions belong to a schema. They are defined before the cubes have been defined and can be used by several cubes of the same schema. These dimensions can be used in a particular data cube by employing the <DimensionUsage> tag.
<Schema name="FoodMart"> <Dimension name="Time" type="TimeDimension"> <Hierarchy hasAll="false" primaryKey="time_id"> ... </Hierarchy> </Dimension> ... <Cube name="Warehouse"> <DimensionUsage name="Time" source="Time" foreignKey="time_id"/> ... </Cube> </Schema>
The foreignKey attribute is usually required in cube dimensions and in <DimensionUsage> elements, but not in shared dimensions. The foreignKey attribute of <DimensionUsage> refers to the matching column in the fact table. The source attribute names the shared dimension being used, and the name attribute is the name the dimension takes within the cube.
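Because name and source are separate attributes, the same shared dimension can be used more than once in a cube under different names. The sketch below illustrates this; the Orders cube, the orders_fact table, and its columns are hypothetical.

```xml
<Schema name="FoodMart">
  <!-- Shared dimension: declared once at schema level, without foreignKey -->
  <Dimension name="Time" type="TimeDimension">
    <Hierarchy hasAll="false" primaryKey="time_id">
      <Table name="time_by_day"/>
      <Level name="Year" column="the_year" type="Numeric" uniqueMembers="true" levelType="TimeYears"/>
    </Hierarchy>
  </Dimension>
  <Cube name="Orders">
    <Table name="orders_fact"/>
    <!-- Same source, two names, two fact-table columns (all hypothetical) -->
    <DimensionUsage name="Order Date" source="Time" foreignKey="order_date_id"/>
    <DimensionUsage name="Ship Date" source="Time" foreignKey="ship_date_id"/>
    <Measure name="Quantity" column="quantity" aggregator="sum"/>
  </Cube>
</Schema>
```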
8.2 Accelerating a query
When declaring a cube we can specify the acceleration mode with the attribute acceleration. The possible values for this attribute are the following:
- ON: The system tries to accelerate the query.
- OFF (default behaviour): nothing is changed on the DB connection.
- FALLBACK_OFF: the system tries to accelerate the query, and the request fails if it cannot be accelerated. This mode is useful when certain queries would take a long time to complete without acceleration, because it prevents long-running queries on the DB server.
Sample cube definition:
<Schema name='tpch_1'> <Cube name='Line items' useLeftJoins='true' acceleration="OFF|ON|FALLBACK_OFF"> <Table name='lineitem' approxRowCount='100' /> ... </Cube> </Schema>
9 Virtual Cubes
A virtual cube combines two or more ordinary cubes. It is identified with the element <VirtualCube>.
<VirtualCube name="Warehouse and Sales"> <VirtualCubeDimension cubeName="Sales" name="Customers"/> <VirtualCubeDimension name="Store"/> ... </VirtualCube>
Virtual cubes are often needed in one of the two situations below. In both, the combined results are presented to end users who do not know, or do not need to know, how the underlying data is structured.
- Situation 1: Fact tables of different levels of granularity are available (e.g., a table at the daily level and another at the monthly level).
- Situation 2: Fact tables with different sets of dimensions are available (e.g., one with Product/Time/Customer and the other with Product/Time/Warehouse).
In Situation 1, all cubes would be used in the database query. Calculation of measures is not possible, because the dimensions and measures selected lack cubes in common.
In Situation 2, one cube would not be selected, because its dimensions would not permit the calculation of all measures.
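For reference, a virtual cube usually lists the measures it exposes as well as its dimensions. The sketch below uses the <VirtualCubeMeasure> element of standard Mondrian; the cube, dimension, and measure names come from the FoodMart sample and are assumptions here.

```xml
<VirtualCube name="Warehouse and Sales">
  <!-- Dimension private to one cube: cubeName is given -->
  <VirtualCubeDimension cubeName="Sales" name="Customers"/>
  <!-- Shared dimensions: no cubeName needed -->
  <VirtualCubeDimension name="Time"/>
  <VirtualCubeDimension name="Product"/>
  <!-- Measures taken from each underlying cube -->
  <VirtualCubeMeasure cubeName="Sales" name="[Measures].[Unit Sales]"/>
  <VirtualCubeMeasure cubeName="Warehouse" name="[Measures].[Units Shipped]"/>
</VirtualCube>
```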
10 Mapping Geometry
The OLAP functionalities allow you to obtain thematic maps that represent the spatial distribution of a measure or a dimension of the cube. This option only exists for dimensions that contain geospatial information, such as cities, countries, or postal codes. The different options of the tool allow you to select the type of visualization on the map: points, color areas, or color gradings. These color gradings express the different intervals of the measure and can be selected by using the Measure Range tag (see example in section 3. Complete example).
In order to display a thematic map, it is necessary to select a single dimension with geospatial characteristics. In addition, the OLAP schema must define the relationship between the columns of the cube's levels and the place in the system where the geometric information is stored (externally, in a data warehouse). The example below shows this relationship:
<Geometries> <Geometry name="Store country (geom)" table="geo_countries" primaryKey="name_0"> <Table name="store" columns="store_country" /> </Geometry> <Geometry name="Store state (geom)" table="geo_states" primaryKey="varname_1"> <Table name="store" columns="store_state" /> </Geometry> <Geometry name="Store city (geom)" table="geo_cities" primaryKey="id_country, id_state, id_city"> <Table name="store" columns="store_country, store_state, store_city" /> </Geometry> </Geometries>