Junk Dimensions(原创)

简介:

Junk Dimension

A junk dimension is a convenient grouping of typically low-cardinality flags and indicators. By creating an abstract dimension, these flags and indicators are removed from the fact table while placing them into a useful dimensional framework.A Junk Dimension is a dimension table consisting of attributes that do not belong in the fact table or in any of the existing dimension tables. The nature of these attributes is usually text or various flags, e.g. non-generic comments or just simple yes/no or true/false indicators. These kinds of attributes are typically remaining when all the obvious dimensions in the business process have been identified and thus the designer is faced with the challenge of where to put these attributes that do not belong in the other dimensions.

One solution is to create a new dimension for each of the remaining attributes, but due to their nature, it could be necessary to create a vast number of new dimensions resulting in a fact table with a very large number of foreign keys. The designer could also decide to leave the remaining attributes in the fact table but this could make the row length of the table unnecessarily large if, for example, the attributes is a long text string.

The solution to this challenge is to identify all the attributes and then put them into one or several Junk Dimensions. One Junk Dimension can hold several true/false or yes/no indicators that have no correlation with each other, so it would be convenient to convert the indicators into a more describing attribute. An example would be an indicator about whether a package had arrived, instead of indicating this as “yes” or “no”, it would be converted into “arrived” or “pending” in the junk dimension. The designer can choose to build the dimension table so it ends up holding all the indicators occurring with every other indicator so that all combinations are covered. This sets up a fixed size for the table itself which would be 2x rows, where x is the number of indicators. This solution is appropriate in situations where the designer would expect to encounter a lot of different combinations and where the possible combinations are limited to an acceptable level. In a situation where the number of indicators are large, thus creating a very big table or where the designer only expect to encounter a few of the possible combinations, it would be more appropriate to build each row in the junk dimension as new combinations are encountered. To limit the size of the tables, multiple junk dimensions might be appropriate in other situations depending on the correlation between various indicators.

Junk dimensions are also appropriate for placing attributes like non-generic comments from the fact table. Such attributes might consist of data from an optional comment field when a customer places an order and as a result will probably be blank in many cases. Therefore the junk dimension should contain a single row representing the blanks as a surrogate key that will be used in the fact table for every row returned with a blank comment field

Instance

In a junk dimension, we combine these indicator fields into a single dimension. This way, we'll only need to build a single dimension table, and the number of fields in the fact table, as well as the size of the fact table, can be decreased. The content in the junk dimension table is the combination of all possible values of the individual indicator fields.

Let's look at an example. Assuming that we have the following fact table:

FACT_TABLE

CUSTOMER_ID
PRODUCT_ID
TXN_ID
STORE_ID
TXN_CODE
COUPON_IND
PREPAY_IND
TXT_AMT

In this example, TXN_CODE, COUPON_IND, and PREPAY_IND are all indicator fields. In this existing format, each one of them is a dimension. Using the junk dimension principle, we can combine them into a single junk dimension, resulting in the following fact table:

FACT_TABLE

CUSTOMER_ID
PRODUCT_ID
TXN_ID
STORE_ID
JUNK_ID
TXT_AMT

Note that now the number of dimensions in the fact table went from 7 to 5.

The content of the junk dimension table would look like the following:

In this case, we have 3 possible values for the TXN_CODE field, 2 possible values for the COUPON_IND field, and 2 possible values for the PREPAY_IND field. This results in a total of 3 x 2 x 2 = 12 rows for the junk dimension table.

By using a junk dimension to replace the 3 indicator fields, we have decreased the number of dimensions by 2 and also decreased the number of fields in the fact table by 2. This will result in a data warehousing environment that offer better performance as well as being easier to manage.

 

参考至:《Star Schema The Complete Reference》

http://www.1keydata.com/datawarehousing/junk-dimension.html

http://en.wikipedia.org/wiki/Dimension_%28data_warehouse%29

本文原创,转载请注明出处,作者

如有错误,欢迎指正

邮箱:czmcj@163.com

作者:czmmiao  文章出处:http://czmmiao.iteye.com/blog/2199484
相关文章
|
8月前
|
机器学习/深度学习 PyTorch 算法框架/工具
【论文精读】ISBI 2022 - Retinal Vessel Segmentation with Pixel-wise Adaptive Filters
由于视网膜血管的纹理复杂和成像对比度低,导致精确的视网膜血管分割具有挑战性。以前的方法通常通过级联多个深度网络来细化分割结果
70 0
|
编解码 算法 数据挖掘
Density- and Grid-Based Methods|学习笔记
快速学习 Density- and Grid-Based Methods
100 0
Density- and Grid-Based Methods|学习笔记
|
机器学习/深度学习 计算机视觉
AMASS: Archive of Motion Capture as Surface Shapes论文精读
2005年,斯坦福大学的研究者们提出了SCAPE模型,2014年基于SCAPE模型马普所的研究者们提出了MoSh模型旨在更精细地利用光学动捕数据生成人体三维模型,2015年马普所的研究者们提出了新的SMPL人体三维模型,此后SMPL模型成为了更为主流的人体三维模型。2019年,为了整合现有光学动捕数据集光学标志点数量和位置不统一的情况,马普所提出了MoSh++方法,整合现有15个数据集,生成大规模人体动作捕捉数据集AMASS。
584 0
|
机器学习/深度学习 编解码 数据可视化
Paper:《How far are we from solving the 2D & 3D Face Alignment problem? 》解读与翻译
Paper:《How far are we from solving the 2D & 3D Face Alignment problem? 》解读与翻译
Paper:《How far are we from solving the 2D & 3D Face Alignment problem? 》解读与翻译
|
机器学习/深度学习 编解码 人工智能
Paper之BigGAN:ICLR 2019最新论文《LARGE SCALE GAN TRAINING FOR HIGH FIDELITY NATURAL IMAGE SYNTHESIS》(未完待续)
Paper之BigGAN:ICLR 2019最新论文《LARGE SCALE GAN TRAINING FOR HIGH FIDELITY NATURAL IMAGE SYNTHESIS》(未完待续)
Paper之BigGAN:ICLR 2019最新论文《LARGE SCALE GAN TRAINING FOR HIGH FIDELITY NATURAL IMAGE SYNTHESIS》(未完待续)
|
机器学习/深度学习 算法 前端开发
文献翻译:Statistical Approaches for Gene Selection, Hub Gene Identification and Module Interaction in...
摘要 信息基因的选择是基因表达研究中的重要问题。基因表达数据的小样本量和大量基因特性使选择过程复杂化。此外,所选择的信息基因可以作为基因共表达网络分析的重要输入。
1149 0
|
数据可视化 数据库
文献翻译Complex integrated analysis of lncRNAs-miRNAs-mRNAs in oral squamous cell carcinoma(2)
材料和方法 获取微阵列数据和选择数据集头颈部鳞状细胞癌的基因表达数据 (HNSCC)从Cancer Genome Atlas下载(TCGA)数据库(https://gdc-portal.nci.nih.gov/)。
1469 0
|
机器学习/深度学习
文献翻译Complex integrated analysis of lncRNAs-miRNAs-mRNAs in oral squamous cell carcinoma(1)
Abstract 目的: 本研究旨在通过基因表达数据揭示口腔鳞状细胞癌(OSCC)中lncRNAs-miRNAs-mRNA的调控网络。 材料与方法: 差异表达的lncRNAs,miRNAs和mRNAs(截止值:假阳性率(FDR) 1.5)。
1054 0