Verifying Approximate Histogram Computation in Hive - Alibaba Cloud Developer Community

2017-08-09

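Introduction:

A histogram gives a more direct view of how data is distributed. With a histogram in hand, execution parameters and execution plans can be tuned in a much more targeted way. Computing an exact histogram, however, takes an enormous amount of computation, so an approximate histogram that is still reasonably accurate is very valuable.
Hive currently implements an approximate histogram computation for numeric data. The NumericHistogram implementation is described as follows: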

/**
 * A generic, re-usable histogram class that supports partial aggregations.
 * The algorithm is a heuristic adapted from the following paper:
 * Yael Ben-Haim and Elad Tom-Tov, "A streaming parallel decision tree algorithm",
 * J. Machine Learning Research 11 (2010), pp. 849--872. Although there are no approximation
 * guarantees, it appears to work well with adequate data and a large (e.g., 20-80) number
 * of histogram bins.
 */

Interested readers can refer to the paper, "A streaming parallel decision tree algorithm".
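The core idea of the cited paper can be sketched in a few lines of Java: keep at most a fixed number of (centroid, count) bins, insert each new value as its own bin, and whenever there are too many bins, merge the two neighbouring bins whose centroids are closest. The class below is a simplified illustration written for this article, not Hive's NumericHistogram source; the name StreamingHistogramSketch and all of its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified streaming histogram sketch (illustrative only, not Hive's code). */
class StreamingHistogramSketch {
    // Each bin is {centroid, count}; the list is kept sorted by centroid.
    private final List<double[]> bins = new ArrayList<>();
    private final int maxBins;

    StreamingHistogramSketch(int maxBins) {
        if (maxBins < 1) {
            throw new IllegalArgumentException("maxBins must be at least 1");
        }
        this.maxBins = maxBins;
    }

    /** Insert one value, then shrink back to maxBins if necessary. */
    void add(double v) {
        int pos = 0;
        while (pos < bins.size() && bins.get(pos)[0] < v) {
            pos++;
        }
        if (pos < bins.size() && bins.get(pos)[0] == v) {
            bins.get(pos)[1] += 1;               // same centroid: bump the count
        } else {
            bins.add(pos, new double[] {v, 1});  // otherwise start a new bin
        }
        trim();
    }

    /** Repeatedly merge the two adjacent bins whose centroids are closest. */
    private void trim() {
        while (bins.size() > maxBins) {
            int best = 0;
            double bestGap = Double.MAX_VALUE;
            for (int i = 0; i + 1 < bins.size(); i++) {
                double gap = bins.get(i + 1)[0] - bins.get(i)[0];
                if (gap < bestGap) {
                    bestGap = gap;
                    best = i;
                }
            }
            double[] a = bins.get(best);
            double[] b = bins.remove(best + 1);
            double total = a[1] + b[1];
            a[0] = (a[0] * a[1] + b[0] * b[1]) / total;  // count-weighted centroid
            a[1] = total;
        }
    }

    int binCount() {
        return bins.size();
    }

    /** Total number of values added so far. */
    double totalCount() {
        double sum = 0;
        for (double[] bin : bins) {
            sum += bin[1];
        }
        return sum;
    }

    /** Sum of centroid * count over all bins; merging leaves this unchanged up to rounding. */
    double weightedSum() {
        double sum = 0;
        for (double[] bin : bins) {
            sum += bin[0] * bin[1];
        }
        return sum;
    }

    public static void main(String[] args) {
        StreamingHistogramSketch h = new StreamingHistogramSketch(10);
        for (int i = 1; i <= 50; i++) {
            h.add(i);
        }
        for (double[] bin : h.bins) {
            System.out.printf("centroid=%.2f count=%.0f%n", bin[0], bin[1]);
        }
    }
}
```

Because a merge replaces two bins with their count-weighted mean, the total count and the overall mean of the data are preserved; only the shape inside each bin is lost.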

I ran a simple test:

package sunwg.test;

// Assumes hive-exec is on the classpath; drop this import if the class was copied into this package.
import org.apache.hadoop.hive.ql.udf.generic.NumericHistogram;

public class testHis {

    public static void main(String[] args) {

        // Approximate histogram with at most 10 bins.
        NumericHistogram numericHistogram = new NumericHistogram();
        numericHistogram.allocate(10);

        // Feed in the values 1.0 through 50.0.
        for (double i = 1.0; i <= 50.0; i++) {
            numericHistogram.add(i);
        }

        // Print the approximate 10%, 20%, ..., 100% quantiles.
        System.out.println(Math.round(numericHistogram.quantile(0.1)));
        System.out.println(Math.round(numericHistogram.quantile(0.2)));
        System.out.println(Math.round(numericHistogram.quantile(0.3)));
        System.out.println(Math.round(numericHistogram.quantile(0.4)));
        System.out.println(Math.round(numericHistogram.quantile(0.5)));
        System.out.println(Math.round(numericHistogram.quantile(0.6)));
        System.out.println(Math.round(numericHistogram.quantile(0.7)));
        System.out.println(Math.round(numericHistogram.quantile(0.8)));
        System.out.println(Math.round(numericHistogram.quantile(0.9)));
        System.out.println(Math.round(numericHistogram.quantile(1.0)));
    }
}

Overall the results were quite reliable. To improve accuracy, increase the number of bins (num_bins), which is the 10 in the call above:

numericHistogram.allocate(10);
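To judge how close an approximation is at a given bin count, it helps to have exact values to compare against. The following is a minimal baseline that computes exact deciles of the same 1 to 50 input using the nearest-rank definition. The class ExactDecileBaseline and the choice of nearest-rank are assumptions made for this sketch; an approximate histogram may define quantiles slightly differently, so small differences are expected.

```java
import java.util.Arrays;

/** Exact decile baseline for comparison with an approximate histogram (illustrative). */
class ExactDecileBaseline {

    /**
     * Nearest-rank decile: the smallest value such that at least k/10
     * of the sorted data is less than or equal to it (k in 1..10).
     */
    static double decile(double[] sorted, int k) {
        if (sorted.length == 0 || k < 1 || k > 10) {
            throw new IllegalArgumentException("need non-empty data and k in 1..10");
        }
        int n = sorted.length;
        int rank = (k * n + 9) / 10;  // ceil(k * n / 10) in integer arithmetic
        return sorted[rank - 1];
    }

    public static void main(String[] args) {
        // Same input as the test above: the values 1.0 through 50.0.
        double[] values = new double[50];
        for (int i = 0; i < values.length; i++) {
            values[i] = i + 1;
        }
        Arrays.sort(values);
        for (int k = 1; k <= 10; k++) {
            System.out.println("q=" + (k / 10.0) + " exact=" + decile(values, k));
        }
    }
}
```

Running the histogram test with several bin counts and comparing each printed quantile against this baseline gives a rough measure of how accuracy changes with num_bins.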

NumericHistogram also supports merging multiple partial histograms.
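Merging matters because each mapper or partition can build its own small histogram, and the partial results are combined afterwards. A minimal sketch of the merge idea, following the same closest-pair rule as the paper: take the union of both bin lists, sort by centroid, then merge neighbouring bins until the limit is met. This is an illustration under assumptions, not the signature or code of Hive's merge method; PartialHistogramMergeSketch and its {centroid, count} array representation are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Sketch of merging two partial histograms (illustrative only, not Hive's code). */
class PartialHistogramMergeSketch {

    /** Each bin is {centroid, count}. Returns a new list with at most maxBins bins. */
    static List<double[]> merge(List<double[]> left, List<double[]> right, int maxBins) {
        if (maxBins < 1) {
            throw new IllegalArgumentException("maxBins must be at least 1");
        }
        // Union of both bin sets, copied so the inputs are not modified.
        List<double[]> out = new ArrayList<>();
        for (double[] bin : left) {
            out.add(bin.clone());
        }
        for (double[] bin : right) {
            out.add(bin.clone());
        }
        out.sort(Comparator.comparingDouble((double[] bin) -> bin[0]));

        // Merge the closest adjacent pair until the bin limit is satisfied.
        while (out.size() > maxBins) {
            int best = 0;
            double bestGap = Double.MAX_VALUE;
            for (int i = 0; i + 1 < out.size(); i++) {
                double gap = out.get(i + 1)[0] - out.get(i)[0];
                if (gap < bestGap) {
                    bestGap = gap;
                    best = i;
                }
            }
            double[] a = out.get(best);
            double[] b = out.remove(best + 1);
            double total = a[1] + b[1];
            a[0] = (a[0] * a[1] + b[0] * b[1]) / total;  // count-weighted centroid
            a[1] = total;
        }
        return out;
    }

    public static void main(String[] args) {
        List<double[]> partialA = new ArrayList<>();
        partialA.add(new double[] {1, 2});
        partialA.add(new double[] {5, 3});
        List<double[]> partialB = new ArrayList<>();
        partialB.add(new double[] {2, 1});
        partialB.add(new double[] {9, 4});

        for (double[] bin : merge(partialA, partialB, 3)) {
            System.out.printf("centroid=%.3f count=%.0f%n", bin[0], bin[1]);
        }
    }
}
```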

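The reason for looking into this is the hope that data integration can study the data, learn its characteristics, and choose a more suitable splitpk, so that tasks are split more evenly, long-tail tasks are reduced, and users are freed from tuning by hand.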
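As a rough illustration of why quantiles help here: if the range boundaries for a split key are placed at evenly spaced quantiles rather than at evenly spaced values, each task receives about the same number of rows even when the keys are skewed. The sketch below uses an exact sort of a key sample as a stand-in for the quantile source; in practice the boundaries could come from an approximate histogram instead. SplitPointSketch and splitPoints are hypothetical names, not part of any data integration tool.

```java
import java.util.Arrays;

/** Sketch: choose range boundaries so each task gets a similar number of keys. */
class SplitPointSketch {

    /**
     * Returns tasks - 1 upper boundaries. Task i handles keys in
     * (boundary[i-1], boundary[i]], with open ends for the first and last task.
     */
    static double[] splitPoints(double[] keys, int tasks) {
        if (tasks < 1 || keys.length < tasks) {
            throw new IllegalArgumentException("need at least one key per task");
        }
        double[] sorted = keys.clone();
        Arrays.sort(sorted);
        double[] bounds = new double[tasks - 1];
        for (int i = 1; i < tasks; i++) {
            // For distinct keys, this is the number of keys at or below the boundary.
            int rank = (int) ((long) i * sorted.length / tasks);
            bounds[i - 1] = sorted[rank - 1];
        }
        return bounds;
    }

    public static void main(String[] args) {
        // Skewed keys: squares of 1..100, so evenly spaced value ranges would be unbalanced.
        double[] keys = new double[100];
        for (int i = 0; i < keys.length; i++) {
            keys[i] = (double) (i + 1) * (i + 1);
        }
        System.out.println(Arrays.toString(splitPoints(keys, 4)));
    }
}
```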
