Mining of Massive Datasets – Data Mining-阿里云开发者社区

Mining of Massive Datasets – Data Mining

2017-05-02 1676

版权

本文内容由阿里云实名注册用户自发贡献，版权归原作者所有，阿里云开发者社区不拥有其著作权，亦不承担相应法律责任。具体规则请查看《阿里云开发者社区用户服务协议》和《阿里云开发者社区知识产权保护指引》。如果您发现本社区中有涉嫌抄袭的内容，填写侵权投诉表单进行举报，一经查实，本社区将立刻删除涉嫌侵权内容。

简介：

1 What is Data Mining?

The most commonly accepted definition of “data mining” is the discovery of “models” for data.

1.1 Statistical Modeling

Statisticians were the first to use the term “data mining.”

Now, statisticians view data mining as the construction of a statistical model, that is, an underlying distribution(ex. Gaussian distribution) from which the visible data is drawn.

1.2 Machine Learning

There are some who regard data mining as synonymous with machine learning. There is no question that some data mining appropriately uses algorithms from machine learning.

Machine-learning practitioners use the data as a training set, to train an algorithm of one of the many types used by machine-learning practitioners, such as Bayes nets, support-vector machines, decision trees, hidden Markov models, and many others.

The typical case where machine learning is a good approach is when we have little idea of what we are looking for in the data.

For example, it is rather unclear what it is about movies that makes certain movie-goers like or dislike it.

机器学习适用于这种搞不清什么样的规则能被mining, 所以只需要把数据feed给ML算法, 它就可以替你做出判断, 而你不用关心这个具体的过程.

1.3 Computational Approaches to Modeling

前面谈了两种模型, 用什么方法来discovery模型?

There are many different approaches to modeling data, 这里介绍两种,

1. Summarizing the data succinctly and approximately

2. Extracting the most prominent features of the data and ignoring the rest

下面就来具体谈谈这两种方法.

1.4 Summarization

One of the most interesting forms of summarization is the PageRank idea, which made Google successful. The entire complex structure of the Web is summarized by a single number for each page.

Another important form of summary – Clustering. 书中举了个'Plotting cholera cases on a map of London’的例子, 通过简单的手工的plotting, 对点经行聚类建模, 挖掘出了靠近路口更容易得病的规则.

1.5 Feature Extraction

A complex relationship between objects is represented by finding the strongest statistical dependencies among these objects and using only those in representing all statistical connections.

2 Statistical Limits on Data Mining

A common sort of data-mining problem involves discovering unusual events hidden within massive amounts of data.

但是数据挖掘技术也不是总是有效的, 下面介绍Bonferroni’s Principle来避免滥用这种技术.

2.1 Total Information Awareness

In 2002, the Bush administration put forward a plan to mine all the data it could find, including credit-card receipts, hotel records, travel data, and many other kinds of information in order to track terrorist activity.

当然bush这个计划由于隐私问题最终被议会否决了, 但是这儿只是作为个例子来讨论数据挖掘技术是否有效.

2.2 Bonferroni’s Principle

Calculate the expected number of occurrences of the events you are looking for, on the assumption that data is random. If this number is significantly larger than the number of real instances you hope to find, then you must expect almost anything you find to be bogus.

In a situation like searching for terrorists, where we expect that there are few terrorists operating at any one time.
如果我们通过数据挖掘技术, 每天都挖掘出大量的, 上百万起的恐怖事件, 那么这样的技术就是无效的, 哪怕其中确实有若干恐怖事件...

3 Things Useful to Know

如果你学习数据挖掘, 下面这些基本概念很重要,

1. The TF.IDF measure of word importance.
2. Hash functions and their use.
3. Secondary storage (disk) and its effect on running time of algorithms.
4. The base e of natural logarithms and identities involving that constant.
5. Power laws.

3.1 Importance of Words in Documents

In several applications of data mining, we shall be faced with the problem of categorizing documents (sequences of words) by their topic. Typically, topics are identified by finding the special words that characterize documents about that topic.

这是个相当典型的数据挖掘问题...topic keywords extraction

而最基本的技术就是, TF.IDF (Term Frequency times Inverse Document Frequency).

词频（term frequency，TF）指的是某一个给定的词语在该文件中出现的次数。这个数字通常会被正规化，以防止它偏向长的文件

逆向文件频率（inverse document frequency，IDF）是一个词语普遍重要性的度量。某一特定词语的IDF，可以由总文件数目除以包含该词语之文件的数目，再将得到的商取对数得到

3.5 The Base of Natural Logarithms (自然对数)