Mining of Massive Datasets – Data Mining

简介:

1 What is Data Mining?

The most commonly accepted definition of “data mining” is the discovery of “models” for data.

 

1.1 Statistical Modeling

Statisticians were the first to use the term “data mining.”

Now, statisticians view data mining as the construction of a statistical model, that is, an underlying distribution(ex. Gaussian distribution) from which the visible data is drawn.

 

1.2 Machine Learning

There are some who regard data mining as synonymous with machine learning. There is no question that some data mining appropriately uses algorithms from machine learning.

Machine-learning practitioners use the data as a training set, to train an algorithm of one of the many types used by machine-learning practitioners, such as Bayes netssupport-vector machinesdecision treeshidden Markov models, and many others.

The typical case where machine learning is a good approach is when we have little idea of what we are looking for in the data.

For example, it is rather unclear what it is about movies that makes certain movie-goers like or dislike it.

机器学习适用于这种搞不清什么样的规则能被mining, 所以只需要把数据feed给ML算法, 它就可以替你做出判断, 而你不用关心这个具体的过程.

 

1.3 Computational Approaches to Modeling

前面谈了两种模型, 用什么方法来discovery模型?

There are many different approaches to modeling data, 这里介绍两种,

1. Summarizing the data succinctly and approximately

2. Extracting the most prominent features of the data and ignoring the rest

下面就来具体谈谈这两种方法.

 

1.4 Summarization

One of the most interesting forms of summarization is the PageRank idea, which made Google successful. The entire complex structure of the Web is summarized by a single number for each page.

Another important form of summary – Clustering. 书中举了个'Plotting cholera cases on a map of London’的例子, 通过简单的手工的plotting, 对点经行聚类建模, 挖掘出了靠近路口更容易得病的规则.

 

1.5 Feature Extraction

A complex relationship between objects is represented by finding the strongest statistical dependencies among these objects and using only those in representing all statistical connections.

 

2 Statistical Limits on Data Mining

A common sort of data-mining problem involves discovering unusual events hidden within massive amounts of data.

但是数据挖掘技术也不是总是有效的, 下面介绍Bonferroni’s Principle来避免滥用这种技术.

 

2.1 Total Information Awareness

In 2002, the Bush administration put forward a plan to mine all the data it could find, including credit-card receipts, hotel records, travel data, and many other kinds of information in order to track terrorist activity.

当然bush这个计划由于隐私问题最终被议会否决了, 但是这儿只是作为个例子来讨论数据挖掘技术是否有效.

 

2.2 Bonferroni’s Principle

Calculate the expected number of occurrences of the events you are looking for, on the assumption that data is random. If this number is significantly larger than the number of real instances you hope to find, then you must expect almost anything you find to be bogus.

In a situation like searching for terrorists, where we expect that there are few terrorists operating at any one time. 
如果我们通过数据挖掘技术, 每天都挖掘出大量的, 上百万起的恐怖事件, 那么这样的技术就是无效的, 哪怕其中确实有若干恐怖事件... 

3 Things Useful to Know

如果你学习数据挖掘, 下面这些基本概念很重要,

1. The TF.IDF measure of word importance. 
2. Hash functions and their use. 
3. Secondary storage (disk) and its effect on running time of algorithms. 
4. The base e of natural logarithms and identities involving that constant. 
5. Power laws.

 

3.1 Importance of Words in Documents

In several applications of data mining, we shall be faced with the problem of categorizing documents (sequences of words) by their topic. Typically, topics are identified by finding the special words that characterize documents about that topic.

这是个相当典型的数据挖掘问题...topic keywords extraction

而最基本的技术就是, TF.IDF (Term Frequency times Inverse Document Frequency).

词频(term frequency,TF)指的是某一个给定的词语在该文件中出现的次数。这个数字通常会被正规化,以防止它偏向长的文件

逆向文件频率(inverse document frequency,IDF)是一个词语普遍重要性的度量。某一特定词语的IDF,可以由总文件数目除以包含该词语之文件的数目,再将得到的商取对数得到

 

3.5 The Base of Natural Logarithms (自然对数)

The constant e = 2.7182818 · · · has a number of useful special properties. In particular, e is the limit of (1 + 1/x)xas x goes to infinity.

e的用处很大, 我不学数学不明白, 这儿通过e可以用近似来简化计算,

1. Consider (1+a)b, where a is small, We can thus approximate (1 + a)b as eab.

2. ex = 1 + x + x2/2 + x3/6 + x4/24 + · · ·

 

3.6 Power Laws (幂法则)

There are many phenomena that relate two variables by a power law, that is, a linear relationship between the logarithms of the variables.

 

这章就是综述性质的,

谈了什么是数据挖掘, 常用的方法思路是什么, 挖掘技术的局限是什么, 常用的基本概念.


本文章摘自博客园,原文发布日期:2011-07-06

目录
相关文章
|
11月前
|
机器学习/深度学习 存储 传感器
Automated defect inspection system for metal surfaces based on deep learning and data augmentation
简述:卷积变分自动编码器(CVAE)生成特定的图像,再使用基于深度CNN的缺陷分类算法进行分类。在生成足够的数据来训练基于深度学习的分类模型之后,使用生成的数据来训练分类模型。
96 0
《How Customers Are Using the IBM Data Science Experience-Expected Cases and Not So Expected Ones》电子版地
How Customers Are Using the IBM Data Science Experience-Expected Cases and Not So Expected Ones
《How Customers Are Using the IBM Data Science Experience-Expected Cases and Not So Expected Ones》电子版地
|
数据采集 算法 数据可视化
Why data mining| 学习笔记
快速学习 Why data mining。
59 0
Why data mining| 学习笔记
|
算法 搜索推荐 数据挖掘
What is data mining| 学习笔记
快速学习 What is data mining。
71 0
What is data mining| 学习笔记
|
搜索推荐 数据挖掘 开发者
Data mining process| 学习笔记
快速学习 Data mining process。
97 0
Data mining process| 学习笔记
Basic Concepts of Genetic Data Analysis
Basic Concepts of Genetic Data Analysis
878 0
|
算法
Reading《Practical lessons from predicting clicks on Ads at Facebook》(1)
版权声明:本文为博主原创文章,未经博主允许不得转载。 https://blog.csdn.net/sinat_32502811/article/details/80794980 因为在做京东的算法大赛,小白选手,看了一些别人的入门级程序,胡乱改了一通,也没有什么大的进展,而且感觉比赛的问题和点击率预估还是有点像的,所以搜了个论文来读,看看牛人们的思路。
2238 0
The Rising Smart Logistics Industry: How to Use Big Data to Improve Efficiency and Save Costs
This whitepaper will examine Alibaba Cloud’s Cainiao smart logistics cloud and Big Data powered platform and the underlying strategies used to optimiz.
1500 0
The Rising Smart Logistics Industry: How to Use Big Data to Improve Efficiency and Save Costs
|
安全 Go
Improving your Organizations Data Governance Scorecard
This whitepaper looks at how businesses can improve their scores on the tests of three fundamental data governance areas.
1319 0
|
关系型数据库 物联网 对象存储
Sharing, Storing, and Computing Massive Amounts of Data
Data is crucial to the operation of any business.
1600 0