
## Machine Learning: Building a Decision Tree Model with sklearn

import matplotlib.pyplot as plt
import pandas as pd

%matplotlib inline

from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing()
print(housing.DESCR)

.. _california_housing_dataset:

## California Housing dataset

Data Set Characteristics:

:Number of Instances: 20640

:Number of Attributes: 8 numeric, predictive attributes and the target

:Attribute Information:
- MedInc        median income in block
- HouseAge      median house age in block
- AveRooms      average number of rooms
- AveBedrms     average number of bedrooms
- Population    block population
- AveOccup      average house occupancy
- Latitude      house block latitude
- Longitude     house block longitude

:Missing Attribute Values: None


This dataset was obtained from the StatLib repository.
http://lib.stat.cmu.edu/datasets/

The target variable is the median house value for California districts.

This dataset was derived from the 1990 U.S. census, using one row per census
block group. A block group is the smallest geographical unit for which the U.S.
Census Bureau publishes sample data (a block group typically has a population
of 600 to 3,000 people).

It can be downloaded/loaded using the
:func:`sklearn.datasets.fetch_california_housing` function.

.. topic:: References

- Pace, R. Kelley and Ronald Barry, Sparse Spatial Autoregressions,
Statistics and Probability Letters, 33 (1997) 291-297

housing.data.shape
(20640, 8)

housing.data[0]
array([   8.3252    ,   41.        ,    6.98412698,    1.02380952,
        322.        ,    2.55555556,   37.88      , -122.23      ])
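To make the raw row easier to read, each value can be paired with its column name. The sketch below assumes the columns come in the same order as the attribute list in the dataset description; the values are copied by hand from the output above, so nothing is downloaded.

```python
# Feature names, in the order given by the dataset description above
# (assumed to match the column order of housing.data).
feature_names = ["MedInc", "HouseAge", "AveRooms", "AveBedrms",
                 "Population", "AveOccup", "Latitude", "Longitude"]

# First sample, copied from the housing.data[0] output above.
first_row = [8.3252, 41.0, 6.98412698, 1.02380952,
             322.0, 2.55555556, 37.88, -122.23]

# Pair every value with its column name.
first_sample = dict(zip(feature_names, first_row))
for name, value in first_sample.items():
    print(f"{name:<12}{value}")

# Columns 6 and 7 hold the geographic coordinates.
print(feature_names[6:8])
```

Columns 6 and 7 are the two features that the first tree in this post is fitted on.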

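(1) criterion: `gini` or `entropy`, i.e. Gini impurity or information entropy. (These are the classification criteria; the regression tree used below defaults to `mse`.)
(2) splitter: `best` or `random`. `best` looks for the best split point across all features; `random` considers only a random subset of the candidate splits, which can help when the dataset is large.
(3) max_features: None (all features), `log2`, `sqrt`, or a number N. With fewer than 50 features, using all of them is usually fine.
(4) max_depth: with little data or few features this can be left alone. With many samples and many features, try limiting the depth of the tree; looping over several max_depth values is a reasonable way to find the best one. (One of the most commonly used parameters.)
(5) min_samples_split: a node with fewer samples than min_samples_split is not split any further. With a small sample size this can be left alone; when the sample size is very large, increasing it is recommended. (One of the most commonly used parameters.)
(6) min_samples_leaf: the minimum number of samples a leaf must contain. If a leaf would end up with fewer samples than this value, it is pruned together with its sibling. With a small sample size this can be left alone; for larger datasets, around 100,000 samples, it is worth trying a larger value.
(7) min_weight_fraction_leaf: the minimum total sample weight a leaf must hold; a leaf below it is pruned together with its sibling. The default is 0, meaning weights are ignored. Sample weights usually come into play when many samples have missing values or the class distribution of a classification tree is heavily skewed, and that is when this value deserves attention.
(8) max_leaf_nodes: limiting the maximum number of leaf nodes helps prevent overfitting. The default is None, meaning no limit. With a limit, the algorithm builds the best tree it can within that number of leaves. With few features this can be ignored; with many features it is worth limiting, and the concrete value can be found by cross-validation.
(9) class_weight: the weight of each class, mainly to keep the tree from leaning too heavily toward classes that are over-represented in the training set. Weights can be specified by hand; with "balanced" the algorithm computes them itself, giving higher weights to classes with fewer samples.
(10) min_impurity_split: a threshold that limits tree growth. If a node's impurity (Gini impurity, information gain, mean squared error, or mean absolute error) is below the threshold, the node is not split and becomes a leaf. (Deprecated in later sklearn releases in favor of min_impurity_decrease.)
(11) n_estimators: the number of trees to build. (This is a parameter of ensembles such as RandomForestRegressor, not of a single decision tree.)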
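To make the two `criterion` options in (1) concrete, here is a minimal sketch of both impurity measures in plain Python. The helper names `gini` and `entropy` are made up for this illustration and are not sklearn functions.

```python
import math
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum(p_k^2) over the class proportions p_k."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def entropy(labels):
    """Shannon entropy in bits: -sum(p_k * log2(p_k))."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

pure = ["a", "a", "a", "a"]    # a single class: no impurity
mixed = ["a", "a", "b", "b"]   # two classes, evenly mixed

print(gini(pure), entropy(pure))
print(gini(mixed), entropy(mixed))
```

Both measures are zero for a pure node and largest for an even mix, so a split that lowers them yields more homogeneous children.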

from sklearn import tree  # import the decision tree module
dtr = tree.DecisionTreeRegressor(max_depth=2)  # regression tree limited to depth 2
dtr.fit(housing.data[:, [6, 7]], housing.target)  # X: Latitude and Longitude; y: median house value

DecisionTreeRegressor(criterion='mse', max_depth=2, max_features=None,
                      max_leaf_nodes=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      presort=False, random_state=None, splitter='best')
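With `criterion='mse'` and `splitter='best'`, each node picks the threshold that leaves the smallest weighted mean squared error in its two children. The following is a simplified sketch of that search on a single feature, not sklearn's actual implementation; the function names and the toy data are made up for illustration.

```python
def mse(values):
    """Mean squared error of the values around their own mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def best_split(xs, ys):
    """Return (threshold, weighted_mse) of the best binary split on one feature."""
    pairs = sorted(zip(xs, ys))
    n = len(pairs)
    best = (None, float("inf"))
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # identical feature values cannot be separated
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        # Child errors weighted by the share of samples in each child.
        score = (len(left) * mse(left) + len(right) * mse(right)) / n
        if score < best[1]:
            best = (threshold, score)
    return best

# Toy data (made up): the target jumps between x = 3 and x = 10.
xs = [1, 2, 3, 10, 11, 12]
ys = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
threshold, score = best_split(xs, ys)
print(threshold, score)
```

A tree with `max_depth=2` repeats this search at most twice along any path, so it ends up with at most four leaves.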

# Temporarily add Graphviz to the PATH environment variable

import os
os.environ["PATH"] += os.pathsep + 'D:/program files (x86)/Graphviz2.38/bin/'  # adjust to your own install path

dot_data = tree.export_graphviz(dtr,  # the fitted decision tree object
                                out_file=None,
                                feature_names=housing.feature_names[6:8],  # names of the features used for fitting
                                filled=True,
                                impurity=False,
                                rounded=True)

# pip install pydotplus

import pydotplus

graph = pydotplus.graph_from_dot_data(dot_data)
graph.get_nodes()[7].set_fillcolor("#FFF2DD")
from IPython.display import Image
Image(graph.create_png())

from sklearn.model_selection import train_test_split
data_train, data_test, target_train, target_test = train_test_split(
    housing.data, housing.target, test_size=0.1, random_state=42)

dtr = tree.DecisionTreeRegressor(random_state=42)
dtr.fit(data_train, target_train)

dtr.score(data_test, target_test)

0.637355881715626
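For a regressor, `score` returns the coefficient of determination R², so the value above reads as the share of the variance in the test targets that the tree explains. A minimal sketch of that formula follows; `r_squared` is a hypothetical helper written for this illustration, and the numbers are toy values.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean) ** 2 for t in y_true)               # total sum of squares
    return 1.0 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0]
print(r_squared(y_true, y_true))                # perfect predictions
print(r_squared(y_true, [2.5, 2.5, 2.5, 2.5]))  # always predicting the mean
```

Perfect predictions give 1, always predicting the mean gives 0, and a model worse than the mean scores below 0.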

from sklearn.ensemble import RandomForestRegressor  # "Regressor" means a regression model

# random_state fixes the forest's random sampling so that every run gives the same result

rfr = RandomForestRegressor(random_state=42)
rfr.fit(data_train, target_train)
rfr.score(data_test, target_test)

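0.7910601348350835

GridSearchCV exists to automate parameter tuning: give it the candidate values and it returns the best-scoring combination together with its score. It suits small datasets, though. Once the data grows, a result becomes hard to obtain in reasonable time, because the search amounts to looping over every combination of the given parameters and cross-validating each one, which is very time-consuming.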
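To see where the cost comes from, the candidate combinations can be enumerated by hand. This is a rough sketch that assumes a grid of three values for each of two parameters and five folds, matching the search that follows; it only counts the fits and does not train anything.

```python
from itertools import product

# Assumed grid and fold count, matching the search that follows.
param_grid = {"min_samples_split": [3, 6, 9], "n_estimators": [10, 50, 100]}
cv = 5

# Cartesian product of all parameter values, one dict per candidate.
names = list(param_grid)
candidates = [dict(zip(names, combo)) for combo in product(*param_grid.values())]

for params in candidates:
    print(params)

# Every candidate is trained once per fold.
total_fits = len(candidates) * cv
print(len(candidates), "candidates x", cv, "folds =", total_fits, "fits")
```

Each additional parameter multiplies the number of candidates, which is why the running time grows so quickly. With the default `refit=True`, GridSearchCV also trains the best candidate once more on the full training set.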

# sklearn.grid_search (whose CV iterator interface differs from the new one) was removed in 0.20;
# import GridSearchCV from sklearn.model_selection instead.

from sklearn.model_selection import GridSearchCV
tree_param_grid = {'min_samples_split': [3, 6, 9], 'n_estimators': [10, 50, 100]}

# cv is short for cross-validation; it sets the number of folds

grid = GridSearchCV(RandomForestRegressor(), param_grid=tree_param_grid, cv=5)
grid.fit(data_train, target_train)

# grid_scores_ was removed in sklearn 0.20; use cv_results_ instead.

grid.cv_results_, grid.best_params_, grid.best_score_

({'mean_fit_time': array([0.91196742, 4.46895003, 8.89996696, 0.90845881, 4.01207662,
         9.11067271, 0.84911356, 4.16957936, 8.08404155]),
  'std_fit_time': array([0.04628971, 0.19323399, 0.36771072, 0.07048984, 0.05280237,
         0.55379083, 0.0599862 , 0.19719896, 0.34949627]),
  'mean_score_time': array([0.00918159, 0.0467237 , 0.08795581, 0.00958099, 0.03958073,
         0.08624392, 0.01018567, 0.03616033, 0.06846623]),
  'std_score_time': array([0.00367907, 0.00559777, 0.00399863, 0.00047935, 0.00082726,
         0.0135891 , 0.0003934 , 0.0052837 , 0.00697507]),
  'param_min_samples_split': masked_array(data=[3, 3, 3, 6, 6, 6, 9, 9, 9],
               mask=[False, False, False, False, False, False, False, False,
                     False],
         fill_value='?',
              dtype=object),
  'param_n_estimators': masked_array(data=[10, 50, 100, 10, 50, 100, 10, 50, 100],
               mask=[False, False, False, False, False, False, False, False,
                     False],
         fill_value='?',
              dtype=object),
  'params': [{'min_samples_split': 3, 'n_estimators': 10},
   {'min_samples_split': 3, 'n_estimators': 50},
   {'min_samples_split': 3, 'n_estimators': 100},
   {'min_samples_split': 6, 'n_estimators': 10},
   {'min_samples_split': 6, 'n_estimators': 50},
   {'min_samples_split': 6, 'n_estimators': 100},
   {'min_samples_split': 9, 'n_estimators': 10},
   {'min_samples_split': 9, 'n_estimators': 50},
   {'min_samples_split': 9, 'n_estimators': 100}],
  'split0_test_score': array([0.79254741, 0.80793267, 0.81163631, 0.78859073, 0.81211894,
         0.81222231, 0.79241065, 0.80784586, 0.80958409]),
  'split1_test_score': array([0.77856084, 0.80047265, 0.80266101, 0.78474831, 0.79898533,
         0.80203702, 0.77912397, 0.79714354, 0.80029259]),
  'split2_test_score': array([0.78105784, 0.80063538, 0.8052804 , 0.78584898, 0.8036029 ,
         0.80240046, 0.78148243, 0.79955117, 0.80072995]),
  'split3_test_score': array([0.79582001, 0.80687008, 0.8100583 , 0.79947207, 0.80958334,
         0.80851996, 0.78633104, 0.80797192, 0.81129754]),
  'split4_test_score': array([0.79103059, 0.8071791 , 0.81016989, 0.78011578, 0.80719335,
         0.81117408, 0.79282783, 0.8064226 , 0.8085679 ]),
  'mean_test_score': array([0.78780359, 0.80461815, 0.80796138, 0.78775522, 0.80629709,
         0.80727103, 0.7864355 , 0.80378723, 0.8060946 ]),
  'std_test_score': array([0.00675437, 0.00333656, 0.00340769, 0.00646542, 0.00460913,
         0.00429948, 0.00556   , 0.00453896, 0.00464337]),
  'rank_test_score': array([7, 5, 1, 8, 3, 2, 9, 6, 4])},
 {'min_samples_split': 3, 'n_estimators': 100},
 0.8079613788142571)
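The winning combination is the candidate with the highest mean score across the five folds. The sketch below reproduces that selection from the printed numbers, assuming `best_params_` is chosen by an argmax over `mean_test_score`; the values are copied by hand from the output above.

```python
# Values copied from the 'params' and 'mean_test_score' entries printed above.
params = [
    {"min_samples_split": s, "n_estimators": n}
    for s in (3, 6, 9) for n in (10, 50, 100)
]
mean_test_score = [0.78780359, 0.80461815, 0.80796138, 0.78775522, 0.80629709,
                   0.80727103, 0.7864355, 0.80378723, 0.8060946]

# Index of the highest mean score, i.e. the candidate ranked 1.
best_index = max(range(len(mean_test_score)), key=mean_test_score.__getitem__)
best_params = params[best_index]
best_score = mean_test_score[best_index]

# Rank 1 = best; mirrors the layout of 'rank_test_score'.
order = sorted(range(len(mean_test_score)), key=lambda i: -mean_test_score[i])
ranks = [0] * len(order)
for rank, idx in enumerate(order, start=1):
    ranks[idx] = rank

print(best_params, best_score)
print(ranks)
```

The gaps between the top candidates are smaller than the standard deviations across folds shown in `std_test_score`, so the ranking among them should be read with some caution.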

rfr = RandomForestRegressor(min_samples_split=3, n_estimators=100, random_state=42)
rfr.fit(data_train, target_train)
rfr.score(data_test, target_test)

0.8088623476993486
