独家 | 基于Python的遗传算法特征约简（附代码）-阿里云开发者社区

作者：Ahmed Gad

翻译：张睿毅

校对：丁楠雅

文章来源：微信公众号数据派THU

本教程主要使用numpy和sklearn来讨论如何使用遗传算法（genetic algorithm，GA）来减少从python中的Fruits360数据集提取的特征向量。

标签：深度学习，特征工程，遗传算法，神经网络，numpy，python，scikit-learn

本教程主要使用numpy和sklearn来讨论如何使用遗传算法（genetic algorithm，GA）来减少从python中的Fruits360数据集提取的特征向量。

导言

在某些情况下，使用原始数据训练机器学习算法可能不是合适的选择。该算法在接受原始数据训练时，必须进行特征挖掘，以检测不同组之间的差异。但这需要大量的数据来自动执行特征挖掘。对于小数据集，数据科学家最好自己进行特征挖掘步骤，之后告诉机器学习算法要使用哪个特征集。

使用的特征集必须能代表数据样本，因此我们必须注意选择最佳特征。数据科学家建议使用一些类型的特征，这些特征似乎有助于根据以前的经验来表示数据样本。一些特征可以证明它们在表示样本时的稳健性，而其他特征则不能。

可能存在一些类型的特征，会降低分类问题的准确性或增加回归问题的误差，进而影响训练模型的结果。例如，特征向量中可能存在一些噪音元素，因此它们应该被删除。特征向量也可能包含2个或更多相关元素。只使用一个元素就可以替代另一个元素。为了删除这些类型的元素，有两个有用的步骤，即特征选择和约简。本教程重点介绍特征约简。

假设有3个特征f1、f2和f3，每个特征都有3个特征元素。因此，特征向量长度为3x3=9。特征选择只选择特定类型的特征，不包括其他类型的特征。例如，只需选择f1和f3并删除f3。特征向量长度变成了6而不是9。在特征约简中，可以排除每个特征的特定元素。例如，此步骤可能会在保留第二个元素的同时从f3中删除第一个和第三个元素。因此，特征向量长度从9减少到7。

在开始本教程之前，值得一提的是，它是我的LinkedIn配置文件中先前发布的2个教程的扩展。

第一个教程的标题是“使用numpy的人工神经网络实现Fruits360图像数据集的分类”。它首先从Fruits360数据集的4个类中提取长度为360的特征向量。然后，利用numpy从零开始构建人工神经网络（ANN），对数据集进行分类。

第一个教程可从以下网址获取：

https://www.linkedin.com/pulse/artificial-neural-network-implementation-using-numpy-fruits360-gad

其Github项目可从以下网址获得：

https://github.com/ahmedfgad/NumPyAN

第二个教程是“使用遗传算法的人工神经网络优化”。建立并使用遗传算法对神经网络参数进行优化，以提高分类精度。

第二个教程可从以下网址获取：

https://www.linkedin.com/pulse/artificial-neural-networks-optimization-using-genetic-ahmed-gad。

其Github项目也可从以下网址获得：

https://github.com/ahmedfgad/NeuralGeneti

本教程讨论了如何使用遗传算法来减少从长度360的Fruits360数据集中提取的特征向量。本教程首先讨论要遵循的步骤。其次通过使用NumPy和Sklearn在python实现这些步骤。

本教程的实现可在我的Github页面中找到：

https://github.com/ahmedfgad/FeatureReductionGeneti

遗传算法从一个初始群体开始，该群体由若干染色体（即解决方法）组成，其中每个染色体都有一系列基因。使用适应函数，遗传算法选择最佳的解决方案作为父母来创建一个新的群体。在这样一个新的群体中，通过在双亲上应用两个操作，即杂交和变异来创建新的解决方案。当把遗传算法应用到一个给定的问题上时，我们必须确定基因的表示、合适的适应函数以及杂交和变异是如何应用的。接下来让我们看看运行原理。

染色体的表达

遗传算法中的基因是染色体的组成部分。首先，我们需要确定染色体内的基因。为此，考虑到可能影响结果的每一种属性都应被视为一个基因。因为我们问题的目标是选择最好的一组特征元素，所以如果选择或不选择，每个特征元素都可能影响结果。因此，每个特征元素都被视为一个基因。染色体将由所有基因（即所有特征元素）组成。因为有360个特征元素，那么就有360个基因。一个很好的信息现在很清楚，染色体的长度是360。

在确定所选基因是什么之后，下一步就是确定基因的表达。有不同的表示形式，如十进制、二进制、浮点、字符串等。我们的目标是知道基因（即特征元素）是否在减少的特征集中被选择。因此，分配给基因的值应该反映它是否被选择。基于这种描述，很明显每个基因有两个可能的值。一个值表示该基因已被选中，另一个值表示未被选中。因此，二进制表示是最佳选择。当基因值为1时，将在减少的特征集中进行选择。当为0时，则忽略它。

总之，染色体将由360个基因组成，以二进制表示。根据下一个图，特征向量和染色体之间有一对一的映射。这是染色体中的第一个基因与特征向量中的第一个元素相连。当该基因的值为1时，这意味着选择了特征向量中的第一个元素。

适应函数

通过了解如何创建染色体，可以很容易地对初始种群进行随机初始化。初始化后，将选择父级。遗传算法基于达尔文的“适者生存”理论。这是目前选择的最佳解决方案进行组合，以产生更好的解决方案。通过保留好的解和消除坏的解，我们可以得到最优或半最优解。

选择双亲的标准是与每个解决方案（即染色体）相关联的适应值。适合度越高，解决方案越好。使用适应度函数计算适应度值。那么，在我们的问题中，最适合使用的功能是什么？我们问题的目标是创建一个约简的特征向量，以提高分类精度。因此，判断一个解是否好的标准是分类的准确性。因此，fitness函数将返回一个数字，指定每个解决方案的分类精度。精度越高，解决方案越好。

为了返回分类的准确度，必须有一个机器学习模型来通过每个解决方案返回的特征元素进行训练。对于这种情况，我们将使用支持向量分类器（SVC）。

数据集分为训练样本和测试样本。根据训练数据，SVC将使用人群中每个解决方案选择的特征元素进行训练。经过训练后，根据测试数据进行测试。

根据每个解的适合度值，我们可以选择其中最好的作为父母。这些父母被放在一起组合以产生后代，这将是下一代的新人口的成员。这种后代是通过对选定的亲本应用交叉和突变操作而产生的。让我们按照下面讨论的方式配置这些操作。

遗传和变异

基于适应度函数，我们可以筛选出当前群体中的最优解，即父辈。遗传算法假设匹配2个好的解决方案将产生第三个更好的解决方案。组合意味着从两个父母那里交换一些基因。使用遗传操作交换基因。有不同的方法可以应用这种操作。本教程使用单点交叉，其中一个点分割染色体。点前的基因取自一组解，点后的基因取自另一组解。

通过应用遗传，所有的基因都来自于以前的父母。在新的后代中没有引入新的基因。如果所有的父母都有一个坏基因，那么这个基因就会转移到后代身上。正因为如此，为了在后代中引入新的基因，采用了突变操作。在基因的二元表示中，突变是通过翻转一些随机选择的基因的值来实现的。如果基因值为1，则为0，反之亦然。

在产生后代之后，我们可以创造下一代的新种群。除了后代之外，这个群体还包括以前的父辈。

此时，将讨论所有步骤。接下来是用Python实现它们。注意，我以前写过一篇题为“Python中的遗传算法实现”的教程，用于在Python中实现遗传算法，我将修改它的代码来解决我们的问题。最好读一下。

利用Python实现

该项目分为两个文件。一个文件名为GA.py，它将遗传算法步骤的实现保存为函数。另一个文件是主文件，它只导入这个文件，并在循环中调用它的函数，该循环将迭代几代。

根据下面的代码，主文件首先读取从Fruits360数据集提取的特性。这些特性返回到数据输入变量中。有关提取这些功能的详细信息，请参阅本教程开头提到的2个教程。该文件还读取与数据输出变量中的样本相关联的类标签。

选择一些样本进行训练，其索引存储在train_indices变量中。同样，测试样本索引存储在test_indices变量中。


import numpy
import GA
import pickle
import matplotlib.pyplot

f = open("dataset_features.pkl", "rb")
data_inputs = pickle.load(f)
f.close()

f = open("outputs.pkl", "rb")
data_outputs = pickle.load(f)
f.close()

num_samples = data_inputs.shape[0]
num_feature_elements = data_inputs.shape[1]
train_indices = numpy.arange(1, num_samples, 4)
test_indices = numpy.arange(0, num_samples, 4)

print("Number of training samples: ", train_indices.shape[0])
print("Number of test samples: ", test_indices.shape[0])

"""
Genetic algorithm parameters:
    Population size
    Mating pool size
    Number of mutations
"""
sol_per_pop = 8 # Population size.
num_parents_mating = 4 # Number of parents inside the mating pool.
num_mutations = 3 # Number of elements to mutate.

# Defining the population shape.
pop_shape = (sol_per_pop, num_feature_elements)
# Creating the initial population.
new_population = numpy.random.randint(low=0, high=2, size=pop_shape)
print(new_population.shape)
best_outputs = []
num_generations = 100

它初始化了遗传算法的所有参数。这包括根据sol_per_pop变量设置为8的每个群体的解的数量、num _parents_mating变量设置为4的子代数量以及num_mutations变量设置为3的突变数量。之后，它会在一个名为“new_population”的变量中随机创建初始总体。

有一个名为best_outputs的空列表，它在每一代之后都保存着最好的结果。这有助于可视化遗传算法在完成所有代之后的进展。num_generations变量中的代数设置为100。请注意，您可以更改所有这些参数，从而获得更好的结果。

在准备好特性、类标签和算法参数之后，我们可以根据下一个代码对算法进行迭代。首先，通过调用GA文件中定义的名为cal_pop_fitness（）的适应函数来计算所有解决方案的适应值。此函数接受当前总体、提取的特征、类标签、列车索引和测试索引。函数返回名为fitness的变量中所有解的适应值。请记住，适合度值表示分类精度。最佳（即最高）分类精度保存在最佳输出列表中。

根据计算出的适合度值，使用GA.py文件中定义的select_matching_pool（）函数选择分类精度最高的最佳解决方案作为匹配池中的父级。它接受当前的人口、适合度值和要返回的父母人数。它将所选双亲返回到父级变量中。

for generation in range(num_generations):
    print("Generation : ", generation)
    
    # Measuring the fitness of each chromosome in the population.
    fitness = GA.cal_pop_fitness(new_population, data_inputs, data_outputs, train_indices, test_indices)
    best_outputs.append(numpy.max(fitness))

    # The best result in the current iteration.
    print("Best result : ", best_outputs[-1])
    
    # Selecting the best parents in the population for mating
    parents = GA.select_mating_pool(new_population, fitness, num_parents_mating)
   
    # Generating next generation using crossover
    offspring_crossover = GA.crossover(parents, offspring_size=(pop_shape[0]-parents.shape[0], num_feature_elements))
    
    # Adding some variations to the offspring using mutation.
    offspring_mutation = GA.mutation(offspring_crossover, num_mutations=num_mutations)

    # Creating the new population based on the parents and offspring.
    new_population[0:parents.shape[0], :] = parents
    new_population[parents.shape[0]:, :] = offspring_mutation

接下来是对选定的父代应用组合操作以创建子代。这是在GA.py文件中定义的crossover（）函数内完成的。它接受父数组和子数组的形状，以便稍后返回到offspring_crossover变量中。然后，使用在GA.py文件中也可用的mutation（）函数在该数组上应用突变操作。除了交叉结果，这个函数接受突变的数量。

因为新的种群由选定的亲本和后代组成，所以亲本和offspring_crossover数组都保存到new_population变量中。在那之后，新一代被应用于新的人口。

在所有代完成后，将执行下一个代码，以返回最佳选择的功能元素集和所选元素的数量。在100代完成后，该算法使用174个特征元素，以达到99.59%的精度。

fitness = GA.cal_pop_fitness(new_population, data_inputs, data_outputs, train_indices, test_indices)
# Then return the index of that solution corresponding to the best fitness.
best_match_idx = numpy.where(fitness == numpy.max(fitness))[0]

best_match_idx = best_match_idx[0]
best_solution = new_population[best_match_idx, :]
best_solution_indices = numpy.where(best_solution == 1)[0]
best_solution_num_elements = best_solution_indices.shape[0]
best_solution_fitness = fitness[best_match_idx]

print("best_match_idx : ", best_match_idx)
print("best_solution : ", best_solution)
print("Selected indices : ", best_solution_indices)
print("Number of selected elements : ", best_solution_num_elements)
print("Best solution fitness : ", best_solution_fitness)

matplotlib.pyplot.plot(best_outputs)
matplotlib.pyplot.xlabel("Iteration")
matplotlib.pyplot.ylabel("Fitness")
matplotlib.pyplot.show()

上面的代码展示了一张图，显示了算法在所有代中的进度，如下所示。

以下是主文件中的完整代码。

import numpy
import GA
import pickle
import matplotlib.pyplot

f = open("dataset_features.pkl", "rb")
data_inputs = pickle.load(f)
f.close()

f = open("outputs.pkl", "rb")
data_outputs = pickle.load(f)
f.close()

num_samples = data_inputs.shape[0]
num_feature_elements = data_inputs.shape[1]
train_indices = numpy.arange(1, num_samples, 4)
test_indices = numpy.arange(0, num_samples, 4)

print("Number of training samples: ", train_indices.shape[0])
print("Number of test samples: ", test_indices.shape[0])

"""
Genetic algorithm parameters:
    Population size
    Mating pool size
    Number of mutations

"""

sol_per_pop = 8 # Population size
num_parents_mating = 4 # Number of parents inside the mating pool.
num_mutations = 3 # Number of elements to mutate.

# Defining the population shape.
pop_shape = (sol_per_pop, num_feature_elements)

# Creating the initial population.
new_population = numpy.random.randint(low=0, high=2, size=pop_shape)
print(new_population.shape)
best_outputs = []
num_generations = 100

for generation in range(num_generations):

    print("Generation : ", generation)
    # Measuring the fitness of each chromosome in the population.
    fitness = GA.cal_pop_fitness(new_population, data_inputs, data_outputs, train_indices, test_indices)
    best_outputs.append(numpy.max(fitness))

    # The best result in the current iteration.
    print("Best result : ", best_outputs[-1])

    # Selecting the best parents in the population for mating.
    parents = GA.select_mating_pool(new_population, fitness, num_parents_mating)
   
    # Generating next generation using crossover.
    offspring_crossover = GA.crossover(parents, offspring_size=(pop_shape[0]-parents.shape[0], num_feature_elements))

    # Adding some variations to the offspring using mutation.
    offspring_mutation = GA.mutation(offspring_crossover, num_mutations=num_mutations)

    # Creating the new population based on the parents and offspring.
    new_population[0:parents.shape[0], :] = parents
    new_population[parents.shape[0]:, :] = offspring_mutation
 
# Getting the best solution after iterating finishing all generations.
# At first, the fitness is calculated for each solution in the final generation.
fitness = GA.cal_pop_fitness(new_population, data_inputs, data_outputs, train_indices, test_indices)

# Then return the index of that solution corresponding to the best fitness.
best_match_idx = numpy.where(fitness == numpy.max(fitness))[0]
best_match_idx = best_match_idx[0]

best_solution = new_population[best_match_idx, :]
best_solution_indices = numpy.where(best_solution == 1)[0]
best_solution_num_elements = best_solution_indices.shape[0]
best_solution_fitness = fitness[best_match_idx]

print("best_match_idx : ", best_match_idx)
print("best_solution : ", best_solution)
print("Selected indices : ", best_solution_indices)
print("Number of selected elements : ", best_solution_num_elements)
print("Best solution fitness : ", best_solution_fitness)

matplotlib.pyplot.plot(best_outputs)
matplotlib.pyplot.xlabel("Iteration")
matplotlib.pyplot.ylabel("Fitness")
matplotlib.pyplot.show()

GA.py的实现

GA.py文件的实现如下所示。在cal_pop_fitness（）函数中，SVC根据每个解决方案选择的特征元素进行培训。在训练前，根据所选的基因值为1的元素过滤特征。这是在reduce_features（）函数中完成的。除了所有示例的完整功能外，它还接受当前的解决方案。

训练后，使用reduce_features（）函数计算分类精度。此函数返回存储在cal pop_fitness（）函数中名为accuracies的数组中的精度。

crossover（）和mutation（）函数的实现与我之前的教程“Python中的遗传算法实现”中讨论的非常相似。一个主要的区别是，mutation（）函数通过翻转随机选择的基因的值来改变它们，因为我们使用的是二进制表示。

import numpy
import sklearn.svm
def reduce_features(solution, features):
    selected_elements_indices = numpy.where(solution == 1)[0]
    reduced_features = features[:, selected_elements_indices]
    return reduced_features

def classification_accuracy(labels, predictions):
    correct = numpy.where(labels == predictions)[0]
    accuracy = correct.shape[0]/labels.shape[0]
    return accuracy

def cal_pop_fitness(pop, features, labels, train_indices, test_indices):
    accuracies = numpy.zeros(pop.shape[0])
    idx = 0
    for curr_solution in pop:
        reduced_features = reduce_features(curr_solution, features)
        train_data = reduced_features[train_indices, :]
        test_data = reduced_features[test_indices, :]
        train_labels = labels[train_indices]
        test_labels = labels[test_indices]
        SV_classifier = sklearn.svm.SVC(gamma='scale')
        SV_classifier.fit(X=train_data, y=train_labels)
        predictions = SV_classifier.predict(test_data)
        accuracies[idx] = classification_accuracy(test_labels, predictions)
        idx = idx + 1
    return accuracies

def select_mating_pool(pop, fitness, num_parents):
    # Selecting the best individuals in the current generation as parents for producing the offspring of the next generation.
    parents = numpy.empty((num_parents, pop.shape[1]))
    for parent_num in range(num_parents):
        max_fitness_idx = numpy.where(fitness == numpy.max(fitness))
        max_fitness_idx = max_fitness_idx[0][0]
        parents[parent_num, :] = pop[max_fitness_idx, :]
        fitness[max_fitness_idx] = -99999999999
    return parents

def crossover(parents, offspring_size):
    offspring = numpy.empty(offspring_size)
    # The point at which crossover takes place between two parents. Usually, it is at the center.
    crossover_point = numpy.uint8(offspring_size[1]/2)

    for k in range(offspring_size[0]):
        # Index of the first parent to mate.
        parent1_idx = k%parents.shape[0]
        # Index of the second parent to mate.
        parent2_idx = (k+1)%parents.shape[0]
        
        # The new offspring will have its first half of its genes taken from the first parent.
        offspring[k, 0:crossover_point] = parents[parent1_idx, 0:crossover_point]
        # The new offspring will have its second half of its genes taken from the second parent.
        offspring[k, crossover_point:] = parents[parent2_idx, crossover_point:]
    return offspring

def mutation(offspring_crossover, num_mutations=2):
    mutation_idx = numpy.random.randint(low=0, high=offspring_crossover.shape[1], size=num_mutations)
    # Mutation changes a single gene in each offspring randomly.
    for idx in range(offspring_crossover.shape[0]):
        # The random value to be added to the gene.
         offspring_crossover[idx, mutation_idx] = 1 - offspring_crossover[idx, mutation_idx]
    return offspring_crossover

原文标题：

Feature Reduction using Genetic Algorithm with Python

原文链接：

https://www.kdnuggets.com/2019/03/feature-reduction-genetic-algorithm-python.html

译者简介

张睿毅，北京邮电大学大二物联网在读。我是一个爱自由的人。在邮电大学读第一年书我就四处跑去蹭课，折腾整一年惊觉，与其在当下焦虑，不如在前辈中沉淀。于是在大二以来，坚持读书，不敢稍歇。资本主义国家的科学观不断刷新我的认知框架，同时因为出国考试很早出分，也更早地感受到自己才是那个一直被束缚着的人。太多真英雄在社会上各自闪耀着光芒。这才开始，立志终身向遇到的每一个人学习。做一个纯粹的计算机科学里面的小学生。

翻译组招募信息

工作内容：需要一颗细致的心，将选取好的外文文章翻译成流畅的中文。如果你是数据科学/统计学/计算机类的留学生，或在海外从事相关工作，或对自己外语水平有信心的朋友欢迎加入翻译小组。

你能得到：定期的翻译培训提高志愿者的翻译水平，提高对于数据科学前沿的认知，海外的朋友可以和国内技术应用发展保持联系，THU数据派产学研的背景为志愿者带来好的发展机遇。

其他福利：来自于名企的数据科学工作者，北大清华以及海外等名校学生他们都将成为你在翻译小组的伙伴。

独家 | 基于Python的遗传算法特征约简（附代码）

导言

更多关于遗传算法的信息

染色体的表达

适应函数

遗传和变异

利用Python实现

GA.py的实现

译者简介

数据派

热门文章

最新文章

相关课程

相关电子书

相关实验场景