Spark: Spark SQL In-Depth Study Series, Part 8 (reposted from OopsOutOfMemory)

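In the SQL world, besides the commonly used built-in functions, a system generally also exposes an extensible interface for user-defined functions; this has become a de facto standard.

The earlier article "Spark SQL Source Code Analysis: The Core Flow" introduced the role of the Spark SQL Catalyst Analyzer, including the ResolveFunctions rule that resolves functions. With the release of Spark 1.1, however, the Spark SQL code gained many refinements and new features, so it differs somewhat from my earlier analysis based on the 1.0 source. One example is UDF support:

Implementation in Spark 1.0 and earlier: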

 protected[sql] lazy val catalog: Catalog = new SimpleCatalog   
 @transient   
 protected[sql] lazy val analyzer: Analyzer =   
   new Analyzer(catalog, EmptyFunctionRegistry, caseSensitive = true) // EmptyFunctionRegistry is an empty implementation
 @transient   
 protected[sql] val optimizer = Optimizer   

Implementation in Spark 1.1 and later:

 protected[sql] lazy val functionRegistry: FunctionRegistry = new SimpleFunctionRegistry // SimpleFunctionRegistry supports simple UDFs
    
 @transient   
 protected[sql] lazy val analyzer: Analyzer =   
   new Analyzer(catalog, functionRegistry, caseSensitive = true)   

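1. Introduction

A function that appears in a SQL statement is parsed by SqlParser into an UnresolvedFunction, which the Analyzer resolves later.

SqlParser:

Besides the built-in functions, users can define their own functions, and the SQL parser parses calls to them in the same way.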

 ident ~ "(" ~ repsep(expression, ",") <~ ")" ^^ {
     case udfName ~ _ ~ exprs => UnresolvedFunction(udfName, exprs)
 }

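The udfName and exprs passed in by SqlParser are wrapped in a case class, UnresolvedFunction, which extends Expression.

However, the dataType and similar attributes of this Expression, as well as its eval method, cannot be used; accessing them throws an exception, because the node has not been resolved yet and is only a carrier.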

 case class UnresolvedFunction(name: String, children: Seq[Expression]) extends Expression {   
   override def dataType = throw new UnresolvedException(this, "dataType")   
   override def foldable = throw new UnresolvedException(this, "foldable")   
   override def nullable = throw new UnresolvedException(this, "nullable")   
   override lazy val resolved = false   
    
   // Unresolved functions are transient at compile time and don't get evaluated during execution.   
   override def eval(input: Row = null): EvaluatedType =   
     throw new TreeNodeException(this, s"No function to evaluate expression. type: ${this.nodeName}")   
    
   override def toString = s"'$name(${children.mkString(",")})"   
 }
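
The placeholder idea can be tried without Spark. The following is a minimal sketch under stated assumptions: `Expr`, `Literal`, `UnresolvedFn`, and `UnresolvedAccess` are hypothetical, simplified stand-ins and not Catalyst's real classes. It only shows a node that carries a name and arguments while refusing to answer questions about its type.

```scala
object UnresolvedSketch {
  // Hypothetical, simplified stand-in for Catalyst's Expression.
  trait Expr {
    def dataType: String
    def resolved: Boolean = true
  }

  // Thrown when an attribute of an unresolved node is accessed.
  final class UnresolvedAccess(attr: String)
    extends RuntimeException(s"Invalid call to $attr on unresolved object")

  final case class Literal(value: Any, dataType: String) extends Expr

  // Placeholder node: it carries the function name and arguments, nothing more.
  final case class UnresolvedFn(name: String, children: Seq[Expr]) extends Expr {
    def dataType: String = throw new UnresolvedAccess("dataType")
    override def resolved: Boolean = false
    override def toString: String = s"'$name(${children.mkString(",")})"
  }

  // Returns the printed form of the node and whether reading dataType threw.
  def demo(): (String, Boolean) = {
    val u = UnresolvedFn("len", Seq(Literal("abc", "string")))
    val threw =
      try { u.dataType; false }
      catch { case _: UnresolvedAccess => true }
    (u.toString, threw)
  }
}
```

Calling `UnresolvedSketch.demo()` builds an unresolved `len` call and reports that reading its `dataType` fails, which is the behaviour described above.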

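Analyzer:

When the Analyzer is initialized it needs a Catalog, which holds the metadata for databases and tables, and a FunctionRegistry, which maintains the mapping from UDF names to UDF implementations. SimpleFunctionRegistry is used here.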

 /**  
  * Replaces [[UnresolvedFunction]]s with concrete [[catalyst.expressions.Expression Expressions]].  
  */   
 object ResolveFunctions extends Rule[LogicalPlan] {   
   def apply(plan: LogicalPlan): LogicalPlan = plan transform {   
     case q: LogicalPlan =>   
       q transformExpressions { // run transformExpressions on the current LogicalPlan
         case u @ UnresolvedFunction(name, children) if u.childrenResolved => // an UnresolvedFunction was encountered
           registry.lookupFunction(name, children) // look up the UDF in the function registry
       }   
   }   
 }   
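
To make the rule's effect concrete, here is a small Spark-free sketch. All names in it (`Expr`, `Lit`, `UnresolvedFn`, `ResolvedFn`, `Builder`, `resolve`) are hypothetical. It also makes two simplifying assumptions that differ from the excerpt above: it rewrites bottom-up in a single pass, and it leaves an unknown function unresolved instead of failing on lookup.

```scala
object ResolveSketch {
  // Hypothetical, simplified expression tree.
  sealed trait Expr { def resolved: Boolean }
  final case class Lit(value: Any) extends Expr { val resolved = true }
  final case class UnresolvedFn(name: String, args: Seq[Expr]) extends Expr { val resolved = false }
  final case class ResolvedFn(name: String, args: Seq[Expr]) extends Expr { val resolved = true }

  // A builder turns already-resolved arguments into a concrete expression.
  type Builder = Seq[Expr] => Expr

  // Resolves arguments first, then replaces the call itself if the registry knows it.
  def resolve(e: Expr, registry: Map[String, Builder]): Expr = e match {
    case UnresolvedFn(name, args) =>
      val newArgs = args.map(resolve(_, registry))
      if (newArgs.forall(_.resolved) && registry.contains(name)) registry(name)(newArgs)
      else UnresolvedFn(name, newArgs) // unknown function: stays unresolved in this sketch
    case other => other
  }
}
```

With a registry that contains a builder for `len`, `resolve` turns `UnresolvedFn("len", Seq(Lit("abc")))` into the expression that builder returns.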

2. UDF Registration

2.1 UDFRegistration

registerFunction("len", (x:String)=>x.length)

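registerFunction is a method of UDFRegistration. SQLContext now mixes in the UDFRegistration trait, so importing SQLContext is all that is needed to use UDFs.

The core method of UDFRegistration is registerFunction:

Signature of registerFunction: def registerFunction[T: TypeTag](name: String, func: Function1[_, T]): Unit

It takes a udfName and a FunctionN, where FunctionN can be anything from Function1 to Function22, so a UDF supports only 1 to 22 parameters (a long-standing Scala pain point).

Internally, the builder constructs an Expression through ScalaUdf, which extends Expression (put simply, the current simple UDF is just a Catalyst Expression). The Scala function is passed in as the UDF implementation, and reflection is used to check that the function's result type is one that Catalyst allows; see ScalaReflection.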

 def registerFunction[T: TypeTag](name: String, func: Function1[_, T]): Unit = {
   // Build the Expression
   def builder(e: Seq[Expression]) = ScalaUdf(func, ScalaReflection.schemaFor(typeTag[T]).dataType, e)
   // Register with the SQLContext's functionRegistry (which keeps a HashMap of UDF mappings)
   functionRegistry.registerFunction(name, builder)
 }

2.2 Registering the Function

Note: FunctionBuilder here is a type alias: type FunctionBuilder = Seq[Expression] => Expression

 class SimpleFunctionRegistry extends FunctionRegistry {   
   val functionBuilders = new mutable.HashMap[String, FunctionBuilder]() // maintains the UDF mapping: udfName -> FunctionBuilder

   def registerFunction(name: String, builder: FunctionBuilder) = { // put the builder into the map
     functionBuilders.put(name, builder)
   }

   override def lookupFunction(name: String, children: Seq[Expression]): Expression = {
     functionBuilders(name)(children) // look up the UDF builder and apply it, returning an Expression
   }   
 }   

At this point a Scala function has been registered as a Catalyst Expression; this is Spark's simple UDF.
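
The whole registration path (wrap a function in a builder, store the builder by name, look it up later) can be sketched without Spark. Everything below is a hypothetical simplification: `Expr`, `Column`, `Udf1`, `SimpleRegistry`, and `register` are illustrative names, a row is modelled as a plain `Map`, and only one-argument functions are handled.

```scala
import scala.collection.mutable

object RegistrySketch {
  // Hypothetical expression type; a row is modelled as a Map from column name to value.
  trait Expr { def eval(row: Map[String, Any]): Any }

  final case class Column(name: String) extends Expr {
    def eval(row: Map[String, Any]): Any = row(name)
  }

  // Stand-in for ScalaUdf, limited to a single argument.
  final case class Udf1(f: Any => Any, child: Expr) extends Expr {
    def eval(row: Map[String, Any]): Any = f(child.eval(row))
  }

  type FunctionBuilder = Seq[Expr] => Expr

  final class SimpleRegistry {
    private val builders = mutable.HashMap.empty[String, FunctionBuilder]

    def registerFunction(name: String, builder: FunctionBuilder): Unit =
      builders.update(name, builder)

    // Throws NoSuchElementException for an unknown name, like a plain HashMap lookup.
    def lookupFunction(name: String, children: Seq[Expr]): Expr =
      builders(name)(children)
  }

  // Wraps a one-argument Scala function in a builder and registers it.
  // The builder assumes it is given at least one child expression.
  def register[A, R](registry: SimpleRegistry, name: String)(f: A => R): Unit =
    registry.registerFunction(name, children => Udf1(x => f(x.asInstanceOf[A]), children.head))
}
```

A possible use: create a `SimpleRegistry`, call `register[String, Int](reg, "len")(_.length)`, then `reg.lookupFunction("len", Seq(Column("name")))` yields an expression whose `eval` applies the function to the `name` column of a row.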

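3. UDF Evaluation

Since the UDF has been wrapped as an Expression node in the Catalyst tree, evaluating it simply means calling the eval method of ScalaUdf.

It first evaluates the arguments the function needs from the Row and the child expressions, and then invokes the function to compute the UDF result. Strictly speaking, the call goes through an asInstanceOf cast to the matching FunctionN type rather than through reflection.

ScalaUdf extends Expression:

ScalaUdf takes a function, a dataType, and a sequence of expressions.

It is fairly simple; the comments explain it: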

 case class ScalaUdf(function: AnyRef, dataType: DataType, children: Seq[Expression])   
   extends Expression {   
    
   type EvaluatedType = Any   
    
   def nullable = true   
    
   override def toString = s"scalaUDF(${children.mkString(",")})"   
   override def eval(input: Row): Any = {
     val result = children.size match {
       case 0 => function.asInstanceOf[() => Any]()
       case 1 => function.asInstanceOf[(Any) => Any](children(0).eval(input)) // cast to Function1 and call it
       case 2 =>
         function.asInstanceOf[(Any, Any) => Any](
           children(0).eval(input), // evaluate the argument expression
           children(1).eval(input))   
       case 3 =>   
         function.asInstanceOf[(Any, Any, Any) => Any](   
           children(0).eval(input),   
           children(1).eval(input),   
           children(2).eval(input))   
       case 4 =>   
      ......   
       case 22 => // Scala functions support at most 22 parameters, so every arity is enumerated here.
         function.asInstanceOf[(Any, Any, Any, Any, Any, Any, Any, Any, Any, Any, Any, Any, Any, Any, Any, Any, Any, Any, Any, Any, Any, Any) => Any](   
           children(0).eval(input),   
           children(1).eval(input),   
           children(2).eval(input),   
           children(3).eval(input),   
           children(4).eval(input),   
           children(5).eval(input),   
           children(6).eval(input),   
           children(7).eval(input),   
           children(8).eval(input),   
           children(9).eval(input),   
           children(10).eval(input),   
           children(11).eval(input),   
           children(12).eval(input),   
           children(13).eval(input),   
           children(14).eval(input),   
           children(15).eval(input),   
           children(16).eval(input),   
           children(17).eval(input),   
           children(18).eval(input),   
           children(19).eval(input),   
           children(20).eval(input),   
           children(21).eval(input))   
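
The arity dispatch can be reproduced in a few lines. The sketch below is not the Spark class: `Row`, `Expr`, `BoundRef`, and `ScalaUdfLike` are hypothetical simplifications, a row is a plain `Seq[Any]`, and only arities 0 to 2 are covered. It shows the same pattern of evaluating the children first and then casting the stored function to the FunctionN type that matches the number of arguments.

```scala
object EvalSketch {
  // Hypothetical row type: values addressed by position.
  type Row = Seq[Any]

  trait Expr { def eval(input: Row): Any }

  // Reads the value at a fixed position of the input row.
  final case class BoundRef(ordinal: Int) extends Expr {
    def eval(input: Row): Any = input(ordinal)
  }

  // Stores the function untyped, as ScalaUdf does, and picks the cast by arity.
  final case class ScalaUdfLike(function: AnyRef, children: Seq[Expr]) extends Expr {
    def eval(input: Row): Any = {
      val args = children.map(_.eval(input)) // evaluate argument expressions first
      children.size match {
        case 0 => function.asInstanceOf[() => Any]()
        case 1 => function.asInstanceOf[Any => Any](args(0))
        case 2 => function.asInstanceOf[(Any, Any) => Any](args(0), args(1))
        case n => throw new UnsupportedOperationException(s"arity $n is not covered by this sketch")
      }
    }
  }
}
```

For example, `ScalaUdfLike((s: String, n: Int) => s.take(n), Seq(BoundRef(0), BoundRef(1)))` evaluated on `Seq("spark", 2)` applies `take(2)` to the first column. Because the cast is unchecked, passing a value of the wrong type only fails when the function body runs.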

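4. Summary

Spark's current UDF is really just a Scala function. The Scala function is wrapped in a Catalyst Expression, and during SQL evaluation the same eval method is used to compute a result for the current input Row.

Writing a Spark UDF is very simple: give the UDF a function name and pass in a Scala function. Thanks to the expressiveness of functional programming in Scala, a Scala UDF is fairly easy to write and easier to understand than a Hive UDF.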


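Original article; please credit the source when reposting:

Reposted from: the blog of OopsOutOfMemory (盛利), author: OopsOutOfMemory

Article link: http://blog.csdn.net/oopsoom/article/details/39395641

Note: This article is licensed under Attribution-NonCommercial-NoDerivs 2.5 China Mainland (CC BY-NC-ND 2.5 CN). Reposting, forwarding, and comments are welcome, but please keep the author attribution and the article link. For commercial use or licensing questions, please contact me.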
