spark-2.0-从RDD到DataSet

简介: DataSet API和DataFrame两者结合起来,DataSet中许多的API模仿了RDD的API,实现不太一样,但是基于RDD的代码很容易移植过来。 spark未来基本是要在DataSet上扩展了,因为spark基于spark core关注的东西很多,整合内部代码是必然的。

DataSet API和DataFrame两者结合起来,DataSet中许多的API模仿了RDD的API,实现不太一样,但是基于RDD的代码很容易移植过来。
spark未来基本是要在DataSet上扩展了,因为spark基于spark core关注的东西很多,整合内部代码是必然的。
1、加载文件

val rdd = sparkContext.textFile("./data.txt")
val ds = sparkSession.read.text("./data.txt")

2、计算总数

rdd.count()

ds.count()

3、wordcount实例

val wordsRDD = rdd.flatMap(value => value.split("\\s+"))
val wordsPairs = wordsRDD.map(word => (word,1))
val wordCount = wordsPairs.reduceByKey(_+_)
import sparkSession.implicits._
val wordsDs = ds.flatMap(value => value.split("\\s+"))
val wordsPairDs = wordsDs.groupByKey(value => value)
val wordCounts = wordsPairDs.count()

4、缓存

rdd.cache()

ds.cache()

5、过滤

val filterRDD = wordsRDD.filter(value => value=="hello")

val filterDs = wordsDs.filter(value => value = "hello")

6、map partition

val mapPartitionsRDD = rdd.mapPartitions(iterator => List(iterator.count(value=>true)).iterator)

val mapPartitionsDs = ds.mapPartitions(iterator => List(iterator.count(value=>true)).iterator)

7 、reduceByKey

val reduceCountByRDD = wordsPair.reduceByKey(_+_)

val reduceCountByDs = wordsPairDs.mapGroups((key,values) =>(key,values.length))

8、RDD和 DataSet互换

val dsToRDD = ds.rdd
val rddStringToRowRDD = rdd.map(value => Row(value))
val dfschema = StructType(Array(StructField("value",StringType)))
val rddToDF = sparkSession.createDataFrame(rddStringToRowRDD,dfschema)
val rDDToDataSet = rddToDF.as[String]

9、double

val doubleRDD = sparkContext.makeRDD(List(1.0,5.0,8.9,9.0))
val rddSum =doubleRDD.sum()
val rddMean = doubleRDD.mean()
val rowRDD = doubleRDD.map(value => Row.fromSeq(List(value)))
val schema = StructType(Array(StructField("value",DoubleType)))
val doubleDS = sparkSession.createDataFrame(rowRDD,schema)
import org.apache.spark.sql.functions._
doubleDS.agg(sum("value"))
doubleDS.agg(mean("value"))

10、reduce

val rddReduce = doubleRDD.reduce((a,b) => a +b)
val dsReduce = doubleDS.reduce((row1,row2) =>Row(row1.getDouble(0) + row2.getDouble(0)))

code

import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}
import org.apache.spark.sql.{Row, SparkSession}

object RDDToDataSet {

  def main(args: Array[String]) {

    val sparkSession = SparkSession.builder.master("local")
                                           .appName("example")
                                           .getOrCreate()
    val sparkContext = sparkSession.sparkContext
    //read data from text file
    val rdd = sparkContext.textFile("src/main/resources/data.txt")
    val ds = sparkSession.read.text("src/main/resources/data.txt")

    // do count
    println("count ")
    println(rdd.count())
    println(ds.count())

    // wordcount
    println(" wordcount ")

    val wordsRDD = rdd.flatMap(value => value.split("\\s+"))
    val wordsPair = wordsRDD.map(word => (word,1))
    val wordCount = wordsPair.reduceByKey(_+_)
    println(wordCount.collect.toList)

    import sparkSession.implicits._
    val wordsDs = ds.flatMap(value => value.split("\\s+"))
    val wordsPairDs = wordsDs.groupByKey(value => value)
    val wordCountDs = wordsPairDs.count
    wordCountDs.show()

    //cache
    rdd.cache()
    ds.cache()

    //filter

    val filteredRDD = wordsRDD.filter(value => value =="hello")
    println(filteredRDD.collect().toList)

    val filteredDS = wordsDs.filter(value => value =="hello")
    filteredDS.show()


    //map partitions

    val mapPartitionsRDD = rdd.mapPartitions(iterator => 
    List(iterator.count(value => true)).iterator)
    println(s" the count each partition is ${mapPartitionsRDD.collect().toList}")

    val mapPartitionsDs = ds.mapPartitions(iterator => 
    List(iterator.count(value => true)).iterator)
    mapPartitionsDs.show()

    //converting to each other
    val dsToRDD = ds.rdd
    println(dsToRDD.collect())

    val rddStringToRowRDD = rdd.map(value => Row(value))
    val dfschema = StructType(Array(StructField("value",StringType)))
    val rddToDF = sparkSession.createDataFrame(rddStringToRowRDD,dfschema)
    val rDDToDataSet = rddToDF.as[String]
    rDDToDataSet.show()

    // double based operation

    val doubleRDD = sparkContext.makeRDD(List(1.0,5.0,8.9,9.0))
    val rddSum =doubleRDD.sum()
    val rddMean = doubleRDD.mean()

    println(s"sum is $rddSum")
    println(s"mean is $rddMean")

    val rowRDD = doubleRDD.map(value => Row.fromSeq(List(value)))
    val schema = StructType(Array(StructField("value",DoubleType)))
    val doubleDS = sparkSession.createDataFrame(rowRDD,schema)

    import org.apache.spark.sql.functions._
    doubleDS.agg(sum("value")).show()
    doubleDS.agg(mean("value")).show()

    //reduceByKey API
    val reduceCountByRDD = wordsPair.reduceByKey(_+_)
    val reduceCountByDs = wordsPairDs.mapGroups((key,values) =>(key,values.length))

    println(reduceCountByRDD.collect().toList)
    println(reduceCountByDs.collect().toList)

    //reduce function
    val rddReduce = doubleRDD.reduce((a,b) => a +b)
    val dsReduce = doubleDS.reduce((row1,row2) =>
    Row(row1.getDouble(0) + row2.getDouble(0)))

    println("rdd reduce is " +rddReduce +" dataset reduce "+dsReduce)

  }

}
目录
相关文章
|
1月前
|
分布式计算 并行计算 大数据
Spark学习---day02、Spark核心编程(RDD概述、RDD编程(创建、分区规则、转换算子、Action算子))(一)
Spark学习---day02、Spark核心编程 RDD概述、RDD编程(创建、分区规则、转换算子、Action算子))(一)
70 1
|
4月前
|
分布式计算 大数据 Scala
【大数据技术Hadoop+Spark】Spark RDD创建、操作及词频统计、倒排索引实战(超详细 附源码)
【大数据技术Hadoop+Spark】Spark RDD创建、操作及词频统计、倒排索引实战(超详细 附源码)
89 1
|
1月前
|
分布式计算 Java Scala
Spark学习---day03、Spark核心编程(RDD概述、RDD编程(创建、分区规则、转换算子、Action算子))(二)
Spark学习---day03、Spark核心编程(RDD概述、RDD编程(创建、分区规则、转换算子、Action算子))(二)
41 1
|
1月前
|
分布式计算 Spark
Spark【Spark学习大纲】简介+生态+RDD+安装+使用(xmind分享)
【2月更文挑战第14天】Spark【Spark学习大纲】简介+生态+RDD+安装+使用(xmind分享)
30 1
|
1月前
|
分布式计算 Hadoop Java
Spark【基础知识 03】【RDD常用算子详解】(图片来源于网络)
【2月更文挑战第14天】Spark【基础知识 03】【RDD常用算子详解】(图片来源于网络)
56 1
|
1月前
|
存储 缓存 分布式计算
Spark学习--day04、RDD依赖关系、RDD持久化、RDD分区器、RDD文件读取与保存
Spark学习--day04、RDD依赖关系、RDD持久化、RDD分区器、RDD文件读取与保存
39 1
|
2月前
|
分布式计算 并行计算 Hadoop
Spark学习---day02、Spark核心编程(RDD概述、RDD编程(创建、分区规则、转换算子、Action算子))(一)
Spark学习---day02、Spark核心编程 RDD概述、RDD编程(创建、分区规则、转换算子、Action算子))(一)
41 1
|
2月前
|
分布式计算 大数据 Java
Spark 大数据实战:基于 RDD 的大数据处理分析
Spark 大数据实战:基于 RDD 的大数据处理分析
120 0
|
3月前
|
缓存 分布式计算 监控
Spark RDD操作性能优化技巧
Spark RDD操作性能优化技巧
|
3月前
|
分布式计算 监控 大数据
Spark RDD分区和数据分布:优化大数据处理
Spark RDD分区和数据分布:优化大数据处理