[Spark][Python][DataFrame][Write] An Example of Writing a DataFrame

$ hdfs dfs -cat people.json

{"name":"Alice","pcode":"94304"}

{"name":"Brayden","age":30,"pcode":"94304"}

{"name":"Carla","age":19,"pcoe":"10036"}

{"name":"Diana","age":46}

{"name":"Etienne","pcode":"94104"}

$ pyspark

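from pyspark.sql import HiveContext

# Use a HiveContext so that saveAsTable registers the table in the Hive metastore
sqlContext = HiveContext(sc)

# Read the JSON file from HDFS into a DataFrame
peopleDF = sqlContext.read.json("people.json")

# Append the DataFrame to the Parquet table "people", partitioned by age
peopleDF.write.format("parquet").mode("append").partitionBy("age").saveAsTable("people")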
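
The `read.json` call above has to derive a single schema from records that do not all carry the same keys. The following is a minimal pure-Python sketch of that idea, not Spark's actual inference code: it takes the union of all keys and fills missing values with `None`. The helper names `infer_columns` and `to_rows` are hypothetical, and sorting the column names is an assumption made for determinism. It also shows why the `pcoe` key in Carla's record (which appears to be a typo for `pcode` in the input data) ends up as a separate column.

```python
import json

# The five records from people.json, copied from the listing above.
LINES = [
    '{"name":"Alice","pcode":"94304"}',
    '{"name":"Brayden","age":30,"pcode":"94304"}',
    '{"name":"Carla","age":19,"pcoe":"10036"}',
    '{"name":"Diana","age":46}',
    '{"name":"Etienne","pcode":"94104"}',
]


def infer_columns(lines):
    """Return the sorted union of keys seen in any record (simplified inference)."""
    columns = set()
    for line in lines:
        columns.update(json.loads(line).keys())
    return sorted(columns)


def to_rows(lines, columns):
    """Expand every record to the full column list, using None for missing keys."""
    return [tuple(json.loads(line).get(c) for c in columns) for line in lines]


columns = infer_columns(LINES)
rows = to_rows(LINES, columns)

print(columns)
for row in rows:
    print(row)
```

Every record that lacks a key simply gets a null in that column, which matches the `null` entries visible when the table is read back.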

  
 
17/10/07 00:58:18 INFO storage.MemoryStore: Block broadcast_2 stored as values in memory (estimated size 65.5 KB, free 338.2 KB)
 17/10/07 00:58:18 INFO storage.MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 21.4 KB, free 359.6 KB)
 17/10/07 00:58:18 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on localhost:59616 (size: 21.4 KB, free: 208.8 MB)
 17/10/07 00:58:18 INFO spark.SparkContext: Created broadcast 2 from saveAsTable at NativeMethodAccessorImpl.java:-2
 17/10/07 00:58:18 INFO storage.MemoryStore: Block broadcast_3 stored as values in memory (estimated size 251.1 KB, free 610.7 KB)
 17/10/07 00:58:18 INFO storage.MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 21.6 KB, free 632.4 KB)
 17/10/07 00:58:18 INFO storage.BlockManagerInfo: Added broadcast_3_piece0 in memory on localhost:59616 (size: 21.6 KB, free: 208.7 MB)
 17/10/07 00:58:18 INFO spark.SparkContext: Created broadcast 3 from saveAsTable at NativeMethodAccessorImpl.java:-2
 17/10/07 00:58:19 INFO parquet.ParquetRelation: Using default output committer for Parquet: parquet.hadoop.ParquetOutputCommitter
 17/10/07 00:58:19 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
 17/10/07 00:58:19 INFO datasources.DynamicPartitionWriterContainer: Using user defined output committer class parquet.hadoop.ParquetOutputCommitter
 17/10/07 00:58:19 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
 17/10/07 00:58:19 INFO mapred.FileInputFormat: Total input paths to process : 1
 17/10/07 00:58:19 INFO spark.SparkContext: Starting job: saveAsTable at NativeMethodAccessorImpl.java:-2
 17/10/07 00:58:19 INFO scheduler.DAGScheduler: Got job 1 (saveAsTable at NativeMethodAccessorImpl.java:-2) with 1 output partitions
 17/10/07 00:58:19 INFO scheduler.DAGScheduler: Final stage: ResultStage 1 (saveAsTable at NativeMethodAccessorImpl.java:-2)
 17/10/07 00:58:19 INFO scheduler.DAGScheduler: Parents of final stage: List()
 17/10/07 00:58:19 INFO scheduler.DAGScheduler: Missing parents: List()
 17/10/07 00:58:19 INFO scheduler.DAGScheduler: Submitting ResultStage 1 (MapPartitionsRDD[7] at saveAsTable at NativeMethodAccessorImpl.java:-2), which has no missing parents
 17/10/07 00:58:19 INFO storage.MemoryStore: Block broadcast_4 stored as values in memory (estimated size 72.7 KB, free 705.0 KB)
 17/10/07 00:58:20 INFO storage.MemoryStore: Block broadcast_4_piece0 stored as bytes in memory (estimated size 26.4 KB, free 731.4 KB)
 17/10/07 00:58:20 INFO storage.BlockManagerInfo: Added broadcast_4_piece0 in memory on localhost:59616 (size: 26.4 KB, free: 208.7 MB)
 17/10/07 00:58:20 INFO spark.SparkContext: Created broadcast 4 from broadcast at DAGScheduler.scala:1006
 17/10/07 00:58:20 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 1 (MapPartitionsRDD[7] at saveAsTable at NativeMethodAccessorImpl.java:-2)
 17/10/07 00:58:20 INFO scheduler.TaskSchedulerImpl: Adding task set 1.0 with 1 tasks
 17/10/07 00:58:20 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 1.0 (TID 1, localhost, partition 0,PROCESS_LOCAL, 2149 bytes)
 17/10/07 00:58:20 INFO executor.Executor: Running task 0.0 in stage 1.0 (TID 1)
 17/10/07 00:58:20 INFO rdd.HadoopRDD: Input split: hdfs://localhost:8020/user/training/people.json:0+179
 17/10/07 00:58:20 INFO codegen.GenerateUnsafeProjection: Code generated in 314.888218 ms
 17/10/07 00:58:20 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
 17/10/07 00:58:20 INFO datasources.DynamicPartitionWriterContainer: Using user defined output committer class parquet.hadoop.ParquetOutputCommitter
 17/10/07 00:58:20 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
 17/10/07 00:58:20 INFO codegen.GenerateUnsafeProjection: Code generated in 46.978197 ms
 17/10/07 00:58:20 INFO codegen.GenerateUnsafeProjection: Code generated in 64.665839 ms
 17/10/07 00:58:21 INFO codegen.GenerateUnsafeProjection: Code generated in 94.259071 ms
 17/10/07 00:58:21 INFO codec.CodecConfig: Compression: GZIP
 17/10/07 00:58:21 INFO hadoop.ParquetOutputFormat: Parquet block size to 134217728
 17/10/07 00:58:21 INFO hadoop.ParquetOutputFormat: Parquet page size to 1048576
 17/10/07 00:58:21 INFO hadoop.ParquetOutputFormat: Parquet dictionary page size to 1048576
 17/10/07 00:58:21 INFO hadoop.ParquetOutputFormat: Dictionary is on
 17/10/07 00:58:21 INFO hadoop.ParquetOutputFormat: Validation is off
 17/10/07 00:58:21 INFO hadoop.ParquetOutputFormat: Writer version is: PARQUET_1_0
 17/10/07 00:58:21 INFO hadoop.ParquetOutputFormat: Maximum row group padding size is 8388608 bytes
 17/10/07 00:58:21 INFO parquet.CatalystWriteSupport: Initialized Parquet WriteSupport with Catalyst schema:
 {
 "type" : "struct",
 "fields" : [ {
 "name" : "name",
 "type" : "string",
 "nullable" : true,
 "metadata" : { }
 }, {
 "name" : "pcode",
 "type" : "string",
 "nullable" : true,
 "metadata" : { }
 }, {
 "name" : "pcoe",
 "type" : "string",
 "nullable" : true,
 "metadata" : { }
 } ]
 }
 and corresponding Parquet message type:
 message spark_schema {
 optional binary name (UTF8);
 optional binary pcode (UTF8);
 optional binary pcoe (UTF8);
 }

 17/10/07 00:58:21 INFO compress.CodecPool: Got brand-new compressor [.gz]
 17/10/07 00:58:21 INFO datasources.DynamicPartitionWriterContainer: Maximum partitions reached, falling back on sorting.
 17/10/07 00:58:21 INFO codegen.GenerateUnsafeProjection: Code generated in 34.281133 ms
 17/10/07 00:58:21 INFO codegen.GenerateOrdering: Code generated in 85.573905 ms
 17/10/07 00:58:21 INFO datasources.DynamicPartitionWriterContainer: Sorting complete. Writing out partition files one at a time.
 17/10/07 00:58:21 INFO hadoop.InternalParquetRecordWriter: Flushing mem columnStore to file. allocated memory: 54
 SLF4J: Class path contains multiple SLF4J bindings.
 SLF4J: Found binding in [jar:file:/usr/lib/parquet/lib/parquet-hadoop-bundle-1.5.0-cdh5.7.0.jar!/shaded/parquet/org/slf4j/impl/StaticLoggerBinder.class]
 SLF4J: Found binding in [jar:file:/usr/lib/parquet/lib/parquet-pig-bundle-1.5.0-cdh5.7.0.jar!/shaded/parquet/org/slf4j/impl/StaticLoggerBinder.class]
 SLF4J: Found binding in [jar:file:/usr/lib/parquet/lib/parquet-format-2.1.0-cdh5.7.0.jar!/shaded/parquet/org/slf4j/impl/StaticLoggerBinder.class]
 SLF4J: Found binding in [jar:file:/usr/lib/hive/lib/hive-jdbc-1.1.0-cdh5.7.0-standalone.jar!/shaded/parquet/org/slf4j/impl/StaticLoggerBinder.class]
 SLF4J: Found binding in [jar:file:/usr/lib/hive/lib/hive-exec-1.1.0-cdh5.7.0.jar!/shaded/parquet/org/slf4j/impl/StaticLoggerBinder.class]
 SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
 SLF4J: Actual binding is of type [shaded.parquet.org.slf4j.helpers.NOPLoggerFactory]
 17/10/07 00:58:21 INFO hadoop.ColumnChunkPageWriteStore: written 80B for [name] BINARY: 2 values, 26B raw, 43B comp, 1 pages, encodings: [RLE, BIT_PACKED, PLAIN]
 17/10/07 00:58:21 INFO hadoop.ColumnChunkPageWriteStore: written 73B for [pcode] BINARY: 2 values, 24B raw, 38B comp, 1 pages, encodings: [RLE, BIT_PACKED, PLAIN]
 17/10/07 00:58:21 INFO hadoop.ColumnChunkPageWriteStore: written 47B for [pcoe] BINARY: 2 values, 6B raw, 26B comp, 1 pages, encodings: [RLE, BIT_PACKED, PLAIN]
 17/10/07 00:58:22 INFO codec.CodecConfig: Compression: GZIP
 17/10/07 00:58:22 INFO hadoop.ParquetOutputFormat: Parquet block size to 134217728
 17/10/07 00:58:22 INFO hadoop.ParquetOutputFormat: Parquet page size to 1048576
 17/10/07 00:58:22 INFO hadoop.ParquetOutputFormat: Parquet dictionary page size to 1048576
 17/10/07 00:58:22 INFO hadoop.ParquetOutputFormat: Dictionary is on
 17/10/07 00:58:22 INFO hadoop.ParquetOutputFormat: Validation is off
 17/10/07 00:58:22 INFO hadoop.ParquetOutputFormat: Writer version is: PARQUET_1_0
 17/10/07 00:58:22 INFO hadoop.ParquetOutputFormat: Maximum row group padding size is 8388608 bytes
 17/10/07 00:58:22 INFO parquet.CatalystWriteSupport: Initialized Parquet WriteSupport with Catalyst schema:
 {
 "type" : "struct",
 "fields" : [ {
 "name" : "name",
 "type" : "string",
 "nullable" : true,
 "metadata" : { }
 }, {
 "name" : "pcode",
 "type" : "string",
 "nullable" : true,
 "metadata" : { }
 }, {
 "name" : "pcoe",
 "type" : "string",
 "nullable" : true,
 "metadata" : { }
 } ]
 }
 and corresponding Parquet message type:
 message spark_schema {
 optional binary name (UTF8);
 optional binary pcode (UTF8);
 optional binary pcoe (UTF8);
 }

 17/10/07 00:58:22 INFO compress.CodecPool: Got brand-new compressor [.gz]
 17/10/07 00:58:22 INFO hadoop.InternalParquetRecordWriter: Flushing mem columnStore to file. allocated memory: 26
 17/10/07 00:58:22 INFO hadoop.ColumnChunkPageWriteStore: written 68B for [name] BINARY: 1 values, 15B raw, 33B comp, 1 pages, encodings: [RLE, BIT_PACKED, PLAIN]
 17/10/07 00:58:22 INFO hadoop.ColumnChunkPageWriteStore: written 47B for [pcode] BINARY: 1 values, 6B raw, 26B comp, 1 pages, encodings: [RLE, BIT_PACKED, PLAIN]
 17/10/07 00:58:22 INFO hadoop.ColumnChunkPageWriteStore: written 68B for [pcoe] BINARY: 1 values, 15B raw, 33B comp, 1 pages, encodings: [RLE, BIT_PACKED, PLAIN]
 17/10/07 00:58:22 INFO codec.CodecConfig: Compression: GZIP
 17/10/07 00:58:22 INFO hadoop.ParquetOutputFormat: Parquet block size to 134217728
 17/10/07 00:58:22 INFO hadoop.ParquetOutputFormat: Parquet page size to 1048576
 17/10/07 00:58:22 INFO hadoop.ParquetOutputFormat: Parquet dictionary page size to 1048576
 17/10/07 00:58:22 INFO hadoop.ParquetOutputFormat: Dictionary is on
 17/10/07 00:58:22 INFO hadoop.ParquetOutputFormat: Validation is off
 17/10/07 00:58:22 INFO hadoop.ParquetOutputFormat: Writer version is: PARQUET_1_0
 17/10/07 00:58:22 INFO hadoop.ParquetOutputFormat: Maximum row group padding size is 8388608 bytes
 17/10/07 00:58:22 INFO parquet.CatalystWriteSupport: Initialized Parquet WriteSupport with Catalyst schema:
 {
 "type" : "struct",
 "fields" : [ {
 "name" : "name",
 "type" : "string",
 "nullable" : true,
 "metadata" : { }
 }, {
 "name" : "pcode",
 "type" : "string",
 "nullable" : true,
 "metadata" : { }
 }, {
 "name" : "pcoe",
 "type" : "string",
 "nullable" : true,
 "metadata" : { }
 } ]
 }
 and corresponding Parquet message type:
 message spark_schema {
 optional binary name (UTF8);
 optional binary pcode (UTF8);
 optional binary pcoe (UTF8);
 }

 17/10/07 00:58:22 INFO compress.CodecPool: Got brand-new compressor [.gz]
 17/10/07 00:58:22 INFO hadoop.InternalParquetRecordWriter: Flushing mem columnStore to file. allocated memory: 28
 17/10/07 00:58:22 INFO hadoop.ColumnChunkPageWriteStore: written 74B for [name] BINARY: 1 values, 17B raw, 35B comp, 1 pages, encodings: [RLE, BIT_PACKED, PLAIN]
 17/10/07 00:58:22 INFO hadoop.ColumnChunkPageWriteStore: written 68B for [pcode] BINARY: 1 values, 15B raw, 33B comp, 1 pages, encodings: [RLE, BIT_PACKED, PLAIN]
 17/10/07 00:58:22 INFO hadoop.ColumnChunkPageWriteStore: written 47B for [pcoe] BINARY: 1 values, 6B raw, 26B comp, 1 pages, encodings: [RLE, BIT_PACKED, PLAIN]
 17/10/07 00:58:22 INFO storage.BlockManagerInfo: Removed broadcast_2_piece0 on localhost:59616 in memory (size: 21.4 KB, free: 208.7 MB)
 17/10/07 00:58:22 INFO codec.CodecConfig: Compression: GZIP
 17/10/07 00:58:22 INFO hadoop.ParquetOutputFormat: Parquet block size to 134217728
 17/10/07 00:58:22 INFO hadoop.ParquetOutputFormat: Parquet page size to 1048576
 17/10/07 00:58:22 INFO hadoop.ParquetOutputFormat: Parquet dictionary page size to 1048576
 17/10/07 00:58:22 INFO hadoop.ParquetOutputFormat: Dictionary is on
 17/10/07 00:58:22 INFO hadoop.ParquetOutputFormat: Validation is off
 17/10/07 00:58:22 INFO hadoop.ParquetOutputFormat: Writer version is: PARQUET_1_0
 17/10/07 00:58:22 INFO hadoop.ParquetOutputFormat: Maximum row group padding size is 8388608 bytes
 17/10/07 00:58:22 INFO parquet.CatalystWriteSupport: Initialized Parquet WriteSupport with Catalyst schema:
 {
 "type" : "struct",
 "fields" : [ {
 "name" : "name",
 "type" : "string",
 "nullable" : true,
 "metadata" : { }
 }, {
 "name" : "pcode",
 "type" : "string",
 "nullable" : true,
 "metadata" : { }
 }, {
 "name" : "pcoe",
 "type" : "string",
 "nullable" : true,
 "metadata" : { }
 } ]
 }
 and corresponding Parquet message type:
 message spark_schema {
 optional binary name (UTF8);
 optional binary pcode (UTF8);
 optional binary pcoe (UTF8);
 }

 17/10/07 00:58:22 INFO compress.CodecPool: Got brand-new compressor [.gz]
 17/10/07 00:58:22 INFO hadoop.InternalParquetRecordWriter: Flushing mem columnStore to file. allocated memory: 13
 17/10/07 00:58:22 INFO hadoop.ColumnChunkPageWriteStore: written 68B for [name] BINARY: 1 values, 15B raw, 33B comp, 1 pages, encodings: [RLE, BIT_PACKED, PLAIN]
 17/10/07 00:58:22 INFO hadoop.ColumnChunkPageWriteStore: written 47B for [pcode] BINARY: 1 values, 6B raw, 26B comp, 1 pages, encodings: [RLE, BIT_PACKED, PLAIN]
 17/10/07 00:58:22 INFO hadoop.ColumnChunkPageWriteStore: written 47B for [pcoe] BINARY: 1 values, 6B raw, 26B comp, 1 pages, encodings: [RLE, BIT_PACKED, PLAIN]
 17/10/07 00:58:22 INFO output.FileOutputCommitter: Saved output of task 'attempt_201710070058_0001_m_000000_0' to hdfs://localhost:8020/user/hive/warehouse/people/_temporary/0/task_201710070058_0001_m_000000
 17/10/07 00:58:22 INFO mapred.SparkHadoopMapRedUtil: attempt_201710070058_0001_m_000000_0: Committed
 17/10/07 00:58:22 INFO executor.Executor: Finished task 0.0 in stage 1.0 (TID 1). 2057 bytes result sent to driver
 17/10/07 00:58:22 INFO scheduler.DAGScheduler: ResultStage 1 (saveAsTable at NativeMethodAccessorImpl.java:-2) finished in 2.797 s
 17/10/07 00:58:22 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 1.0 (TID 1) in 2797 ms on localhost (1/1)
 17/10/07 00:58:22 INFO scheduler.DAGScheduler: Job 1 finished: saveAsTable at NativeMethodAccessorImpl.java:-2, took 3.236619 s
 17/10/07 00:58:22 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool 
 17/10/07 00:58:23 INFO hadoop.ParquetFileReader: Initiating action with parallelism: 5
 17/10/07 00:58:23 INFO datasources.DynamicPartitionWriterContainer: Job job_201710070058_0000 committed.
 17/10/07 00:58:23 INFO parquet.ParquetRelation: Listing hdfs://localhost:8020/user/hive/warehouse/people on driver
 17/10/07 00:58:23 INFO parquet.ParquetRelation: Listing hdfs://localhost:8020/user/hive/warehouse/people/age=19 on driver
 17/10/07 00:58:23 INFO parquet.ParquetRelation: Listing hdfs://localhost:8020/user/hive/warehouse/people/age=30 on driver
 17/10/07 00:58:23 INFO parquet.ParquetRelation: Listing hdfs://localhost:8020/user/hive/warehouse/people/age=46 on driver
 17/10/07 00:58:23 INFO parquet.ParquetRelation: Listing hdfs://localhost:8020/user/hive/warehouse/people/age=__HIVE_DEFAULT_PARTITION__ on driver
 17/10/07 00:58:23 INFO parquet.ParquetRelation: Listing hdfs://localhost:8020/user/hive/warehouse/people on driver
 17/10/07 00:58:23 INFO parquet.ParquetRelation: Listing hdfs://localhost:8020/user/hive/warehouse/people/age=19 on driver
 17/10/07 00:58:23 INFO parquet.ParquetRelation: Listing hdfs://localhost:8020/user/hive/warehouse/people/age=30 on driver
 17/10/07 00:58:23 INFO parquet.ParquetRelation: Listing hdfs://localhost:8020/user/hive/warehouse/people/age=46 on driver
 17/10/07 00:58:23 INFO parquet.ParquetRelation: Listing hdfs://localhost:8020/user/hive/warehouse/people/age=__HIVE_DEFAULT_PARTITION__ on driver
 17/10/07 00:58:24 WARN hive.HiveContext$$anon$2: Persisting partitioned data source relation `people` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive. Input path(s): 
 hdfs://localhost:8020/user/hive/warehouse/people
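
The log above lists four partition directories under the table path: `age=19`, `age=30`, `age=46` and `age=__HIVE_DEFAULT_PARTITION__`. As a rough illustration of how `partitionBy("age")` arrives at that layout, here is a pure-Python sketch under simplifying assumptions; it is not Spark's writer. The helper names `partition_dir` and `group_by_partition` are hypothetical. Records with no `age` go to the default partition, and the partition column itself is dropped from the data that would be stored in the files.

```python
from collections import defaultdict

# Directory name used for rows whose partition value is null (as seen in the log).
HIVE_DEFAULT_PARTITION = "__HIVE_DEFAULT_PARTITION__"

# The records from people.json, as Python dicts.
PEOPLE = [
    {"name": "Alice", "pcode": "94304"},
    {"name": "Brayden", "age": 30, "pcode": "94304"},
    {"name": "Carla", "age": 19, "pcoe": "10036"},
    {"name": "Diana", "age": 46},
    {"name": "Etienne", "pcode": "94104"},
]


def partition_dir(record, key):
    """Build the 'key=value' directory name for one record."""
    value = record.get(key)
    return "{}={}".format(key, HIVE_DEFAULT_PARTITION if value is None else value)


def group_by_partition(records, key):
    """Group records by partition directory, dropping the partition column."""
    groups = defaultdict(list)
    for record in records:
        data = {k: v for k, v in record.items() if k != key}
        groups[partition_dir(record, key)].append(data)
    return dict(groups)


layout = group_by_partition(PEOPLE, "age")
for directory in sorted(layout):
    print(directory, [r["name"] for r in layout[directory]])
```

This is consistent with the log: one Parquet file receives two values per column (the two people without an age), and the other three receive one value each.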

  
 

[training@localhost ~]$ hive

hive>
> show tables like 'people';
OK
people
Time taken: 5.046 seconds, Fetched: 1 row(s)
hive>

sqlContext = HiveContext(sc)
newPeopleDF = sqlContext.read.table("people")

newPeopleDF.limit(5).show()

+-------+-----+-----+----+
|   name|pcode| pcoe| age|
+-------+-----+-----+----+
|Brayden|94304| null|  30|
|  Diana| null| null|  46|
|  Carla| null|10036|  19|
|  Alice|94304| null|null|
|Etienne|94104| null|null|
+-------+-----+-----+----+

As shown above, the DataFrame read from the JSON file was indeed written to a Parquet-format table named people.

Reading that table again through HiveContext then retrieves and displays its data.
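
In the table output, `age` is the last column and is `null` for Alice and Etienne. One plausible reading is that the partition column is not stored in the data files but is recovered from the directory names when the table is read. The sketch below illustrates that idea in pure Python under simplifying assumptions; it is not Spark's partition discovery. The `STORED` mapping is a hypothetical in-memory stand-in for the Parquet files, and `parse_partition_dir` and `read_table` are hypothetical helper names.

```python
# Directory name that stands for a null partition value (taken from the log).
HIVE_DEFAULT_PARTITION = "__HIVE_DEFAULT_PARTITION__"

# Hypothetical stand-in for the files on disk: directory -> (name, pcode, pcoe) rows.
STORED = {
    "age=30": [("Brayden", "94304", None)],
    "age=46": [("Diana", None, None)],
    "age=19": [("Carla", None, "10036")],
    "age=__HIVE_DEFAULT_PARTITION__": [
        ("Alice", "94304", None),
        ("Etienne", "94104", None),
    ],
}


def parse_partition_dir(dirname):
    """Split 'key=value' and convert the value: default -> None, digits -> int."""
    key, _, raw = dirname.partition("=")
    if raw == HIVE_DEFAULT_PARTITION:
        return key, None
    try:
        return key, int(raw)
    except ValueError:
        return key, raw


def read_table(stored):
    """Append the partition value, parsed from the directory name, to each row."""
    rows = []
    for dirname, data_rows in stored.items():
        _, value = parse_partition_dir(dirname)
        for data in data_rows:
            rows.append(data + (value,))
    return rows


table = read_table(STORED)
for row in table:
    print(row)
```

Under this reading, the column order `name, pcode, pcoe, age` follows from appending the partition value after the stored data columns.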

This article is reposted from the blog 健哥的数据花园 on cnblogs (博客园). Original link: http://www.cnblogs.com/gaojian/p/dataframe_write.html. To repost it, please contact the original author.
