Hadoop Example: RandomWriter (Alibaba Cloud Developer Community)

Reference: http://www.hadooper.cn/dct/page/65778

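1. Overview

The RandomWriter example uses Map/Reduce to write random data to the DFS. Each map takes a single file name as input and then writes random BytesWritable keys and values to a DFS sequence file. The maps emit nothing for a reduce phase, so no reduce is run. The generated data is configurable through the following variables: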

Name	Default	Description
test.randomwriter.maps_per_host	10	Number of map tasks run per node
test.randomwrite.bytes_per_map	1073741824	Number of bytes produced by each map task
test.randomwrite.min_key	10	Minimum size of the key, in bytes
test.randomwrite.max_key	1000	Maximum size of the key, in bytes
test.randomwrite.min_value	0	Minimum size of the value, in bytes
test.randomwrite.max_value	20000	Maximum size of the value, in bytes

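test.randomwriter.maps_per_host is the number of map tasks run on each worker node (datanode). By default, with a single worker node, there are 10 maps and each map produces 1 GB, so 10 GB of data is written to HDFS. My test environment has only 2 worker nodes, and I want each worker node to run just 1 map task.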
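The arithmetic behind those numbers can be sketched in plain Java. This is only my reading of run() in the listing in section 2, rewritten without any Hadoop dependency; the class and method names (RandomWriterSizing, defaultTotalBytes, numMaps) are made up for illustration and are not part of Hadoop.

```java
// Sketch of the sizing arithmetic in RandomWriter.run(); plain Java, no Hadoop needed.
final class RandomWriterSizing {

    /** Total bytes to write when test.randomwrite.total_bytes is not set. */
    static long defaultTotalBytes(int mapsPerHost, long bytesPerMap, int taskTrackers) {
        return (long) mapsPerHost * bytesPerMap * taskTrackers;
    }

    /** Number of map tasks: total / per-map, but at least 1 when there is data to write. */
    static int numMaps(long totalBytes, long bytesPerMap) {
        int n = (int) (totalBytes / bytesPerMap);
        return (n == 0 && totalBytes > 0) ? 1 : n;
    }

    public static void main(String[] args) {
        long oneGiB = 1L << 30;

        // Stock defaults on a single worker node: 10 maps per host, 1 GiB per map.
        long stock = defaultTotalBytes(10, oneGiB, 1);
        System.out.println(numMaps(stock, oneGiB) + " maps, " + stock + " bytes");

        // My setup: 2 worker nodes, 1 map per host.
        long mine = defaultTotalBytes(1, oneGiB, 2);
        System.out.println(numMaps(mine, oneGiB) + " maps, " + mine + " bytes");
    }
}
```

With 2 worker nodes and 1 map per host this gives 2 maps, which matches the "Running 2 maps." line in the job output further down.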

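I originally thought test.randomwrite.bytes_per_map was the size of the test file that the random write produces, with a default of 1 GB = 1*1024*1024*1024. But after I changed it to 1*1024*1024, the output test files were still 1 GB, which puzzled me.

(PS, 2011-11-2: today I learned that this parameter is the amount of data produced by each map task. Setting it to 1*1024*1024 should therefore mean that each map task produces 1 MB.)

(PS, 2011-11-3: changing test.randomwrite.bytes_per_map does not actually change the amount of data each map task produces. It stays at 1 GB no matter what value I set. Changing test.randomwriter.maps_per_host, on the other hand, does take effect: runs with it set to 1 and to 2 both passed. Open question: where does test.randomwrite.bytes_per_map have to be changed so that it really alters the amount of data a map task produces?)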
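I do not have an answer yet, but one thing that may be worth checking is how the value is looked up. The second argument of job.getLong(name, defaultValue) is only a fallback: it is used when the property is not set anywhere in the loaded configuration, and a configured value always wins over it. The sketch below illustrates just that lookup rule. It uses java.util.Properties as a stand-in for JobConf, and the helper getLong here is a hypothetical imitation, not the Hadoop method itself.

```java
import java.util.Properties;

// Illustrates "configured value wins, in-code fallback only applies when unset".
final class DefaultLookupSketch {

    // Hypothetical stand-in for the getLong(name, defaultValue) lookup rule.
    static long getLong(Properties conf, String name, long defaultValue) {
        String raw = conf.getProperty(name);
        return raw == null ? defaultValue : Long.parseLong(raw.trim());
    }

    public static void main(String[] args) {
        String key = "test.randomwrite.bytes_per_map";
        Properties conf = new Properties();

        // Property unset: the fallback passed by the caller is returned.
        System.out.println(getLong(conf, key, 1024L * 1024));

        // Property set somewhere in the configuration: the fallback is ignored.
        conf.setProperty(key, "1073741824");
        System.out.println(getLong(conf, key, 1024L * 1024));
    }
}
```

So editing the fallback in the source can only matter if the property is unset and the rebuilt class is the one that actually runs. Whether either of these explains my 1 GB files is an assumption I have not verified.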

2. Code example

In this listing the fallback values in the code are test.randomwrite.bytes_per_map=1*1024*1024 and test.randomwriter.maps_per_host=1.

 
  
/** 
 * Licensed to the Apache Software Foundation (ASF) under one 
 * or more contributor license agreements.  See the NOTICE file 
 * distributed with this work for additional information 
 * regarding copyright ownership.  The ASF licenses this file 
 * to you under the Apache License, Version 2.0 (the 
 * "License"); you may not use this file except in compliance 
 * with the License.  You may obtain a copy of the License at 
 * 
 *     http://www.apache.org/licenses/LICENSE-2.0 
 * 
 * Unless required by applicable law or agreed to in writing, software 
 * distributed under the License is distributed on an "AS IS" BASIS, 
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
 * See the License for the specific language governing permissions and 
 * limitations under the License. 
 */  
  
package org.apache.hadoop.examples;  
  
import java.io.IOException;  
import java.util.Date;  
import java.util.Random;  
  
import org.apache.hadoop.conf.Configuration;  
import org.apache.hadoop.conf.Configured;  
import org.apache.hadoop.fs.Path;  
import org.apache.hadoop.io.BytesWritable;  
import org.apache.hadoop.io.Text;  
import org.apache.hadoop.io.Writable;  
import org.apache.hadoop.io.WritableComparable;  
import org.apache.hadoop.mapred.ClusterStatus;  
import org.apache.hadoop.mapred.FileOutputFormat;  
import org.apache.hadoop.mapred.FileSplit;  
import org.apache.hadoop.mapred.InputFormat;  
import org.apache.hadoop.mapred.InputSplit;  
import org.apache.hadoop.mapred.JobClient;  
import org.apache.hadoop.mapred.JobConf;  
import org.apache.hadoop.mapred.MapReduceBase;  
import org.apache.hadoop.mapred.Mapper;  
import org.apache.hadoop.mapred.OutputCollector;  
import org.apache.hadoop.mapred.RecordReader;  
import org.apache.hadoop.mapred.Reporter;  
import org.apache.hadoop.mapred.SequenceFileOutputFormat;  
import org.apache.hadoop.mapred.lib.IdentityReducer;  
import org.apache.hadoop.util.GenericOptionsParser;  
import org.apache.hadoop.util.Tool;  
import org.apache.hadoop.util.ToolRunner;  
  
/** 
 * This program uses map/reduce to just run a distributed job where there is 
 * no interaction between the tasks and each task write a large unsorted 
 * random binary sequence file of BytesWritable. 
 * In order for this program to generate data for terasort with 10-byte keys 
 * and 90-byte values, have the following config: 
 * <xmp> 
 * <?xml version="1.0"?> 
 * <?xml-stylesheet type="text/xsl" href="configuration.xsl"?> 
 * <configuration> 
 *   <property> 
 *     <name>test.randomwrite.min_key</name> 
 *     <value>10</value> 
 *   </property> 
 *   <property> 
 *     <name>test.randomwrite.max_key</name> 
 *     <value>10</value> 
 *   </property> 
 *   <property> 
 *     <name>test.randomwrite.min_value</name> 
 *     <value>90</value> 
 *   </property> 
 *   <property> 
 *     <name>test.randomwrite.max_value</name> 
 *     <value>90</value> 
 *   </property> 
 *   <property> 
 *     <name>test.randomwrite.total_bytes</name> 
 *     <value>1099511627776</value> 
 *   </property> 
 * </configuration></xmp> 
 *  
 * Equivalently, {@link RandomWriter} also supports all the above options 
 * and ones supported by {@link GenericOptionsParser} via the command-line. 
 */  
public class RandomWriter extends Configured implements Tool {  
    
  /** 
   * User counters 
   */  
  static enum Counters { RECORDS_WRITTEN, BYTES_WRITTEN }  
    
  /** 
   * A custom input format that creates virtual inputs of a single string 
   * for each map. 
   */  
  static class RandomInputFormat implements InputFormat<Text, Text> {  
  
    /**  
     * Generate the requested number of file splits, with the filename 
     * set to the filename of the output file. 
     */  
    public InputSplit[] getSplits(JobConf job,   
                                  int numSplits) throws IOException {  
      InputSplit[] result = new InputSplit[numSplits];  
      Path outDir = FileOutputFormat.getOutputPath(job);  
      for(int i=0; i < result.length; ++i) {  
        result[i] = new FileSplit(new Path(outDir, "dummy-split-" + i), 0, 1,   
                                  (String[])null);  
      }  
      return result;  
    }  
  
    /** 
     * Return a single record (filename, "") where the filename is taken from 
     * the file split. 
     */  
    static class RandomRecordReader implements RecordReader<Text, Text> {  
      Path name;  
      public RandomRecordReader(Path p) {  
        name = p;  
      }  
      public boolean next(Text key, Text value) {  
        if (name != null) {  
          key.set(name.getName());  
          name = null;  
          return true;  
        }  
        return false;  
      }  
      public Text createKey() {  
        return new Text();  
      }  
      public Text createValue() {  
        return new Text();  
      }  
      public long getPos() {  
        return 0;  
      }  
      public void close() {}  
      public float getProgress() {  
        return 0.0f;  
      }  
    }  
  
    public RecordReader<Text, Text> getRecordReader(InputSplit split,  
                                        JobConf job,   
                                        Reporter reporter) throws IOException {  
      return new RandomRecordReader(((FileSplit) split).getPath());  
    }  
  }  
  
  static class Map extends MapReduceBase  
    implements Mapper<WritableComparable, Writable,  
                      BytesWritable, BytesWritable> {  
      
    private long numBytesToWrite;  
    private int minKeySize;  
    private int keySizeRange;  
    private int minValueSize;  
    private int valueSizeRange;  
    private Random random = new Random();  
    private BytesWritable randomKey = new BytesWritable();  
    private BytesWritable randomValue = new BytesWritable();  
      
    private void randomizeBytes(byte[] data, int offset, int length) {  
      for(int i=offset + length - 1; i >= offset; --i) {  
        data[i] = (byte) random.nextInt(256);  
      }  
    }  
      
    /** 
     * Given an output filename, write a bunch of random records to it. 
     */  
    public void map(WritableComparable key,   
                    Writable value,  
                    OutputCollector<BytesWritable, BytesWritable> output,   
                    Reporter reporter) throws IOException {  
      int itemCount = 0;  
      while (numBytesToWrite > 0) {  
        int keyLength = minKeySize +   
          (keySizeRange != 0 ? random.nextInt(keySizeRange) : 0);  
        randomKey.setSize(keyLength);  
        randomizeBytes(randomKey.getBytes(), 0, randomKey.getLength());  
        int valueLength = minValueSize +  
          (valueSizeRange != 0 ? random.nextInt(valueSizeRange) : 0);  
        randomValue.setSize(valueLength);  
        randomizeBytes(randomValue.getBytes(), 0, randomValue.getLength());  
        output.collect(randomKey, randomValue);  
        numBytesToWrite -= keyLength + valueLength;  
        reporter.incrCounter(Counters.BYTES_WRITTEN, keyLength + valueLength);  
        reporter.incrCounter(Counters.RECORDS_WRITTEN, 1);  
        if (++itemCount % 200 == 0) {  
          reporter.setStatus("wrote record " + itemCount + ". " +   
                             numBytesToWrite + " bytes left.");  
        }  
      }  
      reporter.setStatus("done with " + itemCount + " records.");  
    }  
      
    /** 
     * Save the values out of the configuration that we need to write 
     * the data. 
     */  
    @Override  
    public void configure(JobConf job) {  
      numBytesToWrite = job.getLong("test.randomwrite.bytes_per_map",  
                                    1*1024*1024);  
      minKeySize = job.getInt("test.randomwrite.min_key", 10);  
      keySizeRange =   
        job.getInt("test.randomwrite.max_key", 1000) - minKeySize;  
      minValueSize = job.getInt("test.randomwrite.min_value", 0);  
      valueSizeRange =   
        job.getInt("test.randomwrite.max_value", 20000) - minValueSize;  
    }  
      
  }  
    
  /** 
   * This is the main routine for launching a distributed random write job. 
   * It runs test.randomwriter.maps_per_host maps/node (fallback here: 1) and
   * each map writes test.randomwrite.bytes_per_map bytes (fallback here: 1 MB)
   * to a DFS file.
   * The reduce doesn't do anything. 
   *  
   * @throws IOException  
   */  
  public int run(String[] args) throws Exception {      
    if (args.length == 0) {  
      System.out.println("Usage: writer <out-dir>");  
      ToolRunner.printGenericCommandUsage(System.out);  
      return -1;  
    }  
      
    Path outDir = new Path(args[0]);  
    JobConf job = new JobConf(getConf());  
      
    job.setJarByClass(RandomWriter.class);  
    job.setJobName("random-writer");  
    FileOutputFormat.setOutputPath(job, outDir);  
      
    job.setOutputKeyClass(BytesWritable.class);  
    job.setOutputValueClass(BytesWritable.class);  
      
    job.setInputFormat(RandomInputFormat.class);  
    job.setMapperClass(Map.class);          
    job.setReducerClass(IdentityReducer.class);  
    job.setOutputFormat(SequenceFileOutputFormat.class);  
      
    JobClient client = new JobClient(job);  
    ClusterStatus cluster = client.getClusterStatus();  
    int numMapsPerHost = job.getInt("test.randomwriter.maps_per_host", 1);  
    long numBytesToWritePerMap = job.getLong("test.randomwrite.bytes_per_map",  
                                             1*1024*1024);  
    if (numBytesToWritePerMap == 0) {  
      System.err.println("Cannot have test.randomwrite.bytes_per_map set to 0");  
      return -2;  
    }  
    long totalBytesToWrite = job.getLong("test.randomwrite.total_bytes",   
         numMapsPerHost*numBytesToWritePerMap*cluster.getTaskTrackers());  
    int numMaps = (int) (totalBytesToWrite / numBytesToWritePerMap);  
    if (numMaps == 0 && totalBytesToWrite > 0) {  
      numMaps = 1;  
      job.setLong("test.randomwrite.bytes_per_map", totalBytesToWrite);  
    }  
      
    job.setNumMapTasks(numMaps);  
    System.out.println("Running " + numMaps + " maps.");  
      
    // reducer NONE  
    job.setNumReduceTasks(0);  
      
    Date startTime = new Date();  
    System.out.println("Job started: " + startTime);  
    JobClient.runJob(job);  
    Date endTime = new Date();  
    System.out.println("Job ended: " + endTime);  
    System.out.println("The job took " +   
                       (endTime.getTime() - startTime.getTime()) /1000 +   
                       " seconds.");  
      
    return 0;  
  }  
    
  public static void main(String[] args) throws Exception {  
    int res = ToolRunner.run(new Configuration(), new RandomWriter(), args);  
    System.exit(res);  
  }  
  
}  
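
To see what the map() loop above does without running a cluster, here is a small simulation. It is a sketch under stated assumptions: it reproduces only the size bookkeeping (no bytes are generated or written), and the class name MapLoopSketch and method simulate are invented for this example.

```java
import java.util.Random;

// Simulates the size bookkeeping of Map.map(); nothing is written anywhere.
final class MapLoopSketch {

    /** Returns {recordCount, bytesWritten} for one simulated map task. */
    static long[] simulate(long bytesToWrite, int minKey, int maxKey,
                           int minValue, int maxValue, Random random) {
        int keyRange = maxKey - minKey;
        int valueRange = maxValue - minValue;
        long remaining = bytesToWrite;
        long records = 0;
        long written = 0;
        while (remaining > 0) {
            // nextInt(range) excludes range itself, so lengths fall in [min, max),
            // or are exactly min when min == max (range 0).
            int keyLength = minKey + (keyRange != 0 ? random.nextInt(keyRange) : 0);
            int valueLength = minValue + (valueRange != 0 ? random.nextInt(valueRange) : 0);
            remaining -= keyLength + valueLength;
            written += keyLength + valueLength;
            records++;
        }
        return new long[] { records, written };
    }

    public static void main(String[] args) {
        long[] r = simulate(1024L * 1024, 10, 1000, 0, 20000, new Random(42));
        System.out.println(r[0] + " records, " + r[1] + " bytes");
    }
}
```

Two things follow from the loop shape: a map always writes at least the requested number of bytes, and it overshoots by less than one maximum-size record, because the check happens only after a whole record has been counted.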

Output:

 
  
11/10/17 13:27:46 WARN conf.Configuration: DEPRECATED: hadoop-site.xml found in the classpath. Usage of hadoop-site.xml is deprecated. Instead use core-site.xml, mapred-site.xml and hdfs-site.xml to override properties of core-default.xml, mapred-default.xml and hdfs-default.xml respectively  
Running 2 maps.  
Job started: Mon Oct 17 13:27:47 CST 2011  
11/10/17 13:27:47 INFO mapred.JobClient: Running job: job_201110171322_0001  
11/10/17 13:27:48 INFO mapred.JobClient:  map 0% reduce 0%  
11/10/17 13:29:58 INFO mapred.JobClient:  map 50% reduce 0%  
11/10/17 13:30:05 INFO mapred.JobClient:  map 100% reduce 0%  
11/10/17 13:30:07 INFO mapred.JobClient: Job complete: job_201110171322_0001  
11/10/17 13:30:07 INFO mapred.JobClient: Counters: 8  
11/10/17 13:30:07 INFO mapred.JobClient:   Job Counters   
11/10/17 13:30:07 INFO mapred.JobClient:     Launched map tasks=3  
11/10/17 13:30:07 INFO mapred.JobClient:   org.apache.hadoop.examples.RandomWriter$Counters  
11/10/17 13:30:07 INFO mapred.JobClient:     BYTES_WRITTEN=2147504078  
11/10/17 13:30:07 INFO mapred.JobClient:     RECORDS_WRITTEN=204528  
11/10/17 13:30:07 INFO mapred.JobClient:   FileSystemCounters  
11/10/17 13:30:07 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=2154580318  
11/10/17 13:30:07 INFO mapred.JobClient:   Map-Reduce Framework  
11/10/17 13:30:07 INFO mapred.JobClient:     Map input records=2  
11/10/17 13:30:07 INFO mapred.JobClient:     Spilled Records=0  
11/10/17 13:30:07 INFO mapred.JobClient:     Map input bytes=0  
11/10/17 13:30:07 INFO mapred.JobClient:     Map output records=204528  
Job ended: Mon Oct 17 13:30:07 CST 2011  
The job took 140 seconds.  

Two files were produced on HDFS, in the /home/hadoop/rand directory: part-00000 (1 GB, r3) and part-00001 (1 GB, r3).
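The counters in the job output can be cross-checked against those file sizes with a few lines of arithmetic. The sketch below only divides and subtracts the numbers printed in the log; the class name CounterCheck is made up, and the figure of 2 maps comes from the "Running 2 maps." line.

```java
// Cross-checks the counters printed by the job; all inputs are copied from the log.
final class CounterCheck {
    static final long BYTES_WRITTEN = 2147504078L;
    static final long RECORDS_WRITTEN = 204528L;
    static final long HDFS_BYTES_WRITTEN = 2154580318L;
    static final int MAPS = 2;
    static final long ONE_GIB = 1L << 30;

    static long bytesPerMap()  { return BYTES_WRITTEN / MAPS; }
    static long overshoot()    { return BYTES_WRITTEN - MAPS * ONE_GIB; }
    static long avgRecord()    { return BYTES_WRITTEN / RECORDS_WRITTEN; }
    static long fileOverhead() { return HDFS_BYTES_WRITTEN - BYTES_WRITTEN; }

    public static void main(String[] args) {
        System.out.println("bytes per map:       " + bytesPerMap());   // 1073752039
        System.out.println("overshoot over 2 GiB: " + overshoot());    // 20430
        System.out.println("average record size: " + avgRecord());     // 10499
        System.out.println("HDFS bytes on top:   " + fileOverhead());  // 7076240
    }
}
```

Each map wrote just over 1 GiB (1073741824 bytes), which confirms that the job ran with 1 GB per map rather than 1 MB. The average record of about 10499 bytes is also close to the roughly 10504 bytes one would expect from the default key range (10 to 1000) and value range (0 to 20000).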

This article is reposted from xwdreamer's blog on cnblogs. Original link: http://www.cnblogs.com/xwdreamer/archive/2011/10/17/2296957.html. To repost it, please contact the original author.
