generate random or regular test data in R-阿里云开发者社区

如何在R中产生一些规则或不规则的测试数据?

产生连续分布的向量

例子 :

1. 使用冒号. 产生连续整数向量

> 1:10
 [1]  1  2  3  4  5  6  7  8  9 10
> 1:10-2    # 注意:号优先级高于减号运算符
 [1] -1  0  1  2  3  4  5  6  7  8
> 1:(10-2)
[1] 1 2 3 4 5 6 7 8
> a <- 1:10
> a
 [1]  1  2  3  4  5  6  7  8  9 10

2. 使用seq函数, 产生连续值, 可以指定步长.

> seq(1,10)
 [1]  1  2  3  4  5  6  7  8  9 10
> seq(1,10,0.5)   # from=1 to=10 by=0.5
 [1]  1.0  1.5  2.0  2.5  3.0  3.5  4.0  4.5  5.0  5.5  6.0  6.5  7.0  7.5  8.0
[16]  8.5  9.0  9.5 10.0
> seq(1,10,1)
 [1]  1  2  3  4  5  6  7  8  9 10
> seq(from=1,to=10,by=1)   # by 指定步长
 [1]  1  2  3  4  5  6  7  8  9 10
> seq(from=1,to=10,length.out=1)
[1] 1
> seq(from=1,to=10,length.out=100)
  [1]  1.000000  1.090909  1.181818  1.272727  1.363636  1.454545  1.545455
  [8]  1.636364  1.727273  1.818182  1.909091  2.000000  2.090909  2.181818
 [15]  2.272727  2.363636  2.454545  2.545455  2.636364  2.727273  2.818182
 [22]  2.909091  3.000000  3.090909  3.181818  3.272727  3.363636  3.454545
 [29]  3.545455  3.636364  3.727273  3.818182  3.909091  4.000000  4.090909
 [36]  4.181818  4.272727  4.363636  4.454545  4.545455  4.636364  4.727273
 [43]  4.818182  4.909091  5.000000  5.090909  5.181818  5.272727  5.363636
 [50]  5.454545  5.545455  5.636364  5.727273  5.818182  5.909091  6.000000
 [57]  6.090909  6.181818  6.272727  6.363636  6.454545  6.545455  6.636364
 [64]  6.727273  6.818182  6.909091  7.000000  7.090909  7.181818  7.272727
 [71]  7.363636  7.454545  7.545455  7.636364  7.727273  7.818182  7.909091
 [78]  8.000000  8.090909  8.181818  8.272727  8.363636  8.454545  8.545455
 [85]  8.636364  8.727273  8.818182  8.909091  9.000000  9.090909  9.181818
 [92]  9.272727  9.363636  9.454545  9.545455  9.636364  9.727273  9.818182
 [99]  9.909091 10.000000

3. 使用scan让用户输入

> scan()
1: 1
2: 2
3: 3
4: 4
5: 100
6: 
Read 5 items
[1]   1   2   3   4 100
> a <- scan()
1: 1
2: 10
3: 100
4: 1000
5: 
Read 4 items
> a
[1]    1   10  100 1000

4. 使用rep重复一个向量值数次, 注意each和times参数的差别.

> a
[1]    1   10  100 1000
> rep(a,each=2)  每个元素重复2次
[1]    1    1   10   10  100  100 1000 1000
> rep(a,times=2)  每个向量重复2次
[1]    1   10  100 1000    1   10  100 1000
> rep(a,each=4,length=10)  # length限制返回向量的长度
 [1]   1   1   1   1  10  10  10  10 100 100
> rep(a,times=4,length=10)
 [1]    1   10  100 1000    1   10  100 1000    1   10

5. 使用sequence函数产生一系列连续整数序列.

> sequence(c(2,3,4,5))   # 产生从1到2, 从1到3, 从1到4, 从1到5的序列.
 [1] 1 2 1 2 3 1 2 3 4 1 2 3 4 5
> sequence(2:5)   # 产生从1到2, 从1到3, 从1到4, 从1到5的序列.
 [1] 1 2 1 2 3 1 2 3 4 1 2 3 4 5
> sequence(5)   从1到5的序列.
[1] 1 2 3 4 5
> sequence(c(2,5))    # 产生从1到2, 从1到5的序列.
[1] 1 2 1 2 3 4 5

6. 使用gl 产生因子

> gl(n=2, k=3, length=10)  #  n是level数量, k是每个level的重复次数, length是总长度
 [1] 1 1 1 2 2 2 1 1 1 2
Levels: 1 2
> gl(n=2, k=3)
[1] 1 1 1 2 2 2
Levels: 1 2
> gl(n=2, k=3, labels=c("a", "b"))  # labels代替数字level
[1] a a a b b b
Levels: a b
> gl(n=2, k=3, labels=c("a", "b", "c"))  # labels代替数字level, 如果n<length(lables), 不需要的level不会出现在上面.
[1] a a a b b b
Levels: a b c
> gl(n=2, k=9, labels=c("a", "b", "c"))
 [1] a a a a a a a a a b b b b b b b b b
Levels: a b c
> gl(n=2, k=9, labels=c("a", "b", "c"), ordered=TRUE)  # 是否排序
 [1] a a a a a a a a a b b b b b b b b b
Levels: a < b < c

7. expand.grid()创建数据框(data.frame)

数据框是列长度相同的多列结构 , 每列的类型可以不一致.

3列如下, 完全匹配, (笛卡尔)

以下一共产生2*2*2行的数据框

> expand.grid(h=c(60,80), w=c(100, 300), sex=c("Male", "Female"))
   h   w    sex
1 60 100   Male
2 80 100   Male
3 60 300   Male
4 80 300   Male
5 60 100 Female
6 80 100 Female
7 60 300 Female
8 80 300 Female

以下一共产生2*2*3行的数据框

> expand.grid(h=c(60,80), w=c(100, 300), sex=c("Male", "Female", "non"))
    h   w    sex
1  60 100   Male
2  80 100   Male
3  60 300   Male
4  80 300   Male
5  60 100 Female
6  80 100 Female
7  60 300 Female
8  80 300 Female
9  60 100    non
10 80 100    non
11 60 300    non
12 80 300    non

产生规则分布的测试数据 :

在统计学中，产生随机数据是很有用的，R可以产生多种不同分布下的随机数序列。

这些分布函数的形式为rfunc(n,p1,p2,...)，其中func指概率分布函数，n为生成数据的个数，p1, p2, . . . 是分布的参数数值。

上面的表给出了每个分布的详情和可能的缺省值（如果没有给出缺省值，则意味着用户必须指定参数）。

大多数这种统计函数都有相似的形式，只需用d、p或者q去替代r (见下表)，比如 :

1. 分布函数的形式为  rfunc(n,p1,p2,...)
2. 密度函数 ( dfunc (x, ...)  , 
3. 累计概率密度函数（也即分布函数）( pfunc (x,...) ) , 
4. 分位数函数( qfunc (p, ...) , 0 < p < 1) .

最后两个函数序列可以用来求统计假设检验中P值或临界值。

例如，显著性水平为5%的正态分布的双侧临界值是 :

> qnorm(0.025)
[1] -1.959964
> qnorm(0.975)
[1] 1.959964

对于同一个检验的单侧临界值，根据备择假设的形式使用qnorm(0.05)或1 -qnorm(0.95)

一个检验的P 值，比如自由度df = 1的? 2 = 3:84 :

> 1 - pchisq(3.84, 1)
[1] 0.05004352

分布名称函数

Gaussian (normal)                   rnorm(n, mean=0, sd=1)
exponential                               rexp(n, rate=1)
gamma                                      rgamma(n, shape, scale=1)
Poisson                                     rpois(n, lambda)
Weibull                                      rweibull(n, shape, scale=1)
Cauchy                                      rcauchy(n, location=0, scale=1)
beta                                           rbeta(n, shape1, shape2)
`Student' (t)                               rt(n, df)
Fisher{Snedecor (F )               rf(n, df1, df2)
Pearson (?2)                            rchisq(n, df)
binomial                                     rbinom(n, size, prob)
multinomial                                rmultinom(n, size, prob)
geometric                                  rgeom(n, prob)
hypergeometric                         rhyper(nn, m, n, k)
logistic                                       rlogis(n, location=0, scale=1)
lognormal                                  rlnorm(n, meanlog=0, sdlog=1)
negative                                     binomial rnbinom(n, size, prob)
uniform                                       runif(n, min=0, max=1)
Wilcoxon's statistics                 rwilcox(nn, m, n), rsignrank(nn, n)

[参考]

1. help("seq")

Description:

     Generate regular sequences.  ‘seq’ is a standard generic with a
     default method.  ‘seq.int’ is a primitive which can be much faster
     but has a few restrictions.  ‘seq_along’ and ‘seq_len’ are very
     fast primitives for two common cases.

Usage:

     seq(...)
     
     ## Default S3 method:
     seq(from = 1, to = 1, by = ((to - from)/(length.out - 1)),
         length.out = NULL, along.with = NULL, ...)
     
     seq.int(from, to, by, length.out, along.with, ...)
     
     seq_along(along.with)
     seq_len(length.out)
     
Arguments:

     ...: arguments passed to or from methods.

from, to: the starting and (maximal) end values of the sequence.  Of
          length ‘1’ unless just ‘from’ is supplied as an unnamed
          argument.

      by: number: increment of the sequence.

length.out: desired length of the sequence.  A non-negative number,
          which for ‘seq’ and ‘seq.int’ will be rounded up if
          fractional.

along.with: take the length from the length of this argument.
....

2. help('gl')

Description:

     Generate factors by specifying the pattern of their levels.

Usage:

     gl(n, k, length = n*k, labels = seq_len(n), ordered = FALSE)
     
Arguments:

       n: an integer giving the number of levels.

       k: an integer giving the number of replications.

  length: an integer giving the length of the result.

  labels: an optional vector of labels for the resulting factor levels.

 ordered: a logical indicating whether the result should be ordered or
          not.

Value:

     The result has levels from ‘1’ to ‘n’ with each value replicated
     in groups of length ‘k’ out to a total length of ‘length’.

     ‘gl’ is modelled on the _GLIM_ function of the same name.

generate random or regular test data in R

热门文章

最新文章

相关电子书