Data Wrangling Part 2: Transforming your columns into the right shape

Original blog post: https://suzan.rbind.io/2018/01/dplyr-tutorial-1/
Author: Suzan Baert

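This is the second post in a series on dplyr functions. It covers tools for manipulating your columns to get them the way you want them: calculating a new column, changing a column into discrete values, or splitting and merging columns.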

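The data
As in the previous post, these functions pay off most when you have a lot of columns, but to make the code easy to copy, paste, and experiment with, I am using msleep, a dataset built into ggplot2.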

library(tidyverse)

glimpse(msleep)

## Observations: 83
## Variables: 11
## $ name         <chr> "Cheetah", "Owl monkey", "Mountain beaver", "Grea...
## $ genus        <chr> "Acinonyx", "Aotus", "Aplodontia", "Blarina", "Bo...
## $ vore         <chr> "carni", "omni", "herbi", "omni", "herbi", "herbi...
## $ order        <chr> "Carnivora", "Primates", "Rodentia", "Soricomorph...
## $ conservation <chr> "lc", NA, "nt", "lc", "domesticated", NA, "vu", N...
## $ sleep_total  <dbl> 12.1, 17.0, 14.4, 14.9, 4.0, 14.4, 8.7, 7.0, 10.1...
## $ sleep_rem    <dbl> NA, 1.8, 2.4, 2.3, 0.7, 2.2, 1.4, NA, 2.9, NA, 0....
## $ sleep_cycle  <dbl> NA, NA, NA, 0.1333333, 0.6666667, 0.7666667, 0.38...
## $ awake        <dbl> 11.9, 7.0, 9.6, 9.1, 20.0, 9.6, 15.3, 17.0, 13.9,...
## $ brainwt      <dbl> NA, 0.01550, NA, 0.00029, 0.42300, NA, NA, NA, 0....
## $ bodywt       <dbl> 50.000, 0.480, 1.350, 0.019, 600.000, 3.850, 20.4...

Mutating columns: the basics

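You can make new columns with the mutate() function. The options inside mutate are almost endless: pretty much anything you can do to a normal vector can be done inside mutate(). Anything inside mutate can either be a new column (by giving mutate a new column name) or replace an existing column (by keeping the same column name).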

One of the simplest options is a calculation based on values in other columns. The sample code converts the sleep data from hours to minutes.

msleep %>%
  select(name, sleep_total) %>%
  mutate(sleep_total_min = sleep_total * 60)

## # A tibble: 83 x 3
##    name                       sleep_total sleep_total_min
##    <chr>                            <dbl>           <dbl>
##  1 Cheetah                          12.1              726
##  2 Owl monkey                       17.0             1020
##  3 Mountain beaver                  14.4              864
##  4 Greater short-tailed shrew       14.9              894
##  5 Cow                               4.00             240
##  6 Three-toed sloth                 14.4              864
##  7 Northern fur seal                 8.70             522
##  8 Vesper mouse                      7.00             420
##  9 Dog                              10.1              606
## 10 Roe deer                          3.00             180
## # ... with 73 more rows
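To make the difference between adding and replacing a column concrete, here is a minimal sketch on a tiny hand-made tibble. The `animals` data and its values are hypothetical and chosen only for illustration. Giving mutate() a new name adds a column; reusing an existing name overwrites it.

```r
library(dplyr)

# Hypothetical toy data, not part of msleep
animals <- tibble(name = c("Cow", "Dog"), sleep_total = c(4, 10.1))

# New name: a column is added next to the original one
added <- animals %>%
  mutate(sleep_total_min = sleep_total * 60)

# Same name: the original column is overwritten
replaced <- animals %>%
  mutate(sleep_total = sleep_total * 60)

names(added)     # "name" "sleep_total" "sleep_total_min"
names(replaced)  # "name" "sleep_total"
```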

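New columns can also be made with aggregate functions such as mean, median, max, min, sd, and so on. The sample code makes two new columns: one showing the difference between each observation and the average sleep time, and one showing the difference from the animal that sleeps the least.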

msleep %>%
  select(name, sleep_total) %>%
  mutate(sleep_total_vs_AVG = sleep_total - round(mean(sleep_total), 1),
         sleep_total_vs_MIN = sleep_total - min(sleep_total))

## # A tibble: 83 x 4
##    name                       sleep_total sleep_total_vs_AVG sleep_total_~
##    <chr>                            <dbl>              <dbl>         <dbl>
##  1 Cheetah                          12.1               1.70          10.2 
##  2 Owl monkey                       17.0               6.60          15.1 
##  3 Mountain beaver                  14.4               4.00          12.5 
##  4 Greater short-tailed shrew       14.9               4.50          13.0 
##  5 Cow                               4.00             -6.40           2.10
##  6 Three-toed sloth                 14.4               4.00          12.5 
##  7 Northern fur seal                 8.70             -1.70           6.80
##  8 Vesper mouse                      7.00             -3.40           5.10
##  9 Dog                              10.1              -0.300          8.20
## 10 Roe deer                          3.00             -7.40           1.10
## # ... with 73 more rows

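In the comments on the original post, Steve asked about aggregate functions across columns. By nature these functions summarise a whole column (as shown above), so if you use sum() or mean() across columns you may run into errors or absurd answers. In these cases you can either fall back on spelling out the arithmetic, mutate(average = (sleep_rem + sleep_cycle) / 2), or add a special instruction to the pipe saying that these aggregate functions should be performed not on the entire column but by row: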

#alternative to using the actual arithmetics:
msleep %>%
  select(name, contains("sleep")) %>%
  rowwise() %>%
  mutate(avg = mean(c(sleep_rem, sleep_cycle)))

## Source: local data frame [83 x 5]
## Groups: <by row>
##
## # A tibble: 83 x 5
##    name                       sleep_total sleep_rem sleep_cycle    avg
##    <chr>                            <dbl>     <dbl>       <dbl>  <dbl>
##  1 Cheetah                          12.1     NA          NA     NA    
##  2 Owl monkey                       17.0      1.80       NA     NA    
##  3 Mountain beaver                  14.4      2.40       NA     NA    
##  4 Greater short-tailed shrew       14.9      2.30        0.133  1.22
##  5 Cow                               4.00     0.700       0.667  0.683
##  6 Three-toed sloth                 14.4      2.20        0.767  1.48
##  7 Northern fur seal                 8.70     1.40        0.383  0.892
##  8 Vesper mouse                      7.00    NA          NA     NA    
##  9 Dog                              10.1      2.90        0.333  1.62
## 10 Roe deer                          3.00    NA          NA     NA    
## # ... with 73 more rows
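As the output shows, a row gets NA as soon as one of its inputs is NA. If you would rather average whatever values are present, one possible approach is base R's rowMeans() with na.rm = TRUE. This is a sketch on hypothetical toy data (the `sleep` tibble below is made up), not the method used in the original post.

```r
library(dplyr)

# Hypothetical toy data with missing values
sleep <- tibble(
  name        = c("a", "b", "c"),
  sleep_rem   = c(2, 1, NA),
  sleep_cycle = c(1, NA, NA)
)

row_avg <- sleep %>%
  mutate(
    # Spelled-out arithmetic: NA as soon as one input is NA
    avg_strict = (sleep_rem + sleep_cycle) / 2,
    # rowMeans() on the bound columns, ignoring NAs within each row
    avg_lenient = rowMeans(cbind(sleep_rem, sleep_cycle), na.rm = TRUE)
  )
```

Note that a row in which every input is NA yields NaN rather than NA with this approach, because there is nothing left to average.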

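The ifelse() function deserves a special mention because it is particularly useful when you do not want to mutate the whole column in the same way. With ifelse(), you first specify a logical statement, then what needs to happen if the statement returns TRUE, and lastly what needs to happen if it returns FALSE.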

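Imagine that we have a dataset with two really large values which we assume are typos or measurement errors, and we want to exclude them. The code below takes any brainwt value above 4 and returns NA instead. Values of 4 or below are left unchanged.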

msleep %>%
  select(name, brainwt) %>%
  mutate(brainwt2 = ifelse(brainwt > 4, NA, brainwt)) %>%
  arrange(desc(brainwt))

## # A tibble: 83 x 3
##    name             brainwt brainwt2
##    <chr>              <dbl>    <dbl>
##  1 African elephant   5.71    NA    
##  2 Asian elephant     4.60    NA    
##  3 Human              1.32     1.32 
##  4 Horse              0.655    0.655
##  5 Chimpanzee         0.440    0.440
##  6 Cow                0.423    0.423
##  7 Donkey             0.419    0.419
##  8 Gray seal          0.325    0.325
##  9 Baboon             0.180    0.180
## 10 Pig                0.180    0.180
## # ... with 73 more rows
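dplyr also offers if_else(), a stricter variant of ifelse() that checks that both outcomes have the same type. The sketch below repeats the idea on hypothetical toy data (the `brains` tibble is made up); NA_real_ is used so that the NA has the same numeric type as brainwt.

```r
library(dplyr)

# Hypothetical toy data: one outlier, one normal value, one missing value
brains <- tibble(name = c("elephant", "human", "unknown"),
                 brainwt = c(5.71, 1.32, NA))

brains_clean <- brains %>%
  # Values above 4 become NA; a missing brainwt stays missing
  mutate(brainwt2 = if_else(brainwt > 4, NA_real_, brainwt))
```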

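You can also mutate string columns with stringr's str_extract() function in combination with any character or regex pattern. The sample code returns the last word of the animal name and makes it lower case.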

msleep %>%
  select(name) %>%
  mutate(name_last_word = tolower(str_extract(name, pattern = "\\w+$")))

## # A tibble: 83 x 2
##    name                       name_last_word
##    <chr>                      <chr>         
##  1 Cheetah                    cheetah       
##  2 Owl monkey                 monkey        
##  3 Mountain beaver            beaver        
##  4 Greater short-tailed shrew shrew         
##  5 Cow                        cow           
##  6 Three-toed sloth           sloth         
##  7 Northern fur seal          seal          
##  8 Vesper mouse               mouse         
##  9 Dog                        dog           
## 10 Roe deer                   deer          
## # ... with 73 more rows
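The pattern decides which part of the string comes back. As a small sketch on a hypothetical character vector, anchoring the pattern at the start (^) instead of the end ($) returns the first word rather than the last:

```r
library(stringr)

# Hypothetical sample of animal names
animal_names <- c("Owl monkey", "Greater short-tailed shrew", "Cow")

# "^\\w+" matches word characters at the start of the string
first_word <- tolower(str_extract(animal_names, "^\\w+"))

# "\\w+$" matches word characters at the end of the string
last_word <- tolower(str_extract(animal_names, "\\w+$"))
```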

Mutating several columns at once

This is where the magic really happens. Just like the select() function in part 1, mutate() has variants:

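* mutate_all() mutates all columns based on your further instructions.
* mutate_if() first requires a function that returns a boolean to select columns. If it returns TRUE for a column, the mutate instructions are applied to that column.
* mutate_at() requires you to specify the columns you want to mutate inside a vars() argument.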
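Since dplyr 1.0.0 these scoped variants are marked as superseded in favour of across() used inside a plain mutate(). They still work, but if you are on a recent dplyr version the three variants can be sketched as follows. The `toy` tibble is hypothetical and only there to keep the example self-contained.

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(name = c("Cow", "Dog"),
              sleep_total = c(4.04, 10.16),
              awake = c(19.96, 13.84))

# Counterpart of mutate_all(tolower): every column, numbers become text
all_lower <- toy %>% mutate(across(everything(), tolower))

# Counterpart of mutate_if(is.numeric, round)
rounded <- toy %>% mutate(across(where(is.numeric), round))

# Counterpart of mutate_at(vars(contains("sleep")), ~ . * 60)
in_minutes <- toy %>% mutate(across(contains("sleep"), ~ .x * 60))
```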

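Mutate all

The mutate_all() version is the easiest to understand, and pretty nifty when cleaning your data. You just pass the action (in the form of a function) that you want to apply across all columns. An easy one to start with: turning all the data to lower case: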

msleep %>%
  mutate_all(tolower)

## # A tibble: 83 x 11
##    name   genus vore  order conservation sleep_total sleep_rem sleep_cycle
##    <chr>  <chr> <chr> <chr> <chr>        <chr>       <chr>     <chr>      
##  1 cheet~ acin~ carni carn~ lc           12.1        <NA>      <NA>       
##  2 owl m~ aotus omni  prim~ <NA>         17          1.8       <NA>       
##  3 mount~ aplo~ herbi rode~ nt           14.4        2.4       <NA>       
##  4 great~ blar~ omni  sori~ lc           14.9        2.3       0.133333333
##  5 cow    bos   herbi arti~ domesticated 4           0.7       0.666666667
##  6 three~ brad~ herbi pilo~ <NA>         14.4        2.2       0.766666667
##  7 north~ call~ carni carn~ vu           8.7         1.4       0.383333333
##  8 vespe~ calo~ <NA>  rode~ <NA>         7           <NA>      <NA>       
##  9 dog    canis carni carn~ domesticated 10.1        2.9       0.333333333
## 10 roe d~ capr~ herbi arti~ lc           3           <NA>      <NA>       
## # ... with 73 more rows, and 3 more variables: awake <chr>, brainwt <chr>,
## #   bodywt <chr>

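The mutating action needs to be a function. In many cases you can pass the function name without the brackets, but in some cases you need arguments or you want to combine elements. In that case you have some options: either define a function up front (useful if it is longer), or make a function on the fly by wrapping it inside funs() or writing it with a tilde. To have something to clean, I first have to mess things up using mutate_all(): the paste mutation below requires a function on the fly. You can use either ~paste(., "  /n  ") or funs(paste(., "  /n  ")). When making a function on the fly, you usually need a way to refer to the value you are replacing: that is what the . symbol is for.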

msleep_ohno <- msleep %>%
  mutate_all(~paste(., "  /n  "))

msleep_ohno[,1:4]

## # A tibble: 83 x 4
##    name                                genus                vore    order
##    <chr>                               <chr>                <chr>   <chr>
##  1 "Cheetah   /n  "                    "Acinonyx   /n  "    "carni~ "Carn~
##  2 "Owl monkey   /n  "                 "Aotus   /n  "       "omni ~ "Prim~
##  3 "Mountain beaver   /n  "            "Aplodontia   /n  "  "herbi~ "Rode~
##  4 "Greater short-tailed shrew   /n  " "Blarina   /n  "     "omni ~ "Sori~
##  5 "Cow   /n  "                        "Bos   /n  "         "herbi~ "Arti~
##  6 "Three-toed sloth   /n  "           "Bradypus   /n  "    "herbi~ "Pilo~
##  7 "Northern fur seal   /n  "          "Callorhinus   /n  " "carni~ "Carn~
##  8 "Vesper mouse   /n  "               "Calomys   /n  "     "NA   ~ "Rode~
##  9 "Dog   /n  "                        "Canis   /n  "       "carni~ "Carn~
## 10 "Roe deer   /n  "                   "Capreolus   /n  "   "herbi~ "Arti~
## # ... with 73 more rows

Let's clean it up again:
This code first removes any /n, and then trims any remaining whitespace:

msleep_corr <- msleep_ohno %>%
  mutate_all(~str_replace_all(., "/n", "")) %>%
  mutate_all(str_trim)

msleep_corr[,1:4]

## # A tibble: 83 x 4
##    name                       genus       vore  order       
##    <chr>                      <chr>       <chr> <chr>       
##  1 Cheetah                    Acinonyx    carni Carnivora   
##  2 Owl monkey                 Aotus       omni  Primates    
##  3 Mountain beaver            Aplodontia  herbi Rodentia    
##  4 Greater short-tailed shrew Blarina     omni  Soricomorpha
##  5 Cow                        Bos         herbi Artiodactyla
##  6 Three-toed sloth           Bradypus    herbi Pilosa      
##  7 Northern fur seal          Callorhinus carni Carnivora   
##  8 Vesper mouse               Calomys     NA    Rodentia    
##  9 Dog                        Canis       carni Carnivora   
## 10 Roe deer                   Capreolus   herbi Artiodactyla
## # ... with 73 more rows
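The other option mentioned above, defining the function up front, could look like the sketch below. The helper name clean_text and the `messy` tibble are hypothetical; the helper simply bundles the two cleaning steps so they can be passed to mutate_all() by name.

```r
library(dplyr)
library(stringr)

# Hypothetical helper, defined once and reused: drop "/n", then trim spaces
clean_text <- function(x) {
  str_trim(str_replace_all(x, "/n", ""))
}

# Hypothetical messy data in the style of msleep_ohno
messy <- tibble(name = c("Cow   /n  ", "Dog   /n  "),
                vore = c("herbi   /n  ", "carni   /n  "))

# A named function is passed without brackets
tidy <- messy %>% mutate_all(clean_text)
```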

Mutate if

Not all cleaning functions can be done with mutate_all(). Trying to round your data will lead to an error if you have both numeric and character columns.

msleep %>%
  mutate_all(round)

Error in mutate_impl(.data, dots) : Evaluation error: non-numeric argument to mathematical function.

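In these cases we have to add the condition that columns need to be numeric before giving the round() instruction, which can be done with mutate_if().

mutate_if() needs two arguments inside the pipe:

First, it needs information about the columns. This information has to be a function that returns a boolean value. The easiest cases are is.numeric, is.integer, is.double, is.logical, is.factor, lubridate::is.POSIXt or lubridate::is.Date.
Second, it needs the mutating instruction in the form of a function. If needed, write it with a tilde or wrap it in funs() (see above).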

msleep %>%
  select(name, sleep_total:bodywt) %>%
  mutate_if(is.numeric, round)

## # A tibble: 83 x 7
##    name             sleep_total sleep_rem sleep_cycle awake brainwt bodywt
##    <chr>                  <dbl>     <dbl>       <dbl> <dbl>   <dbl>  <dbl>
##  1 Cheetah                12.0      NA          NA    12.0       NA  50.0
##  2 Owl monkey             17.0       2.00       NA     7.00       0   0   
##  3 Mountain beaver        14.0       2.00       NA    10.0       NA   1.00
##  4 Greater short-t~       15.0       2.00        0     9.00       0   0   
##  5 Cow                     4.00      1.00        1.00 20.0        0 600   
##  6 Three-toed sloth       14.0       2.00        1.00 10.0       NA   4.00
##  7 Northern fur se~        9.00      1.00        0    15.0       NA  20.0
##  8 Vesper mouse            7.00     NA          NA    17.0       NA   0   
##  9 Dog                    10.0       3.00        0    14.0        0  14.0
## 10 Roe deer                3.00     NA          NA    21.0        0  15.0
## # ... with 73 more rows
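If the mutating function needs an argument, for example the number of digits to round to, there are two ways to supply it. The sketch below uses a hypothetical toy tibble: extra arguments can be given after the function name, or the call can be written out with a tilde.

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(name = c("Cow", "Dog"), sleep_rem = c(0.667, 2.912))

# Extra arguments after the function are passed on to it
r1 <- toy %>% mutate_if(is.numeric, round, digits = 1)

# The same instruction written as an on-the-fly function with a tilde
r2 <- toy %>% mutate_if(is.numeric, ~ round(., digits = 1))
```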

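Mutate at to change specific columns

mutate_at() needs two arguments inside the pipe:

First, it needs information about the columns. In this case you can take any selection of columns (using all the options possible inside a select() function) and wrap it in vars().
Second, it needs the mutating instruction in the form of a function. If needed, write it with a tilde or wrap it in funs() (see above).

All sleep-measuring columns are in hours. If I want them in minutes, I can use mutate_at() and wrap all columns containing 'sleep' inside vars(). Second, I make a function on the fly that multiplies every value by 60.
The sample code shows that in this case all sleep columns have been changed into minutes, but awake has not.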

msleep %>%
  select(name, sleep_total:awake) %>%
  mutate_at(vars(contains("sleep")), ~(.*60))

## # A tibble: 83 x 5
##    name                       sleep_total sleep_rem sleep_cycle awake
##    <chr>                            <dbl>     <dbl>       <dbl> <dbl>
##  1 Cheetah                            726      NA         NA    11.9
##  2 Owl monkey                        1020     108         NA     7.00
##  3 Mountain beaver                    864     144         NA     9.60
##  4 Greater short-tailed shrew         894     138          8.00  9.10
##  5 Cow                                240      42.0       40.0  20.0
##  6 Three-toed sloth                   864     132         46.0   9.60
##  7 Northern fur seal                  522      84.0       23.0  15.3
##  8 Vesper mouse                       420      NA         NA    17.0
##  9 Dog                                606     174         20.0  13.9
## 10 Roe deer                           180      NA         NA    21.0
## # ... with 73 more rows

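Changing column names after mutation

With a single mutate() statement you immediately have the option to change the column name. In the example above, for instance, it is confusing that the sleep columns are now in a different unit; you can change that by calling a rename function: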

msleep %>%
  select(name, sleep_total:awake) %>%
  mutate_at(vars(contains("sleep")), ~(.*60)) %>%
  rename_at(vars(contains("sleep")), ~paste0(.,"_min"))

## # A tibble: 83 x 5
##    name                sleep_total_min sleep_rem_min sleep_cycle_min awake
##    <chr>                         <dbl>         <dbl>           <dbl> <dbl>
##  1 Cheetah                         726          NA             NA    11.9
##  2 Owl monkey                     1020         108             NA     7.00
##  3 Mountain beaver                 864         144             NA     9.60
##  4 Greater short-tail~             894         138              8.00  9.10
##  5 Cow                             240          42.0           40.0  20.0
##  6 Three-toed sloth                864         132             46.0   9.60
##  7 Northern fur seal               522          84.0           23.0  15.3
##  8 Vesper mouse                    420          NA             NA    17.0
##  9 Dog                             606         174             20.0  13.9
## 10 Roe deer                        180          NA             NA    21.0
## # ... with 73 more rows

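As Tomas McManus (https://twitter.com/TomasMcManus1/status/981187099649912832) pointed out: you can assign a "tag" inside funs() that will be appended to the current name. The main difference between the two options: the funs() version is one line of code shorter, but columns are added rather than replaced. Depending on your scenario, either could be useful.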

msleep %>%
  select(name, sleep_total:awake) %>%
  mutate_at(vars(contains("sleep")), funs(min = .*60))

## # A tibble: 83 x 8
##    name            sleep_total sleep_rem sleep_cycle awake sleep_total_min
##    <chr>                 <dbl>     <dbl>       <dbl> <dbl>           <dbl>
##  1 Cheetah               12.1     NA          NA     11.9              726
##  2 Owl monkey            17.0      1.80       NA      7.00            1020
##  3 Mountain beaver       14.4      2.40       NA      9.60             864
##  4 Greater short-~       14.9      2.30        0.133  9.10             894
##  5 Cow                    4.00     0.700       0.667 20.0              240
##  6 Three-toed slo~       14.4      2.20        0.767  9.60             864
##  7 Northern fur s~        8.70     1.40        0.383 15.3              522
##  8 Vesper mouse           7.00    NA          NA     17.0              420
##  9 Dog                   10.1      2.90        0.333 13.9              606
## 10 Roe deer               3.00    NA          NA     21.0              180
## # ... with 73 more rows, and 2 more variables: sleep_rem_min <dbl>,
## #   sleep_cycle_min <dbl>
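funs() has been deprecated since dplyr 0.8.0, so on a recent installation the call above may produce a warning or fail. Two possible replacements are sketched below on a hypothetical toy tibble: a named list of tilde functions inside mutate_at(), or across() with an explicit .names pattern. Both add the new columns next to the originals.

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(name = c("Cow", "Dog"),
              sleep_total = c(4, 10),
              sleep_rem = c(0.5, 3))

# Named list instead of funs(): the name "min" is appended as a suffix
added_list <- toy %>%
  mutate_at(vars(contains("sleep")), list(min = ~ . * 60))

# across() with an explicit naming pattern for the new columns
added_across <- toy %>%
  mutate(across(contains("sleep"), ~ .x * 60, .names = "{.col}_min"))
```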

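Working with discrete columns

Recoding discrete columns

To rename or reorganize current discrete columns, you can use recode() inside a mutate() statement: this lets you change the current naming, or group current levels into fewer levels. .default refers to anything that is not covered by the preceding groups, with the exception of NA. If needed, you can change NA into something other than NA by adding a .missing argument (see the next sample code).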

msleep %>%
  mutate(conservation2 = recode(conservation,
                        "en" = "Endangered",
                        "lc" = "Least_Concern",
                        "domesticated" = "Least_Concern",
                        .default = "other")) %>%
  count(conservation2)

## # A tibble: 4 x 2
##   conservation2     n
##   <chr>         <int>
## 1 Endangered        4
## 2 Least_Concern    37
## 3 other            13
## 4 <NA>             29

A special version exists to return a factor: recode_factor(). By default the .ordered argument is FALSE. To return an ordered factor set the argument to TRUE:

msleep %>%
  mutate(conservation2 = recode_factor(conservation,
                        "en" = "Endangered",
                        "lc" = "Least_Concern",
                        "domesticated" = "Least_Concern",
                        .default = "other",
                        .missing = "no data",
                        .ordered = TRUE)) %>%
  count(conservation2)

## # A tibble: 4 x 2
##   conservation2     n
##   <ord>         <int>
## 1 Endangered        4
## 2 Least_Concern    37
## 3 other            13
## 4 no data          29

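Creating new discrete columns (two levels)

The ifelse() statement can be used to turn a numeric column into a discrete one. As mentioned above, ifelse() takes a logical expression, then what to do if the expression returns TRUE, and lastly what to do when it returns FALSE. The sample code divides the current measure sleep_total into discrete "long" or "short" sleepers.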

msleep %>%
  select(name, sleep_total) %>%
  mutate(sleep_time = ifelse(sleep_total > 10, "long", "short")) 

## # A tibble: 83 x 3
##    name                       sleep_total sleep_time
##    <chr>                            <dbl> <chr>     
##  1 Cheetah                          12.1  long      
##  2 Owl monkey                       17.0  long      
##  3 Mountain beaver                  14.4  long      
##  4 Greater short-tailed shrew       14.9  long      
##  5 Cow                               4.00 short     
##  6 Three-toed sloth                 14.4  long      
##  7 Northern fur seal                 8.70 short     
##  8 Vesper mouse                      7.00 short     
##  9 Dog                              10.1  long      
## 10 Roe deer                          3.00 short     
## # ... with 73 more rows

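Creating new discrete columns (multiple levels)

ifelse() can be nested, but if you want more than two levels it is usually easier to use case_when(), which allows as many statements as you like and is easier to read than many nested ifelse() statements.
The arguments are evaluated in order, so only the rows for which the first statement is not true are evaluated against the next statement. For everything that is left at the end, just use TRUE ~ "newname".
Unfortunately there seems to be no easy way to get case_when() to return an ordered factor, so you need to do that yourself afterwards, either with forcats::fct_relevel() or with a plain factor() call. If you have a lot of levels, I would advise making a levels vector up front to avoid cluttering the code too much.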

msleep %>%
  select(name, sleep_total) %>%
  mutate(sleep_total_discr = case_when(
    sleep_total > 13 ~ "very long",
    sleep_total > 10 ~ "long",
    sleep_total > 7 ~ "limited",
    TRUE ~ "short")) %>%
  mutate(sleep_total_discr = factor(sleep_total_discr, 
                                    levels = c("short", "limited", 
                                               "long", "very long")))

## # A tibble: 83 x 3
##    name                       sleep_total sleep_total_discr
##    <chr>                            <dbl> <fctr>           
##  1 Cheetah                          12.1  long             
##  2 Owl monkey                       17.0  very long        
##  3 Mountain beaver                  14.4  very long        
##  4 Greater short-tailed shrew       14.9  very long        
##  5 Cow                               4.00 short            
##  6 Three-toed sloth                 14.4  very long        
##  7 Northern fur seal                 8.70 limited          
##  8 Vesper mouse                      7.00 short            
##  9 Dog                              10.1  long             
## 10 Roe deer                          3.00 short            
## # ... with 73 more rows
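The advice about a levels vector made up front could look like the sketch below. The vector name sleep_levels and the `toy` tibble are hypothetical. Adding ordered = TRUE to factor() is an extra step beyond the sample code above: it returns an ordered factor, so the levels can be compared with < and >.

```r
library(dplyr)

# Hypothetical levels vector, defined once, from lowest to highest
sleep_levels <- c("short", "limited", "long", "very long")

# Hypothetical toy data
toy <- tibble(sleep_total = c(3, 8.7, 12.1, 17))

binned <- toy %>%
  mutate(sleep_total_discr = case_when(
    sleep_total > 13 ~ "very long",
    sleep_total > 10 ~ "long",
    sleep_total > 7  ~ "limited",
    TRUE             ~ "short"
  )) %>%
  # Convert to an ordered factor using the predefined levels
  mutate(sleep_total_discr = factor(sleep_total_discr,
                                    levels = sleep_levels,
                                    ordered = TRUE))
```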

The case_when() function does not only work within a single column; it can also be used for grouping across columns:

msleep %>%
  mutate(silly_groups = case_when(
    brainwt < 0.001 ~ "light_headed",
    sleep_total > 10 ~ "lazy_sleeper",
    is.na(sleep_rem) ~ "absent_rem",
    TRUE ~ "other")) %>%
  count(silly_groups)

## # A tibble: 4 x 2
##   silly_groups     n
##   <chr>        <int>
## 1 absent_rem       8
## 2 lazy_sleeper    39
## 3 light_headed     6
## 4 other           30

Splitting and merging columns

Data source: (https://raw.githubusercontent.com/suzanbaert/RTutorials/master/Rmd_originals/conservation_explanation.csv)

(conservation_expl <- read_csv("conservation_explanation.csv"))

## # A tibble: 11 x 1
##    `conservation abbreviation`                  
##    <chr>                                        
##  1 EX = Extinct                                 
##  2 EW = Extinct in the wild                     
##  3 CR = Critically Endangered                   
##  4 EN = Endangered                              
##  5 VU = Vulnerable                              
##  6 NT = Near Threatened                         
##  7 LC = Least Concern                           
##  8 DD = Data deficient                          
##  9 NE = Not evaluated                           
## 10 PE = Probably extinct (informal)             
## 11 PEW = Probably extinct in the wild (informal)

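You can split a column with tidyr's separate() function. To do this, you specify the column to be split, followed by the new column names and the separator to split on. The sample code splits the column in two, using ' = ' as the separator.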

(conservation_table <- conservation_expl %>%
  separate(`conservation abbreviation`, 
           into = c("abbreviation", "description"), sep = " = "))

## # A tibble: 11 x 2
##    abbreviation description                            
##  * <chr>        <chr>                                  
##  1 EX           Extinct                                
##  2 EW           Extinct in the wild                    
##  3 CR           Critically Endangered                  
##  4 EN           Endangered                             
##  5 VU           Vulnerable                             
##  6 NT           Near Threatened                        
##  7 LC           Least Concern                          
##  8 DD           Data deficient                         
##  9 NE           Not evaluated                          
## 10 PE           Probably extinct (informal)            
## 11 PEW          Probably extinct in the wild (informal)

The opposite is tidyr's unite() function. You specify the new column name, then the columns to be united, and lastly the separator you want to use.

conservation_table %>%
  unite(united_col, abbreviation, description, sep=": ")

## # A tibble: 11 x 1
##    united_col                                  
##  * <chr>                                       
##  1 EX: Extinct                                 
##  2 EW: Extinct in the wild                     
##  3 CR: Critically Endangered                   
##  4 EN: Endangered                              
##  5 VU: Vulnerable                              
##  6 NT: Near Threatened                         
##  7 LC: Least Concern                           
##  8 DD: Data deficient                          
##  9 NE: Not evaluated                           
## 10 PE: Probably extinct (informal)             
## 11 PEW: Probably extinct in the wild (informal)
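A quick way to convince yourself that separate() and unite() are each other's opposite is a round trip: split a column, unite it again with the same separator, and compare with the input. The sketch below does this on a hypothetical two-row tibble, so it does not need the CSV file.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data in the same "ABBR = Description" format
codes <- tibble(raw = c("EN = Endangered", "LC = Least Concern"))

# Split on " = " into two columns
split <- codes %>%
  separate(raw, into = c("abbreviation", "description"), sep = " = ")

# Unite them again with the same separator
rejoined <- split %>%
  unite(raw, abbreviation, description, sep = " = ")
```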

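Bringing in columns from other data tables

If you want to add information from another table, you can use the joining functions from dplyr. Joining deserves a chapter of its own, but in this particular case you would do a left_join(), i.e. keep the main table (on the left) and add columns from another table on the right. In the by = statement you specify which columns are the same, so the join knows what to add where.
The sample code adds the description of the different conservation states to the main msleep table. The main data contains an extra "domesticated" label which I want to keep. This is done in the last line of the code with an ifelse().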

msleep %>%
  select(name, conservation) %>%
  mutate(conservation = toupper(conservation)) %>%
  left_join(conservation_table, by = c("conservation" = "abbreviation")) %>%
  mutate(description = ifelse(is.na(description), conservation, description))

## # A tibble: 83 x 3
##    name                       conservation description    
##    <chr>                      <chr>        <chr>          
##  1 Cheetah                    LC           Least Concern  
##  2 Owl monkey                 <NA>         <NA>           
##  3 Mountain beaver            NT           Near Threatened
##  4 Greater short-tailed shrew LC           Least Concern  
##  5 Cow                        DOMESTICATED DOMESTICATED   
##  6 Three-toed sloth           <NA>         <NA>           
##  7 Northern fur seal          VU           Vulnerable     
##  8 Vesper mouse               <NA>         <NA>           
##  9 Dog                        DOMESTICATED DOMESTICATED   
## 10 Roe deer                   LC           Least Concern  
## # ... with 73 more rows
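The mechanics are easier to see on a small example. In the sketch below both tables are hypothetical: the lookup table has no entry for "DOMESTICATED", so the join leaves its description as NA, and the last step fills that gap with the original label. coalesce() is used here as a compact alternative to the ifelse(is.na(...)) construction; it returns the first non-missing value.

```r
library(dplyr)

# Hypothetical main table and lookup table
animals <- tibble(name = c("Cheetah", "Cow", "Owl monkey"),
                  conservation = c("LC", "DOMESTICATED", NA))
lookup <- tibble(abbreviation = c("LC", "VU"),
                 description = c("Least Concern", "Vulnerable"))

joined <- animals %>%
  # Keep every row of animals; unmatched keys get NA in description
  left_join(lookup, by = c("conservation" = "abbreviation")) %>%
  # Fall back on the original label where no description was found
  mutate(description = coalesce(description, conservation))
```

The unused "VU" row of the lookup table does not appear in the result, and a row whose conservation status is missing keeps a missing description.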

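Spreading and gathering data

The gather() function gathers multiple columns into one. In this case, we have 3 columns that describe a time measure. For some analyses and graphs it might be necessary to get them all into one column.
The gather function needs you to give a name ("key") for the new descriptive column and another name ("value") for the value column. At the end you deselect the columns you do not want gathered. In the sample code I deselect the column name.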

msleep %>%
  select(name, contains("sleep")) %>%
  gather(key = "sleep_measure", value = "time", -name)

## # A tibble: 249 x 3
##    name                       sleep_measure  time
##    <chr>                      <chr>         <dbl>
##  1 Cheetah                    sleep_total   12.1 
##  2 Owl monkey                 sleep_total   17.0 
##  3 Mountain beaver            sleep_total   14.4 
##  4 Greater short-tailed shrew sleep_total   14.9 
##  5 Cow                        sleep_total    4.00
##  6 Three-toed sloth           sleep_total   14.4 
##  7 Northern fur seal          sleep_total    8.70
##  8 Vesper mouse               sleep_total    7.00
##  9 Dog                        sleep_total   10.1 
## 10 Roe deer                   sleep_total    3.00
## # ... with 239 more rows

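A useful argument in gathering is factor_key, which is FALSE by default. In the previous example the new column sleep_measure is a character vector. If you summarise or plot afterwards, that column will be ordered alphabetically. If you want to preserve the original order, add factor_key = TRUE, which makes the new column a factor whose levels follow the original column order.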

(msleep_g <- msleep %>%
  select(name, contains("sleep")) %>%
  gather(key = "sleep_measure", value = "time", -name, factor_key = TRUE))

## # A tibble: 249 x 3
##    name                       sleep_measure  time
##    <chr>                      <fctr>        <dbl>
##  1 Cheetah                    sleep_total   12.1 
##  2 Owl monkey                 sleep_total   17.0 
##  3 Mountain beaver            sleep_total   14.4 
##  4 Greater short-tailed shrew sleep_total   14.9 
##  5 Cow                        sleep_total    4.00
##  6 Three-toed sloth           sleep_total   14.4 
##  7 Northern fur seal          sleep_total    8.70
##  8 Vesper mouse               sleep_total    7.00
##  9 Dog                        sleep_total   10.1 
## 10 Roe deer                   sleep_total    3.00
## # ... with 239 more rows

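The opposite of gathering is spreading. spread() takes one column and makes multiple columns out of it. Starting from the gathered table above, you can get the different sleep measures back into separate columns: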

msleep_g %>%
  spread(sleep_measure, time)

## # A tibble: 83 x 4
##    name                      sleep_total sleep_rem sleep_cycle
##  * <chr>                           <dbl>     <dbl>       <dbl>
##  1 African elephant                 3.30     NA         NA    
##  2 African giant pouched rat        8.30      2.00      NA    
##  3 African striped mouse            8.70     NA         NA    
##  4 Arctic fox                      12.5      NA         NA    
##  5 Arctic ground squirrel          16.6      NA         NA    
##  6 Asian elephant                   3.90     NA         NA    
##  7 Baboon                           9.40      1.00       0.667
##  8 Big brown bat                   19.7       3.90       0.117
##  9 Bottle-nosed dolphin             5.20     NA         NA    
## 10 Brazilian tapir                  4.40      1.00       0.900
## # ... with 73 more rows
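In tidyr 1.0.0 and later, gather() and spread() still work but have newer counterparts, pivot_longer() and pivot_wider(). A minimal sketch of the same round trip, on a hypothetical toy tibble, could look like this:

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data
toy <- tibble(name = c("Cow", "Dog"),
              sleep_total = c(4, 10.1),
              sleep_rem = c(0.7, 2.9))

# Counterpart of gather(key = "sleep_measure", value = "time", -name)
long <- toy %>%
  pivot_longer(-name, names_to = "sleep_measure", values_to = "time")

# Counterpart of spread(sleep_measure, time)
wide <- long %>%
  pivot_wider(names_from = sleep_measure, values_from = time)
```

One visible difference: pivot_longer() keeps the rows of each animal together, whereas gather() stacks the gathered columns one after another.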

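Turning data into NA

The function na_if() turns particular values into NA. In most cases the command would probably be na_if("") (i.e. turn an empty string into NA), but in principle you can do anything. The sample code turns any value that is "omni" into NA.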

msleep %>%
  select(name:order) %>%
  na_if("omni")

## # A tibble: 83 x 4
##    name                       genus       vore  order       
##    <chr>                      <chr>       <chr> <chr>       
##  1 Cheetah                    Acinonyx    carni Carnivora   
##  2 Owl monkey                 Aotus       <NA>  Primates    
##  3 Mountain beaver            Aplodontia  herbi Rodentia    
##  4 Greater short-tailed shrew Blarina     <NA>  Soricomorpha
##  5 Cow                        Bos         herbi Artiodactyla
##  6 Three-toed sloth           Bradypus    herbi Pilosa      
##  7 Northern fur seal          Callorhinus carni Carnivora   
##  8 Vesper mouse               Calomys     <NA>  Rodentia    
##  9 Dog                        Canis       carni Carnivora   
## 10 Roe deer                   Capreolus   herbi Artiodactyla
## # ... with 73 more rows
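In recent dplyr versions (1.1.0 and later, as far as I can tell) na_if() expects a vector rather than a whole data frame, so the call above may fail there. A version that should work regardless is to apply na_if() per column inside mutate(). The sketch below uses a hypothetical toy tibble and shows both a single column and all character columns at once.

```r
library(dplyr)

# Hypothetical toy data with an "omni" value and an empty string
toy <- tibble(name = c("Owl monkey", "Cow", "Unknown"),
              vore = c("omni", "herbi", ""))

cleaned <- toy %>%
  # One column: turn "omni" into NA
  mutate(vore = na_if(vore, "omni")) %>%
  # All character columns: turn empty strings into NA
  mutate(across(where(is.character), ~ na_if(.x, "")))
```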
