dplyr select(): Select one or more variables from a dataframe

dplyr select(): How to Select Columns?

dplyr, an R package that is part of the tidyverse, provides a great set of tools for manipulating tabular datasets. It has a set of core functions for “data munging”. Here is the list of core functions from dplyr:

  • select() picks variables based on their names.
  • mutate() adds new variables that are functions of existing variables.
  • filter() picks cases based on their values.
  • summarise() reduces multiple values down to a single summary.
  • arrange() changes the ordering of the rows.
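
To make the list above concrete, here is a minimal sketch that chains the five verbs on a tiny made-up tibble. The `toy` data frame and its column names are assumptions invented for illustration; they are not part of the penguins data used in the rest of this post.

```r
library(dplyr)

# Hypothetical toy data, made up for this sketch
toy <- tibble(
  name   = c("a", "b", "c", "d"),
  group  = c("x", "x", "y", "y"),
  weight = c(10, 20, 30, 40)
)

result <- toy %>%
  select(group, weight) %>%               # keep two columns by name
  mutate(weight_kg = weight / 1000) %>%   # add a column derived from an existing one
  filter(weight > 10) %>%                 # keep rows that match a condition
  arrange(desc(weight))                   # reorder rows, largest weight first

summary_tbl <- result %>%
  summarise(mean_weight = mean(weight))   # collapse to a single summary row
```

Each verb takes a data frame and returns a data frame, which is why they chain together so naturally.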

In this tidyverse tutorial, we will learn how to use dplyr’s select() function to pick columns (variables) from a dataframe by their names. We will start with how to select a single variable by its name, and then see examples of selecting multiple columns by their names.

Let us get started by loading tidyverse.

library("tidyverse")

For this tutorial, we will use the Palmer Penguins dataset collated by Allison Horst to illustrate how to use dplyr’s select() function to select variables by their names. Let us load the data from the cmdlinetips.com GitHub page.

path2data <- "https://raw.githubusercontent.com/cmdlinetips/data/master/palmer_penguins.csv"
penguins <- readr::read_csv(path2data)

We can take a quick look at the data using the glimpse() function.

glimpse(penguins)

Rows: 344
Columns: 7
$ species           <chr> "Adelie", "Adelie", "Adelie", "Adelie", "Adelie", "Adelie", …
$ island            <chr> "Torgersen", "Torgersen", "Torgersen", "Torgersen", "Torgers…
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, 42.0, 37…
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, 20.2, 17…
$ flipper_length_mm <dbl> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186, 180, 1…
$ body_mass_g       <dbl> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, 4250, 33…
$ sex               <chr> "male", "female", "female", NA, "female", "male", "female", …

How To Select A Variable by name with dplyr select()?

We can select a variable from a data frame with the select() function in two ways. The first way is to pass the dataframe and the name of the column we want to select as arguments to dplyr’s select() function.

In the example below, we select the species column from the penguins data frame. One big advantage of dplyr/tidyverse is the ability to specify variable names without quotes.

select(penguins, species)

The result is a tibble, a type of dataframe, containing just the one column we selected.

## # A tibble: 344 x 1
##    species
##    <chr>  
##  1 Adelie 
##  2 Adelie 
##  3 Adelie 
##  4 Adelie 
##  5 Adelie 
##  6 Adelie 
##  7 Adelie 
##  8 Adelie 
##  9 Adelie 
## 10 Adelie 
## # … with 334 more rows
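
As a small aside on the unquoted names mentioned above: select() also accepts a quoted name, and when the column names are stored in a character vector the all_of() helper can be used. The sketch below uses a made-up two-row tibble called `toy` (an assumption for illustration) so that it runs without downloading anything.

```r
library(dplyr)

# Hypothetical toy data, made up for this sketch
toy <- tibble(
  species = c("Adelie", "Gentoo"),
  island  = c("Torgersen", "Biscoe")
)

by_bare   <- select(toy, species)     # unquoted column name
by_string <- select(toy, "species")   # quoted column name gives the same result

wanted    <- c("species")             # column names held in a character vector
by_vector <- select(toy, all_of(wanted))
```

In all three cases the result is still a one-column tibble rather than a plain vector.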

The second way to select a column from a dataframe is to use the pipe operator %>%, which is available as part of tidyverse.

Here we first specify the name of the dataframe we want to work with, then use the pipe operator %>% followed by the select() function with the name of the column we want to select.

penguins %>% select(species)

We get the same single-column tibble as before.

## # A tibble: 344 x 1
##    species
##    <chr>  
##  1 Adelie 
##  2 Adelie 
##  3 Adelie 
##  4 Adelie 
##  5 Adelie 
##  6 Adelie 
##  7 Adelie 
##  8 Adelie 
##  9 Adelie 
## 10 Adelie 
## # … with 334 more rows

The pipe operator is extremely useful when we have further downstream operations after selecting. Therefore, in the examples below we will use the pipe operator to select multiple columns.
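
Here is a minimal sketch of what such downstream operations might look like, using a small made-up tibble named `toy` (an assumption for illustration, not the penguins data). After selecting a column, we pipe the result straight into another dplyr function.

```r
library(dplyr)

# Hypothetical toy data, made up for this sketch
toy <- tibble(
  species     = c("Adelie", "Adelie", "Gentoo", "Chinstrap"),
  body_mass_g = c(3750, 3800, 5000, 3500)
)

# Select a column, then keep only its unique values
unique_species <- toy %>%
  select(species) %>%
  distinct()

# Select a column, then count how often each value occurs
species_counts <- toy %>%
  select(species) %>%
  count(species)
```

Without the pipe, the same steps would have to be written as nested calls such as distinct(select(toy, species)), which gets harder to read as more steps are added.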

How To Select Two Variables by name with dplyr select()?

If we want to select two variables/columns from a dataframe, we specify the two names as arguments. In this example we select species and island columns from the dataframe using the pipe operator.

penguins %>% select(species, island)
## # A tibble: 344 x 2
##    species island   
##    <chr>   <chr>    
##  1 Adelie  Torgersen
##  2 Adelie  Torgersen
##  3 Adelie  Torgersen
##  4 Adelie  Torgersen
##  5 Adelie  Torgersen
##  6 Adelie  Torgersen
##  7 Adelie  Torgersen
##  8 Adelie  Torgersen
##  9 Adelie  Torgersen
## 10 Adelie  Torgersen
## # … with 334 more rows

How To Select Multiple Variables by name with dplyr select()?

Similarly, if we have more variables to select, we specify their names as arguments to dplyr’s select() function, as shown below.

penguins %>% select(species, body_mass_g, sex)
## # A tibble: 344 x 3
##    species body_mass_g sex   
##    <chr>         <dbl> <chr> 
##  1 Adelie         3750 male  
##  2 Adelie         3800 female
##  3 Adelie         3250 female
##  4 Adelie           NA <NA>  
##  5 Adelie         3450 female
##  6 Adelie         3650 male  
##  7 Adelie         3625 female
##  8 Adelie         4675 male  
##  9 Adelie         3475 <NA>  
## 10 Adelie         4250 <NA>  
## # … with 334 more rows
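
One detail worth knowing when selecting several columns: the columns in the result appear in the order in which they are listed in select(), not in their original order. The sketch below illustrates this with a made-up tibble named `toy` (an assumption for illustration).

```r
library(dplyr)

# Hypothetical toy data, made up for this sketch
toy <- tibble(
  species     = c("Adelie", "Gentoo"),
  body_mass_g = c(3750, 5000),
  sex         = c("male", "female")
)

# List the columns in a different order than they appear in toy
reordered <- toy %>% select(sex, species, body_mass_g)

names(reordered)
```

This makes select() a handy way to reorder columns as well as to pick them.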

The post dplyr select(): Select one or more variables from a dataframe appeared first on Python and R Tips.