Basics of Selecting Data

In this article we will discuss the basics of data manipulation: slicing and selecting. As we have seen, most of the time we need to work with only a subset of the data for an analysis; we might, for example, need just the description of one variable. Such situations are common in the data analytics workflow, so selecting and slicing the right data efficiently is an essential skill for a data analyst. This article covers selecting data from the following data structures:

  1. Vector

  2. List

  3. Data Frame

Vector

The vector is the basic data structure in R. All elements of a vector must be of the same data type, meaning a vector can't hold a mix of values such as integers, strings, and Booleans. (If you combine mixed values with c(), R silently converts them to a single common type.) Vectors are one-dimensional.
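
To make the "same data type" rule concrete, here is a minimal sketch of what happens when mixed values are passed to c(). The variable names `mixed` and `nums_and_logical` are made up for this illustration.

```r
# Mixing types in c() does not raise an error;
# R converts every element to one common type.
mixed <- c(1, "a", TRUE)
class(mixed)   # "character"
mixed          # "1"    "a"    "TRUE"

# With only numbers and logicals, the common type is numeric:
# TRUE becomes 1 and FALSE becomes 0.
nums_and_logical <- c(1.5, TRUE, FALSE)
class(nums_and_logical)   # "numeric"
nums_and_logical          # 1.5 1.0 0.0
```

If you need to keep the original types side by side, a list is the better fit.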

To select a specific element of a vector, we simply put its index number inside square brackets. For instance, to select the 2nd element of vector X, the code is X[2]. Let me remind you one more time: unlike most other programming languages, indexing in R always starts at 1. We will explore the details in the demo below.

#Making some vectors

vec1 <- c(11,23,34,46,51)
vec2 <- c('this','is','a','vector')
vec3 <- c(TRUE, FALSE, FALSE, TRUE)

#Selecting a single value
vec1[2]
## [1] 23
#Selecting multiple values
vec1[c(2,3)] #in this example we are selecting the 2nd and 3rd elements of vec1
## [1] 23 34
#Selecting a range of indices
vec1[1:5] #we selected everything from index 1 to 5, i.e. the whole vector
## [1] 11 23 34 46 51
#Excluding elements with a negative index
vec2[-1]  # -1 returns everything except the first element
## [1] "is"     "a"      "vector"
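
The same square brackets also accept a logical vector, which is the idea behind the condition-based filter used on the data frame later in this article. The following is a small sketch that reuses vec1 from the demo; the names `over_30`, `without_first_two`, and `last_value` are my own and only serve the illustration.

```r
vec1 <- c(11, 23, 34, 46, 51)

# A comparison returns TRUE/FALSE for each element;
# putting it inside [] keeps only the TRUE positions.
over_30 <- vec1[vec1 > 30]
over_30              # 34 46 51

# Several negative indices drop several elements at once.
without_first_two <- vec1[-c(1, 2)]
without_first_two    # 34 46 51

# length() gives the index of the last element, since indexing starts at 1.
last_value <- vec1[length(vec1)]
last_value           # 51
```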

List

Lists differ from vectors because they can hold multiple data types. Lists are created with list() instead of the c() used for vectors. A list can also be recursive, meaning a list can contain other lists. I know that’s confusing, but I will explain it in the demo below:

lst1 <- list(100, 101, 102, 103, 104, 105)
lst2 <- list(155, 'program', TRUE, 3.25, 200)
lst3 <- list(lst1, lst2 , 534, 546, TRUE, 1.98) #the complicated list

#selection in a plain list
lst1[2] #similar to a vector, but the result is itself a list of length one
## [[1]]
## [1] 101
lst2[3:5]
## [[1]]
## [1] TRUE
## 
## [[2]]
## [1] 3.25
## 
## [[3]]
## [1] 200
#selection in the complicated one
lst3[[2]][2]  
## [[1]]
## [1] "program"
#the double bracket first extracts the 2nd element of lst3, 
#which is lst2; then we select the 2nd element of lst2.

lst3[[1]][2:5] 
## [[1]]
## [1] 101
## 
## [[2]]
## [1] 102
## 
## [[3]]
## [1] 103
## 
## [[4]]
## [1] 104
#a similar example: here we select the 2nd through 5th 
#elements of lst1, which is the first element of lst3.
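
The outputs above all start with [[1]] because a single bracket on a list returns a smaller list, not the value itself. As a minimal sketch of the difference between [ ] and [[ ]], the example below reuses lst2 from the demo; `sub_list`, `element`, and `nested` are names I made up for the illustration.

```r
lst2 <- list(155, 'program', TRUE, 3.25, 200)

# Single bracket: the result is still a list (of length one).
sub_list <- lst2[2]
class(sub_list)    # "list"

# Double bracket: the result is the element itself.
element <- lst2[[2]]
class(element)     # "character"
element            # "program"

# In a nested list, chaining double brackets reaches the bare value.
nested <- list(lst2, 534)
nested[[1]][[2]]   # "program"
```

Use [[ ]] when you want to compute with the value, and [ ] when you want to keep a list.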

Data Frame

Data frames are the structure we most often see in analysis projects. Think of them as an Excel sheet, with rows and columns. Knowing how to select the right data from a data frame plays a vital role in data analysis.

data <- iris #copying the built-in iris data set into the 'data' variable
head(data) #looking at the first few rows of the data set
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
#QUESTION 1: How do you select the value in the 2nd row and 3rd column?
row_number <- 2
column_number <- 3

data[row_number, column_number] #row first, then column; 1.4 is our output.
## [1] 1.4
#QUESTION 2: How do you select all the rows of the 2nd column?
data$Sepal.Width #the '$' sign is the easiest way to select a column by name.
##   [1] 3.5 3.0 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 3.7 3.4 3.0 3.0 4.0 4.4 3.9 3.5
##  [19] 3.8 3.8 3.4 3.7 3.6 3.3 3.4 3.0 3.4 3.5 3.4 3.2 3.1 3.4 4.1 4.2 3.1 3.2
##  [37] 3.5 3.6 3.0 3.4 3.5 2.3 3.2 3.5 3.8 3.0 3.8 3.2 3.7 3.3 3.2 3.2 3.1 2.3
##  [55] 2.8 2.8 3.3 2.4 2.9 2.7 2.0 3.0 2.2 2.9 2.9 3.1 3.0 2.7 2.2 2.5 3.2 2.8
##  [73] 2.5 2.8 2.9 3.0 2.8 3.0 2.9 2.6 2.4 2.4 2.7 2.7 3.0 3.4 3.1 2.3 3.0 2.5
##  [91] 2.6 3.0 2.6 2.3 2.7 3.0 2.9 2.9 2.5 2.8 3.3 2.7 3.0 2.9 3.0 3.0 2.5 2.9
## [109] 2.5 3.6 3.2 2.7 3.0 2.5 2.8 3.2 3.0 3.8 2.6 2.2 3.2 2.8 2.8 2.7 3.3 3.2
## [127] 2.8 3.0 2.8 3.0 2.8 3.8 2.8 2.8 2.6 3.0 3.4 3.1 3.0 3.1 3.1 3.1 2.7 3.2
## [145] 3.3 3.0 2.5 3.0 3.4 3.0
#QUESTION 3: How do you select multiple columns?
data[1:10,c(2,3)] #by index: pass a vector of the needed column numbers.
##    Sepal.Width Petal.Length
## 1          3.5          1.4
## 2          3.0          1.4
## 3          3.2          1.3
## 4          3.1          1.5
## 5          3.6          1.4
## 6          3.9          1.7
## 7          3.4          1.4
## 8          3.4          1.5
## 9          2.9          1.4
## 10         3.1          1.5
data[1:10,c('Sepal.Width','Sepal.Length')] #by using column names
##    Sepal.Width Sepal.Length
## 1          3.5          5.1
## 2          3.0          4.9
## 3          3.2          4.7
## 4          3.1          4.6
## 5          3.6          5.0
## 6          3.9          5.4
## 7          3.4          4.6
## 8          3.4          5.0
## 9          2.9          4.4
## 10         3.1          4.9
#QUESTION 4: How do you select the first 10 rows of the 2nd and 3rd columns?
data[1:10,c(2,3)]
##    Sepal.Width Petal.Length
## 1          3.5          1.4
## 2          3.0          1.4
## 3          3.2          1.3
## 4          3.1          1.5
## 5          3.6          1.4
## 6          3.9          1.7
## 7          3.4          1.4
## 8          3.4          1.5
## 9          2.9          1.4
## 10         3.1          1.5
#QUESTION 5: Select all the rows where 'Sepal.Length' is greater than 7
data[data$Sepal.Length>7,] 
##     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
## 103          7.1         3.0          5.9         2.1 virginica
## 106          7.6         3.0          6.6         2.1 virginica
## 108          7.3         2.9          6.3         1.8 virginica
## 110          7.2         3.6          6.1         2.5 virginica
## 118          7.7         3.8          6.7         2.2 virginica
## 119          7.7         2.6          6.9         2.3 virginica
## 123          7.7         2.8          6.7         2.0 virginica
## 126          7.2         3.2          6.0         1.8 virginica
## 130          7.2         3.0          5.8         1.6 virginica
## 131          7.4         2.8          6.1         1.9 virginica
## 132          7.9         3.8          6.4         2.0 virginica
## 136          7.7         3.0          6.1         2.3 virginica
#in the row position we put our condition, and we left the column position 
#empty because we needed all the columns
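
The row condition and the column selection can be combined in one call. The sketch below is one way to do that, with base R's subset() shown as an alternative spelling of the same filter; the names `long_sepals`, `same_rows`, and `lengths_only` are hypothetical and chosen just for this example.

```r
data <- iris

# Condition in the row position, column names in the column position.
long_sepals <- data[data$Sepal.Length > 7, c("Sepal.Length", "Species")]
nrow(long_sepals)   # 12, the same rows as in QUESTION 5

# subset() expresses the same filter without repeating 'data$'.
same_rows <- subset(data, Sepal.Length > 7, select = c(Sepal.Length, Species))

# Asking for a single column returns a plain vector, not a data frame.
lengths_only <- data[data$Sepal.Length > 7, "Sepal.Length"]
lengths_only        # 7.1 7.6 7.3 7.2 7.7 7.7 7.7 7.2 7.2 7.4 7.9 7.7
```

The bracket form works anywhere, while subset() tends to read more naturally in interactive work.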

In this article, I showed just the basics of selecting data in R. These methods can be combined and extended to subset data in more complex ways. As in ‘QUESTION 5’, we can use filter conditions to select many different slices of the data. However, this article is intended for people who are getting started with R. I will cover more advanced methods in future posts.

Thank You

Sandesh Sharma
A Data Advocate