In this post I will introduce data frames in R. All credit for the material in this and future posts goes to DataCamp.

Data frames

Data frames are basically created from vectors. A data frame consists of rows (i.e., observations) and columns (i.e., variables or features), and each column can hold a different type of data. We can create a data frame in R using data.frame(). Let’s create three vectors, use them to build a data frame, and assign it to a variable named cash as follows:

company <- c("A", "A", "A", "B", "B", "B", "B")
cash_flow <- c(1000, 4000, 550, 1500, 1100, 750, 6000)
year <- c(1, 3, 4, 1, 2, 4, 5)
cash <- data.frame(
    company = company,
    cash_flow = cash_flow, 
    year = year)
cash
##   company cash_flow year
## 1       A      1000    1
## 2       A      4000    3
## 3       A       550    4
## 4       B      1500    1
## 5       B      1100    2
## 6       B       750    4
## 7       B      6000    5

We can look at the first few rows, the last few rows, or the structure of the data frame using head(), tail(), and str(), respectively.

head(cash, n=2)
##   company cash_flow year
## 1       A      1000    1
## 2       A      4000    3
tail(cash, n=2)
##   company cash_flow year
## 6       B       750    4
## 7       B      6000    5
str(cash)
## 'data.frame':    7 obs. of  3 variables:
##  $ company  : Factor w/ 2 levels "A","B": 1 1 1 2 2 2 2
##  $ cash_flow: num  1000 4000 550 1500 1100 750 6000
##  $ year     : num  1 3 4 1 2 4 5

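str() is useful when you want to check the data type of each column in your data frame. Note that company appears as a Factor here because this output comes from a version of R before 4.0.0; from R 4.0.0 onward, data.frame() keeps strings as character (chr) columns unless you pass stringsAsFactors = TRUE.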
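
If you only need the dimensions or the column types rather than the full str() printout, a minimal sketch along these lines may help. It rebuilds the same cash data frame so that it runs on its own; stringsAsFactors = TRUE is an assumption added here so that company is a factor regardless of R version.

```r
# Rebuild the example data frame so this snippet runs on its own
cash <- data.frame(
    company = c("A", "A", "A", "B", "B", "B", "B"),
    cash_flow = c(1000, 4000, 550, 1500, 1100, 750, 6000),
    year = c(1, 3, 4, 1, 2, 4, 5),
    stringsAsFactors = TRUE  # assumption: force factors on any R version
)

dims <- dim(cash)                  # number of rows, then number of columns
col_names <- names(cash)           # column names as a character vector
col_classes <- sapply(cash, class) # class of each column, named by column

dims
col_names
col_classes
```

sapply(cash, class) returns a named character vector, which is handy when you want to check column types programmatically instead of reading them off the str() output.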

Just like with vectors and matrices, we can extract specific rows or columns of a data frame using [row index, column index or column name]. Let’s look at how this works with the cash data frame:

# third row and 2nd col
cash[3,2]  
## [1] 550

# fifth row and cash_flow col - notice we used the column name here
cash[5,"cash_flow"] 
## [1] 1100
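
The same bracket notation can also return whole rows or columns when one of the two positions is left empty. The following is a small sketch of that, rebuilding the cash data frame so it is self-contained; the variable names third_row, flows, and picked are just illustrative.

```r
# Rebuild the example data frame so this snippet runs on its own
cash <- data.frame(
    company = c("A", "A", "A", "B", "B", "B", "B"),
    cash_flow = c(1000, 4000, 550, 1500, 1100, 750, 6000),
    year = c(1, 3, 4, 1, 2, 4, 5)
)

# Leave the column position empty to get every column of row 3
third_row <- cash[3, ]

# Leave the row position empty to get every row of one column (a plain vector)
flows <- cash[, "cash_flow"]

# Vectors of indices or names select several rows and columns at once
picked <- cash[c(1, 4), c("company", "year")]

third_row
flows
picked
```

Selecting a single column this way gives back a vector, while selecting rows or several columns gives back a data frame.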

To extract a specific column, there is a shortcut you can use: $

# extract "year" column using "$"
cash$year
## [1] 1 3 4 1 2 4 5
cash$cash_flow * 2 # returns the doubled values; cash itself is not modified
## [1]  2000  8000  1100  3000  2200  1500 12000
cash
##   company cash_flow year
## 1       A      1000    1
## 2       A      4000    3
## 3       A       550    4
## 4       B      1500    1
## 5       B      1100    2
## 6       B       750    4
## 7       B      6000    5
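
As a small sketch of the point above, the result of an operation on a column is a new vector, and the data frame only changes if you assign something back into it. The example also shows [[ ]], which does the same job as $ when the column name is stored in a variable; col and doubled are hypothetical names used only for illustration.

```r
# Rebuild the example data frame so this snippet runs on its own
cash <- data.frame(
    company = c("A", "A", "A", "B", "B", "B", "B"),
    cash_flow = c(1000, 4000, 550, 1500, 1100, 750, 6000),
    year = c(1, 3, 4, 1, 2, 4, 5)
)

# [[ ]] extracts a column by name, like $, but accepts a variable
col <- "year"                      # hypothetical variable holding a column name
same <- identical(cash[[col]], cash$year)

# The arithmetic result is stored separately; cash keeps its original values
doubled <- cash$cash_flow * 2

same
doubled
cash$cash_flow
```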

We can also subset a data frame using subset(), passing the data frame as the first argument and a filter condition as the second. In the examples below, we subset the data frame to only the rows for company A, and then to only the rows with cash flows due in year 1:

subset(cash, company == "A") # cash flow for only company A
##   company cash_flow year
## 1       A      1000    1
## 2       A      4000    3
## 3       A       550    4
subset(cash, year == 1) # rows with cash flows due in year 1
##   company cash_flow year
## 1       A      1000    1
## 4       B      1500    1
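
Filters can also be combined, and subset() has a select argument for keeping only some columns. The sketch below illustrates both, along with the equivalent bracket-based logical indexing; the variable names b_late, large, and b_late_idx are just illustrative.

```r
# Rebuild the example data frame so this snippet runs on its own
cash <- data.frame(
    company = c("A", "A", "A", "B", "B", "B", "B"),
    cash_flow = c(1000, 4000, 550, 1500, 1100, 750, 6000),
    year = c(1, 3, 4, 1, 2, 4, 5)
)

# Combine conditions with & (and) or | (or)
b_late <- subset(cash, company == "B" & year >= 4)

# select keeps only the named columns in the result
large <- subset(cash, cash_flow > 1000, select = c(company, cash_flow))

# The same row filter written with logical indexing inside [ ]
b_late_idx <- cash[cash$company == "B" & cash$year >= 4, ]

b_late
large
b_late_idx
```

Inside subset() the column names can be used directly, whereas the bracket form needs the cash$ prefix on each column.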

We can add a new column by assigning to data_frame$new_column. Let’s add quarter_cash and double_year columns to our data frame by transforming existing columns, and then delete the company column by assigning NULL to it:

# Quarter cash flow scenario
cash$quarter_cash <- cash$cash_flow * .25

# Double year scenario
cash$double_year <- cash$year * 2
cash
##   company cash_flow year quarter_cash double_year
## 1       A      1000    1        250.0           2
## 2       A      4000    3       1000.0           6
## 3       A       550    4        137.5           8
## 4       B      1500    1        375.0           2
## 5       B      1100    2        275.0           4
## 6       B       750    4        187.5           8
## 7       B      6000    5       1500.0          10
cash$company <- NULL # delete column
cash
##   cash_flow year quarter_cash double_year
## 1      1000    1        250.0           2
## 2      4000    3       1000.0           6
## 3       550    4        137.5           8
## 4      1500    1        375.0           2
## 5      1100    2        275.0           4
## 6       750    4        187.5           8
## 7      6000    5       1500.0          10
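
As an alternative sketch, base R's transform() can add several derived columns in one call. This is not the approach used above, just another way to reach the same result; cash2 is a hypothetical name chosen so the original cash is left untouched.

```r
# Rebuild the example data frame so this snippet runs on its own
cash <- data.frame(
    company = c("A", "A", "A", "B", "B", "B", "B"),
    cash_flow = c(1000, 4000, 550, 1500, 1100, 750, 6000),
    year = c(1, 3, 4, 1, 2, 4, 5)
)

# transform() returns a copy of cash with the new columns appended
cash2 <- transform(cash,
    quarter_cash = cash_flow * 0.25,
    double_year = year * 2
)

# Deleting a column works the same way as before
cash2$company <- NULL

names(cash2)
cash2
```

Because transform() returns a new data frame, nothing changes until the result is assigned to a variable.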

There is a lot more to data frames that I haven’t covered here, as the aim of these posts is to keep things short. Hopefully this will at least get you started with data frames in R.