Quick Intro to DataSet and R
Datasets are made of Dimensions and Metrics. Dimensions are attributes and hold any description/name in data and Metrics are quantitative measurements and hold numbers in data.
There are many ways of reading data. I will show some of them:
For reading CSV file data in R:
mydata <- read.csv(file = '/pathToTheFile/NameOfTheFile.csv', header = FALSE)
ormydata <- read.csv(file = '/pathToTheFile/NameOfTheFile.csv', header = TRUE)
For reading RDS file in R:
mydata <- readRDS("/pathtothefile/file.rds")
There are some ways of saving data:
If you want to extract your data frame to CSV file:
write.csv(name of your data frame, "Desire_name.csv", row.names = FALSE)
Or if you wanna save it as RDS file:
saveRDS(mydata, "file.rds")
For viewing your data frame:
View(mydata)
If you want to change any column name in your data you can use the following:
names(mydata)[column-number] <- paste("Your desire name")
For creating a column in data frame with null value:
mydata$Desire_name <- NA
If you need to use a library that you do not have it yet, you should install it first:
install.packages("name of the library")
After installation for using it make sure you include the library:
library(name of the library)
For instance, if you want to build a column with the condition/s you can use Mutate() function from “tidyverse/dplyr” library. But first, you need to install that if you have not already!
install.packages("tidyvers")
Then you can include them:
library(tidyverse)
library(dplyr)
If you want to get rid of one column (V1) for some reason in your dataset, you can use this:
mydata$V1 <- NULL
You can initialize a variable in R:
my_variable <- 0
If you have a data set that includes different data about the same person, for example, in different dates and you want to get all the information about that person in last date that has updated you can use filter function as below:
df <- mydata %>% group_by(name,family_name) %>% filter(last_updated == max(last_updated))
If you want to figure out the type of each column in your data frame:
sapply(mydata, class)
For changing the type of the column for instance numeric to integer, we can do the following:
mydata[sapply(mydata, is.numeric)] <- lapply(mydata[sapply(mydata, is.numeric)], as.integer)
Sometimes we need to ungroup the data frame before using “mutate” function like for ranking:
mydata <- mydata %>% ungroup()
Master Script
This comes in handy when you have multiple scripts that you want to run them together with one script instead of running them separately.
So, the first thing that you should know if you do not know already is how to open your project in R. The image below shows you how to get there. You can find this bottom on the top right of your R window and select the directory of your project. This makes everything easier to access your files in the master script.
After that just start a new R script and name it whatever you want like “master_script”.

Now, try to write functions or variables in your master_script that you use globally within the scripts that you want to run them together. This helps you to remove duplicated functions or simply prevents you from repeating some functions over and over again. With that, you will save more time and space for running the scripts.
Note: Make sure you enter the libraries. You will just enter the libraries that you will use in the master_script.
For calling the scripts from master_script you can simply do as following:
source("scripts/time_series.r")
Manipulating data in R
Here, we want to rank the metrics of our data based on two dimensions.
Now, you can just make your data smaller by selecting certain columns together and make a new data frame to work on it:
One thing that you can do with “dplyr” library is extracting certain columns from your data frame:
mydata %>% select(3,4,5,7,8)
*That numbers are the number of columns that you want to combine together. You can also use the name of the columns it depends on how do you want to do it.
Then, you can start ordering different matrics for your observation or understanding the data better.
For ordering numbers in one of the columns and put it in a variable:
variable_name <- order(mydata$column_name)
Or
mydata <- arrange(mydata$column)
Finally, if you want to rank a column V3 based on columns V1 and V2 do this:
mydata <- arrange(mydf, desc(V3), V1, V2) %>% mutate(rank = 1:n())
*desc(V3) is sorting column V3 from high to low. Without desc() you will sort it low to high.
Externally, if you want to split/subset dataset based on some conditions in rows value in column V1 for instance, you can do this:
subdata <- subset(mydata, V1!="")
Ever wonder how you can do thousands of seperators when your numbers are big to make your table look nicer?
If so, then here is one of your solutions. The “comma()’ from the “scale” library for doing so on a numeric vector.
mydata$column_big_numbers <- comma( mydata$column_big_numbers, format = "d" )
Using functions like “transform()” and “cumsum()” in R
So, let’s say you want to sum up all the matrics belong to one column and add the result to another column. You can use “cumsum()” as below:
mydata <- mydata %>% mutate( total_column = cumsum( column_numbers ) )
As you can see I am using mutate that makes the coding style nicer and easier and the library for it named “dplyr”.
Now, let’s make it more complicated. We have the “first_name” and the “last_name” columns of each person, we have the “updated_date” of each number, and we have the “column_number“. So, we need to sum up all the numbers that belong to a specific person in a specific timeline. We can do as following:
mydata <- mydata %>% group_by(first_name, last_name) %>%
filter(column_number != 0 & updated_date >= min(updated_date) ) %>%
mutate( total_column = cumsum( column_number ) )
Now, we are going to do some calculations based on some metrics in our data table by using “transform()”.
We want to multiply a number (like “100”) to one of the matrics and divide the multiplication into another matric then, put the calculated number into the new column.
You will be surprised to see that all of these can be handled in one line code in R by transform function:
mydata <- mydata %>% transform(., new_column = (column_numbers * 100) / another_column )
Scoring based on multiple metrics/conditions in R
For creating a score column for a database based on multiple columns in the same database, we need to go through all rows and check all the columns’ value, gather them, calculate, and score them.
So we can use for loop in R:
for(row in 1:nrow(mydata)) {...}
*With that, we can iterate through the rows in the table.
For accessing every value of each column through iteration:
variable_name <- mydata[row, "column_name"]
Then you can check the variable and manipulate it:
if(variable_name == "Yes") {score = score + 3} else
if(variable_name == "No") {score = score + 1}
If you just want to insert value to a column through each iteration you can just add this to you loop function body:
mydata[row, "column_name"] <- as.integer(score/2)
*as.integer(score/2) is optional you can put just score if you want.
Plotting /Table in R
We are going to show the data at a couple of very nice tables here. There are lots of libraries for this purpose in R such as Xtable, Kable, Formattable and etc.. I am going to introduce two of them.
The “KableExtra” library has a lot of good functionality for showing data on the table. Make sure you install and include the library in your code:
library(knitr)
library(kableExtra)
One thing that could potentially make the data that you want to show more constructed and beautiful is to get rid of long long decimals:
is.num <- sapply(mydata, is.numeric)
mydata[is.num] <- lapply(mydata[is.num], round, 4)
*here we are specifying no more than 4 decimal numbers.
Now, by studying this page, you can customize your data table as you wish. Below is an example:
kable(iris) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
font_size = 8, full_width = F) %>%
column_spec(4:5, bold = T) %>%
row_spec(1:3, bold = T, color = "white", background = "#D7261E") %>%
row_spec(3:5, bold = T, color = "white", background = "#FFA500") %>%
row_spec(5:7, bold = T, color = "white", background = "#FFD700")

Another way:
iris[1:20,1:5] %>%
mutate(
Sepal.Width = cell_spec(Sepal.Width, "html", color = ifelse(Sepal.Width > 3.5 , "red", "orange")) ) %>%
select( Sepal.Length, Sepal.Width, Petal.Length, Petal.Width,Species) %>%
kable(format = "html",escape = F) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
font_size = 9, full_width = F) %>%
column_spec(3:4, bold = T) %>%
row_spec(1:6, bold = T, color = "black", background = "#FFCCCB") %>%
row_spec(6:10, bold = T, color = "black", background = "#FED8B1")

Another library named “formattable” is very good too. It is easier to use this if you want to draw up and down arrows for due to some changes in your dataset. Make sure you include the library for it first!
library(formattable)
Here is an example:
formattable(iris, list(
Petal.Width =formatter(
"span",
style = x ~ ifelse(x >0.5,
style(font.weight = "bold"), NA)),
Sepal.Width = formatter(
"span",
style = x ~ style(color = ifelse(x < 3.5 , "red", "green")),
x ~ icontext(ifelse(x < 3.5, "arrow-down", "arrow-up"), x)))
)

For saving the table as “.png” you can use this function below:
export_formattable <- function(f, file, width = "100%", height = NULL,
background = "white", delay = 0.2)
{
w <- as.htmlwidget(f, width = width, height = height)
path <- html_print(w, background = background, viewer = NULL)
url <- paste0("file:///", gsub("\\", "/", normalizePath(path)))
webshot(url,
file = file,
selector = ".formattable_widget",
delay = delay)
}
export_formattable(pic, "pic.png")
Joining tables in R
There are seven different types of joins. Currently, “dplyr” library supports four types of mutating joins, two types of filtering joins, and a nesting join.
Mutating joins combine variables from the two data.frames:inner_join()
: return all rows from x
where there are matching values in y
, and all columns from x
and y
. If there are multiple matches between x
and y
, all combination of the matches are returned.left_join()
: return all rows from x
, and all columns from x
and y
. Rows in x
with no match in y
will have NA
values in the new columns. If there are multiple matches between x
and y
, all combinations of the matches are returned.right_join()
: return all rows from y
, and all columns from x
and y. Rows in y
with no match in x
will have NA
values in the new columns. If there are multiple matches between x
and y
, all combinations of the matches are returned.full_join()
: return all rows and all columns from both x
and y
. Where there are no matching values, returns NA
for the one missing.
Filtering joins keep cases from the left-hand data.frame:semi_join()
: return all rows from x
where there are matching values in y
, keeping just columns from x
. A semi join differs from an inner join because an inner join will return one row of x
for each matching row of y
, where a semi join will never duplicate rows of x
.anti_join()
: return all rows from x
where there are not matching values in y
, keeping just columns from x
.
Nesting joins create a list column of data.frames:nest_join()
: return all rows and all columns from x
. Adds a list column of tibbles. Each tibble contains all the rows from y
that match that row of x
. When there is no match, the list column is a 0-row tibble with the same column names and types as y
.
*Probably, this page and this page would be helpful for instructions.
As an example, I am going to do an inner_join() because I just want to add all columns of two different table that has common values:
df <- inner_join(mydata, mydf, by= c("name" = "name", "familu_name" = "family_name"))
*we are comparing data of two columns "name" and "family_name" in both tables to find out which one matches then inter all information in both tables for that row in a new table.