How to Perform an Exploratory Data Analysis using R?
In this article we will perform an exploratory data analysis (EDA) on the installed power capacity of India. The dataset comes from data.gov.in and is available at https://data.gov.in/catalog/installed-capacity-power-generation.
The data was updated on 20th Nov 2018. We will analyse the installation pattern.
When performing an EDA, make sure you start with a short description of the analysis you are going to perform, as I have done above.
We need a few essentials for the analysis: dplyr, which is used for data manipulation, and ggplot2, which is used for data visualization. If you have not installed them yet, please install them before you run any of the code below.
You need a good understanding of both packages to follow the code, so read their documentation first if they are new to you:
1. dplyr: https://cran.r-project.org/web/packages/dplyr/vignettes/dplyr.html
2. ggplot2: http://r-statistics.co/ggplot2-Tutorial-With-R.html
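If you are not sure whether the two packages are already installed, something like the following sketch can check for them, install whatever is missing and then load both. The CRAN mirror URL is my assumption; use whichever mirror you prefer.

```r
# Packages used in this article
pkgs <- c("dplyr", "ggplot2")

# Find the packages that are not installed yet
missing_pkgs <- pkgs[!vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)]

# Install only the missing ones (the mirror below is an assumption)
if (length(missing_pkgs) > 0) {
  install.packages(missing_pkgs, repos = "https://cloud.r-project.org")
}

# Load both packages for the rest of the analysis
library(dplyr)
library(ggplot2)
```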
Loading the Data to R
I am going to use read.csv to load the dataset into R and store it as df. To know more about it, type ?read.csv in your console.
df <- read.csv("MOP_installed_capacity_sector_mode_wise.csv", header = TRUE, sep = ",")
Look and feel of the data
When you first load the data, make sure you check its dimensions, which columns it contains and so on. Here I am only looking at head(df). You can also try dim(), summary() etc.
##         Date                     State       Sector    Mode
## 1 26-11-2018 Andaman & Nicobar Islands STATE SECTOR Thermal
## 2 26-11-2018 Andaman & Nicobar Islands STATE SECTOR Nuclear
## 3 26-11-2018 Andaman & Nicobar Islands STATE SECTOR   Hydro
## 4 26-11-2018 Andaman & Nicobar Islands STATE SECTOR     RES
## 5 26-11-2018 Andaman & Nicobar Islands   PVT SECTOR Thermal
## 6 26-11-2018 Andaman & Nicobar Islands   PVT SECTOR Nuclear
##   Installed.Capacity
## 1             40.048
## 2              0.000
## 3              0.000
## 4              5.250
## 5              0.000
## 6              0.000
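Besides head(), a few other base R functions give a quick feel for the data. The sketch below is only an illustration: so that it runs on its own, it falls back to a small stand-in data frame built from the six rows printed above whenever the full CSV has not been loaded as df.

```r
# Stand-in rows copied from the head() output in this article. They are used
# only when the full CSV has not been loaded as `df`. The is.data.frame()
# check is needed because `df` is also the name of a function in base R.
if (!exists("df") || !is.data.frame(df)) {
  df <- data.frame(
    Date = rep("26-11-2018", 6),
    State = rep("Andaman & Nicobar Islands", 6),
    Sector = c(rep("STATE SECTOR", 4), rep("PVT SECTOR", 2)),
    Mode = c("Thermal", "Nuclear", "Hydro", "RES", "Thermal", "Nuclear"),
    Installed.Capacity = c(40.048, 0, 0, 5.250, 0, 0)
  )
}

dim(df)      # number of rows and columns
names(df)    # column names
str(df)      # type of each column and its first few values
summary(df)  # basic summary of every column
```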
The data contains the columns Date, State, Sector, Mode and Installed.Capacity. To perform any EDA you need some basic understanding of the domain. In this data, Date is the day on which I downloaded the data. The State column contains all the states where the installation happened. Sector contains the organizations taking part in generating power: Central, State and Private. Mode contains all the types of power generation technique. Installed.Capacity is the capacity added to the sector.
To perform an EDA you need to brainstorm ideas and ask yourself what insights you can get from this data. Then you start manipulating the data until it gives up those insights.
Now let us ask a question and perform the analysis.
Before doing any analysis we will filter the data and keep only the rows with a non-zero installed capacity. The reason for dropping the zero rows is to make the visualizations easier to read, since they will not be cluttered with unwanted states etc. The filtered data is called newdf in the code that follows.
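The filtering step itself is not shown in the article, so here is a minimal sketch of how newdf could be created with dplyr's filter(). To keep it runnable on its own, it falls back to a stand-in data frame built from the six rows of the head() output whenever the full CSV has not been loaded as df.

```r
library(dplyr)

# Stand-in rows copied from the head() output in this article. They are used
# only when the full CSV has not been loaded as `df`. The is.data.frame()
# check is needed because `df` is also the name of a function in base R.
if (!exists("df") || !is.data.frame(df)) {
  df <- data.frame(
    Date = rep("26-11-2018", 6),
    State = rep("Andaman & Nicobar Islands", 6),
    Sector = c(rep("STATE SECTOR", 4), rep("PVT SECTOR", 2)),
    Mode = c("Thermal", "Nuclear", "Hydro", "RES", "Thermal", "Nuclear"),
    Installed.Capacity = c(40.048, 0, 0, 5.250, 0, 0)
  )
}

# Keep only the rows with a non-zero installed capacity
newdf <- df %>%
  filter(Installed.Capacity != 0)
```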
Idea 1: Which state received the maximum installation? aka State Vs Capacity Installed
To answer this, group the data by State and sum up the installed capacity. The code is given below.
# Total installed capacity per state
d <- newdf %>%
  group_by(State) %>%
  summarize(Capacity_Installed = sum(Installed.Capacity))

# Bar chart, with the state names rotated so that they stay readable
ggplot(d, aes(x = State, y = Capacity_Installed)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
Now we can see that Maharashtra, Gujarat and Tamil Nadu have the highest total installed capacity in MW.
Idea 2: Mode Vs Capacity Installed?
The next question that arises: how much capacity does each mode of electricity generation contribute, and in what proportion?
# Total installed capacity per mode of generation
d <- newdf %>%
  group_by(Mode) %>%
  summarize(Capacity_Installed = sum(Installed.Capacity))

# Bar chart of the capacity per mode
ggplot(d, aes(x = Mode, y = Capacity_Installed)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
Please Note: RES = Renewable Energy Source
Looking at the graph, we can see that Thermal contributes the highest share compared to the other modes of generation.
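The bar chart shows totals, not proportions. One possible way to get the share of each mode as a percentage is sketched below. The names share, Share_Percent and p_share are my own, and the stand-in newdf (the two non-zero rows of the head() output) is used only if the filtered data does not exist yet.

```r
library(dplyr)
library(ggplot2)

# Stand-in rows (the non-zero rows of the head() output shown earlier),
# used only when `newdf` has not been created from the full dataset.
if (!exists("newdf")) {
  newdf <- data.frame(
    Date = rep("26-11-2018", 2),
    State = rep("Andaman & Nicobar Islands", 2),
    Sector = rep("STATE SECTOR", 2),
    Mode = c("Thermal", "RES"),
    Installed.Capacity = c(40.048, 5.250)
  )
}

# Capacity per mode, plus its share of the overall total in percent
share <- newdf %>%
  group_by(Mode) %>%
  summarize(Capacity_Installed = sum(Installed.Capacity)) %>%
  mutate(Share_Percent = 100 * Capacity_Installed / sum(Capacity_Installed))

# Bar chart of the shares, with the percentage printed above each bar
p_share <- ggplot(share, aes(x = Mode, y = Share_Percent)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = sprintf("%.1f%%", Share_Percent)), vjust = -0.3)
print(p_share)
```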
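Idea 3: Which Mode of electricity generation is most widely chosen?
Now we will find which mode of electricity generation is most widely adopted, by counting how many rows of newdf (state and sector entries with a non-zero capacity) use each mode.
# Number of rows per mode of generation
d <- newdf %>%
  group_by(Mode) %>%
  tally()

# Bar chart of the counts (tally() stores them in the column n)
ggplot(d, aes(x = Mode, y = n)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))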
I have shown just a couple of possible ways to perform EDA and generate new ideas. The possibilities are endless, so try out various combinations and visualization techniques.
A few examples are changing the colours in the graphs, adding percentages etc. I have not touched the Sector column in the dataset. Try it yourself and let me know the insights in the comments below.
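As a possible starting point for the Sector column, the sketch below sums the capacity for every Sector and Mode combination and colours the bars by Mode. It is only one of many ways to do it. The names by_sector and p_sector are my own, and the stand-in newdf (the two non-zero rows of the head() output) is used only if the filtered data does not exist yet.

```r
library(dplyr)
library(ggplot2)

# Stand-in rows (the non-zero rows of the head() output shown earlier),
# used only when `newdf` has not been created from the full dataset.
if (!exists("newdf")) {
  newdf <- data.frame(
    Date = rep("26-11-2018", 2),
    State = rep("Andaman & Nicobar Islands", 2),
    Sector = rep("STATE SECTOR", 2),
    Mode = c("Thermal", "RES"),
    Installed.Capacity = c(40.048, 5.250)
  )
}

# Total installed capacity for every Sector and Mode combination
by_sector <- newdf %>%
  group_by(Sector, Mode) %>%
  summarize(Capacity_Installed = sum(Installed.Capacity), .groups = "drop")

# Stacked bars: one bar per sector, coloured by mode of generation
p_sector <- ggplot(by_sector, aes(x = Sector, y = Capacity_Installed, fill = Mode)) +
  geom_bar(stat = "identity")
print(p_sector)
```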
Happy learning and all the best!