The post A Beginner’s Guide to Channel Attribution Modeling in Marketing (using Markov Chains, with a case study in R) appeared first on Perceptive Analytics.

In a typical ‘from think to buy’ customer journey, a customer goes through multiple touch points before zeroing in on the final product to buy. This is even more prominent in e-commerce sales, where it is relatively easy to track the different touch points a customer encounters before making the final purchase.

As marketing moves more and more towards the consumer driven side of things, identifying the right channels to target customers has become critical for companies. This helps companies optimise their marketing spend and target the right customers in the right places.

More often than not, companies invest in the last channel a customer encounters before making the final purchase. However, this may not always be the right approach: there are multiple preceding channels that eventually drive the customer’s conversion. The underlying concept for studying this behavior is known as ‘multi-channel attribution modeling.’

In this article, we look at what channel attribution is and how it ties into the concept of Markov chains. We’ll also take a case study of an e-commerce company to understand how this concept works, both theoretically and practically (using R).

- What is Channel Attribution?
- Markov Chains
- Removal Effect

- Case Study of an E-Commerce Company
- Implementation in R

Google Analytics offers a standard set of rules for attribution modeling. As per Google, *“An attribution model is the rule, or set of rules, that determines how credit for sales and conversions is assigned to touchpoints in conversion paths. For example, the Last Interaction model in Analytics assigns 100% credit to the final touchpoints (i.e., clicks) that immediately precede sales or conversions. In contrast, the First Interaction model assigns 100% credit to touchpoints that initiate conversion paths.”*

We will see the last interaction model and first interaction model later in this article. Before that, let’s take a small example and understand channel attribution a little further. Let’s say we have a transition diagram as shown below:

In the above scenario, a customer can either start their journey through channel ‘C1’ or channel ‘C2’. The probability of starting with either C1 or C2 is 50% (or 0.5) each. Let’s calculate the overall probability of conversion first and then go further to see the effect of each of the channels.

P(conversion) = P(C1 -> C2 -> C3 -> Conversion) + P(C2 -> C3 -> Conversion)

= 0.5*0.5*1*0.6 + 0.5*1*0.6

= 0.15 + 0.3

= 0.45
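The arithmetic above can be reproduced in a few lines of base R (probabilities taken from the example transition diagram):

```r
# Conversion probability from the example transition diagram (values from the text)
p_start_c1 <- 0.5   # journey starts at C1
p_start_c2 <- 0.5   # journey starts at C2
p_c1_c2    <- 0.5   # C1 -> C2
p_c2_c3    <- 1.0   # C2 -> C3
p_c3_conv  <- 0.6   # C3 -> conversion

p_conversion <- p_start_c1 * p_c1_c2 * p_c2_c3 * p_c3_conv +
                p_start_c2 * p_c2_c3 * p_c3_conv
p_conversion  # 0.45
```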

A Markov chain is a process that maps movement between states and gives the probability distribution of moving from one state to another. A Markov chain is defined by three properties:

- **State space** – the set of all states in which the process could potentially exist
- **Transition operator** – the probability of moving from one state to another state
- **Current state probability distribution** – the probability distribution of being in any one of the states at the start of the process

We know the stages through which we can pass, the probability of moving from each of the paths and we know the current state. This looks similar to Markov chains, doesn’t it?

This is, in fact, an application of Markov chains. We will come back to this later; let’s stick to our example for now. If we were to figure out the contribution of channel 1 in our customer’s journey from start to final conversion, we would use the principle of the **removal effect**. *The removal effect principle says that to find the contribution of each channel in the customer journey, we can remove that channel and see how many conversions happen without it in place.*

For example, let’s assume we have to calculate the contribution of channel C1. We will remove channel C1 from the model and see how many conversions happen without C1 in the picture, vis-à-vis the total conversions when all the channels are intact. Let’s calculate for channel C1:

P(Conversion after removing C1) = P(C2 -> C3 -> Convert)

= 0.5*1*0.6

= 0.3

30% of customer interactions convert without channel C1 in place, while with C1 intact, 45% convert. So two-thirds (0.3/0.45 = 0.666) of the conversions survive without C1, and the removal effect of C1 is

*1 − 0.3/0.45 ≈ 0.33*.

The removal effect of C2 and C3 is 1 (you may try calculating it, but think intuitively. If we were to remove either C2 or C3, will we be able to complete any conversion?).
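The removal-effect arithmetic can be checked in base R (values again taken from the example diagram; removing C2 or C3 breaks both paths, so no conversions survive):

```r
# Check the removal-effect arithmetic from the example diagram
p_total      <- 0.5 * 0.5 * 1 * 0.6 + 0.5 * 1 * 0.6  # all channels intact: 0.45
p_without_c1 <- 0.5 * 1 * 0.6                        # only the C2 start survives: 0.3
p_without_c2 <- 0   # both paths pass through C2
p_without_c3 <- 0   # both paths pass through C3

p_without_c1 / p_total  # share of conversions that survive without C1
```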

This is a very useful application of Markov chains. In the above case, all the channels – C1, C2, C3 (at different stages) – are called **transition states**; while the probability of moving from one channel to another channel is called **transition probability**.

A customer journey, which is a sequence of channels, can be considered a chain in a directed Markov graph, where each vertex is a state (channel/touch point) and each edge represents the transition probability of moving from one state to another. Since the probability of reaching a state depends only on the previous state, it can be considered a memoryless Markov chain.
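The transition probabilities of such a chain can be estimated from observed paths. A toy sketch in base R (the two example paths are made up for illustration):

```r
# Toy estimate of first-order transition probabilities from two observed paths
paths <- list(c("C1", "C2", "C3", "conv"),
              c("C2", "C3", "conv"))

# Break each path into (from, to) transition pairs
steps <- do.call(rbind, lapply(paths, function(p)
  data.frame(from = head(p, -1), to = tail(p, -1))))

# Empirical transition matrix: P(to | from), rows sum to 1
trans <- prop.table(table(steps$from, steps$to), margin = 1)
trans
```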

Let’s take a real-life case study and see how we can implement channel attribution modeling.

An e-commerce company conducted a survey and collected data from its customers, which can be considered a representative sample. In the survey, the company collected data about the various touch points customers visit before finally purchasing the product on its website.

In total, there are 19 channels where customers can encounter the product or the product advertisement. After the 19 channels, there are three more cases:

- #20 – customer has decided which device to buy;
- #21 – customer has made the final purchase, and;
- #22 – customer hasn’t decided yet.

The overall categories of channels are as below:

| Category | Channel |
| --- | --- |
| Website (1, 2, 3) | Company’s website or competitor’s website |
| Research Reports (4, 5, 6, 7, 8) | Industry advisory research reports |
| Online/Reviews (9, 10) | Organic searches, forums |
| Price Comparison (11) | Aggregators |
| Friends (12, 13) | Social network |
| Expert (14) | Expert online or offline |
| Retail Stores (15, 16, 17) | Physical stores |
| Misc. (18, 19) | Others, such as promotional campaigns at various locations |

Now, we need to help the e-commerce company in identifying the right strategy for investing in marketing channels. Which channels should be focused on? Which channels should the company invest in? We’ll figure this out using R in the following section.

Let’s move ahead and try the implementation in R and check the results. You can download the dataset here and follow along as we go.

```r
# Install the libraries
install.packages("ChannelAttribution")
install.packages("ggplot2")
install.packages("reshape")
install.packages("dplyr")
install.packages("plyr")
install.packages("reshape2")
install.packages("markovchain")
install.packages("plotly")

# Load the libraries
library(ChannelAttribution)
library(ggplot2)
library(reshape)
library(dplyr)
library(plyr)
library(reshape2)
library(markovchain)
library(plotly)

# Read the data into R
channel = read.csv("Channel_attribution.csv", header = T)
head(channel)
```

Output:

| R05A.01 | R05A.02 | R05A.03 | R05A.04 | … | R05A.18 | R05A.19 | R05A.20 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 16 | 4 | 3 | 5 | … | NA | NA | NA |
| 2 | 1 | 9 | 10 | … | NA | NA | NA |
| 9 | 13 | 20 | 16 | … | NA | NA | NA |
| 8 | 15 | 20 | 21 | … | NA | NA | NA |
| 16 | 9 | 13 | 20 | … | NA | NA | NA |
| 1 | 11 | 8 | 4 | … | NA | NA | NA |

We will do some data processing to bring it to a stage where we can use it as an input in the model. Then, we will identify which customer journeys have gone to the final conversion (in our case, all the journeys have reached final conversion state).

We will create a variable ‘path’ in a specific format which can be fed as an input to the model. Also, we will find out the total occurrences of each path using the ‘dplyr’ package.

```r
for(row in 1:nrow(channel)) {
  if(21 %in% channel[row, ]) { channel$convert[row] = 1 }
}

column = colnames(channel)
channel$path = do.call(paste, c(channel[column], sep = " > "))
head(channel$path)
```

Output:

```
[1] "16 > 4 > 3 > 5 > 10 > 8 > 6 > 8 > 13 > 20 > 21 > NA > NA > NA > NA > NA > NA > NA > NA > NA > NA > 1"
[2] "2 > 1 > 9 > 10 > 1 > 4 > 3 > 21 > NA > NA > NA > NA > NA > NA > NA > NA > NA > NA > NA > NA > NA > 1"
[3] "9 > 13 > 20 > 16 > 15 > 21 > NA > NA > NA > NA > NA > NA > NA > NA > NA > NA > NA > NA > NA > NA > NA > 1"
[4] "8 > 15 > 20 > 21 > NA > NA > NA > NA > NA > NA > NA > NA > NA > NA > NA > NA > NA > NA > NA > NA > NA > 1"
[5] "16 > 9 > 13 > 20 > 21 > NA > NA > NA > NA > NA > NA > NA > NA > NA > NA > NA > NA > NA > NA > NA > NA > 1"
[6] "1 > 11 > 8 > 4 > 9 > 21 > NA > NA > NA > NA > NA > NA > NA > NA > NA > NA > NA > NA > NA > NA > NA > 1"
```

```r
for(row in 1:nrow(channel)) {
  channel$path[row] = strsplit(channel$path[row], " > 21")[[1]][1]
}

channel_fin = channel[, c(23, 22)]
channel_fin = ddply(channel_fin, ~path, summarise, conversion = sum(convert))
head(channel_fin)
```

Output:

| path | conversion |
| --- | --- |
| 1 > 1 > 1 > 20 | 1 |
| 1 > 1 > 12 > 12 | 1 |
| 1 > 1 > 14 > 13 > 12 > 20 | 1 |
| 1 > 1 > 3 > 13 > 3 > 20 | 1 |
| 1 > 1 > 3 > 17 > 17 | 1 |
| 1 > 1 > 6 > 1 > 12 > 20 > 12 | 1 |

```r
Data = channel_fin
head(Data)
```

Output:

| path | conversion |
| --- | --- |
| 1 > 1 > 1 > 20 | 1 |
| 1 > 1 > 12 > 12 | 1 |
| 1 > 1 > 14 > 13 > 12 > 20 | 1 |
| 1 > 1 > 3 > 13 > 3 > 20 | 1 |
| 1 > 1 > 3 > 17 > 17 | 1 |
| 1 > 1 > 6 > 1 > 12 > 20 > 12 | 1 |

Now, we will create a heuristic model and a Markov model, combine the two, and then check the final results.

```r
H <- heuristic_models(Data, 'path', 'conversion', var_value = 'conversion')
H
```

Output:

| channel_name | first_touch_conversions | … | linear_touch_conversions | linear_touch_value |
| --- | --- | --- | --- | --- |
| 1 | 130 | … | 73.773661 | 73.773661 |
| 20 | 0 | … | 473.998171 | 473.998171 |
| 12 | 75 | … | 76.127863 | 76.127863 |
| 14 | 34 | … | 56.335744 | 56.335744 |
| 13 | 320 | … | 204.039552 | 204.039552 |
| 3 | 168 | … | 117.609677 | 117.609677 |
| 17 | 31 | … | 76.583847 | 76.583847 |
| 6 | 50 | … | 54.707124 | 54.707124 |
| 8 | 56 | … | 53.677862 | 53.677862 |
| 10 | 547 | … | 211.822393 | 211.822393 |
| 11 | 66 | … | 107.109048 | 107.109048 |
| 16 | 111 | … | 156.049086 | 156.049086 |
| 2 | 199 | … | 94.111668 | 94.111668 |
| 4 | 231 | … | 250.784033 | 250.784033 |
| 7 | 26 | … | 33.435991 | 33.435991 |
| 5 | 62 | … | 74.900402 | 74.900402 |
| 9 | 250 | … | 194.07169 | 194.07169 |
| 15 | 22 | … | 65.159225 | 65.159225 |
| 18 | 4 | … | 5.026587 | 5.026587 |
| 19 | 10 | … | 12.676375 | 12.676375 |

```r
M <- markov_model(Data, 'path', 'conversion', var_value = 'conversion', order = 1)
M
```

Output:

| channel_name | total_conversion | total_conversion_value |
| --- | --- | --- |
| 1 | 82.482961 | 82.482961 |
| 20 | 432.40615 | 432.40615 |
| 12 | 83.942587 | 83.942587 |
| 14 | 63.08676 | 63.08676 |
| 13 | 195.751556 | 195.751556 |
| 3 | 122.973752 | 122.973752 |
| 17 | 83.866724 | 83.866724 |
| 6 | 63.280828 | 63.280828 |
| 8 | 61.016115 | 61.016115 |
| 10 | 209.035208 | 209.035208 |
| 11 | 118.563707 | 118.563707 |
| 16 | 158.692238 | 158.692238 |
| 2 | 98.067199 | 98.067199 |
| 4 | 223.709091 | 223.709091 |
| 7 | 41.919248 | 41.919248 |
| 5 | 81.865473 | 81.865473 |
| 9 | 179.483376 | 179.483376 |
| 15 | 70.360777 | 70.360777 |
| 18 | 5.950827 | 5.950827 |
| 19 | 15.545424 | 15.545424 |

Before going further, let’s first understand what a few of the terms we’ve seen above mean.

**First Touch Conversion:** The conversion happening through the channel when that channel is the first touch point for a customer. 100% credit is given to the first touch point.

**Last Touch Conversion:** The conversion happening through the channel when that channel is the last touch point for a customer. 100% credit is given to the last touch point.

**Linear Touch Conversion:** All channels/touch points are given equal credit in the conversion.
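As a toy illustration (the path below is hypothetical), the three heuristics split the credit for a single converting path like this:

```r
# Hypothetical credit split for one converting path under each heuristic
path <- c("10", "13", "20")

first_touch  <- setNames(c(1, rep(0, length(path) - 1)), path)      # 100% to first channel
last_touch   <- setNames(c(rep(0, length(path) - 1), 1), path)      # 100% to last channel
linear_touch <- setNames(rep(1 / length(path), length(path)), path) # equal credit to all

first_touch; last_touch; linear_touch
```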

Getting back to the R code, let’s merge the two models and represent the output in a visually appealing manner which is easier to understand.

```r
# Merge the two data frames on the "channel_name" column
R <- merge(H, M, by = 'channel_name')

# Select only relevant columns
R1 <- R[, (colnames(R) %in% c('channel_name', 'first_touch_conversions',
                              'last_touch_conversions', 'linear_touch_conversions',
                              'total_conversion'))]

# Reshape into a long data frame that ggplot2 can use to plot the outcomes
R1 <- melt(R1, id = 'channel_name')
```

```r
# Plot the total conversions
ggplot(R1, aes(channel_name, value, fill = variable)) +
  geom_bar(stat = 'identity', position = 'dodge') +
  ggtitle('TOTAL CONVERSIONS') +
  theme(axis.title.x = element_text(vjust = -2)) +
  theme(axis.title.y = element_text(vjust = +2)) +
  theme(title = element_text(size = 16)) +
  theme(plot.title = element_text(size = 20)) +
  ylab("")
```

The scenario is clearly visible from the above graph. From the first touch conversion perspective, channels 10, 13, 2, 4 and 9 are quite important, while from the last touch perspective, channel 20 is the most important (in our case it should be, since by that point the customer has decided which product to buy). In terms of linear touch conversion, channels 20, 4 and 9 emerge as important. From the total conversions perspective, channels 10, 13, 20, 4 and 9 are quite important.

From the above chart, we have been able to figure out which channels are important for us to focus on and which can be discarded or ignored. This case gives us very good insight into the application of Markov chain models in the customer analytics space. E-commerce companies can now confidently create their marketing strategy and distribute their marketing budget using data-driven insights.


The post Creating & Visualizing Neural Network in R appeared first on Perceptive Analytics.

A neural network is an information-processing machine that can be viewed as analogous to the human nervous system. Just like the human nervous system, which is made up of interconnected neurons, a neural network is made up of interconnected information-processing units. These units do not work in a linear manner; in fact, a neural network draws its strength from parallel processing of information, which allows it to deal with non-linearity. Neural networks come in handy for inferring meaning and detecting patterns in complex data sets.

The neural network is considered one of the most useful techniques in the world of data analytics. However, it is complex and is often regarded as a black box, i.e. users see the input and output of a neural network but remain clueless about the knowledge-generating process. We hope this article will help readers learn about the internal mechanism of a neural network and get hands-on experience implementing it in R.

- The Basics of Neural Network
- Fitting Neural Network in R
- Cross Validation of a Neural Network

A neural network is a model characterized by an activation function, which interconnected information-processing units use to transform input into output. Neural networks have often been compared to the human nervous system: information is passed through interconnected units, analogous to the passage of information through neurons in humans. The first layer of the neural network receives the raw input, processes it and passes the processed information to the hidden layers. The hidden layer passes the information on to the last layer, which produces the output. The advantage of a neural network is that it is adaptive in nature: it learns from the information provided, i.e. it trains itself on data with a known outcome and optimizes its weights for better prediction in situations with an unknown outcome.

A perceptron, viz. a single-layer neural network, is the most basic form of a neural network. A perceptron receives multidimensional input and processes it using a weighted summation and an activation function. It is trained using labeled data and a learning algorithm that optimizes the weights in the summation processor. A major limitation of the perceptron model is its inability to deal with non-linearity. A multilayered neural network overcomes this limitation and helps solve non-linear problems: the input layer connects to the hidden layer, which in turn connects to the output layer. The connections are weighted, and the weights are optimized using a learning rule.
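The weighted summation and activation described above can be sketched in a few lines of R (toy weights and inputs; the sigmoid is chosen here only as an example activation function):

```r
# Forward pass of a single perceptron: weighted summation + activation
sigmoid <- function(z) 1 / (1 + exp(-z))

x <- c(0.5, -1.0, 2.0)   # multidimensional input (toy values)
w <- c(0.4,  0.3, -0.1)  # weights (toy values)
b <- 0.2                 # bias term

output <- sigmoid(sum(w * x) + b)
output
```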

There are many learning rules that are used with neural network:

a) least mean square;

b) gradient descent;

c) Newton’s rule;

d) conjugate gradient etc.

These learning rules can be used in conjunction with the backpropagation-of-error method. The learning rule is used to calculate the error at the output unit. This error is backpropagated to all the units such that the error at each unit is proportional to that unit’s contribution towards the total error at the output unit. The errors at each unit are then used to optimize the weight of each connection. Figure 1 displays the structure of a simple neural network model for better understanding.

Figure 1 A simple neural network model
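The error-apportioning step described above can be sketched for a single linear output unit (toy numbers, squared-error loss assumed, so each weight’s gradient is simply the error times its input):

```r
# One gradient-descent weight update for a single linear output unit
x      <- c(1, 2)       # inputs to the unit (toy values)
y_true <- 1             # known outcome
w      <- c(0.1, -0.2)  # current weights
lr     <- 0.05          # learning rate

y_hat <- sum(w * x)     # forward pass
err   <- y_hat - y_true # error at the output unit
grad  <- err * x        # each weight's share of the error, proportional to its input
w     <- w - lr * grad  # weight update
w
```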

Now we will fit a neural network model in R. In this article, we use a subset of the cereal dataset shared by Carnegie Mellon University (CMU). The details of the dataset are at the following link: http://lib.stat.cmu.edu/DASL/Datafiles/Cereals.html. The objective is to predict the rating of cereals using variables such as calories, protein, fat, etc. The R script is provided side by side and is commented for better understanding. The data is in .csv format and can be downloaded by clicking: cereals.

Please set the working directory in R using the *setwd()* function, and keep cereals.csv in the working directory. We use rating as the dependent variable and calories, protein, fat, sodium and fiber as the independent variables. We divide the data into a training set and a test set. The training set is used to find the relationship between the dependent and independent variables, while the test set assesses the performance of the model. We use 60% of the dataset as the training set. The assignment of the data to the training and test sets is done by random sampling, performed in R using the *sample()* function. We have used *set.seed()* to generate the same random sample every time and maintain consistency. We will use the *index* variable while fitting the neural network to create the training and test data sets. The R script is as follows:

```r
# Read the data
data = read.csv("cereals.csv", header = T)

# Random sampling
samplesize = 0.60 * nrow(data)
set.seed(80)
index = sample(seq_len(nrow(data)), size = samplesize)

# Create training and test set
datatrain = data[index, ]
datatest = data[-index, ]
```

Now we fit a neural network on our data, using the *neuralnet* library for the analysis. The first step is to scale the cereal dataset. Scaling the data is essential because otherwise a variable may have a large impact on the prediction variable only because of its scale; using unscaled data may lead to meaningless results. Common techniques to scale data are min-max normalization, Z-score normalization, median and MAD, and tanh estimators. Min-max normalization transforms the data into a common range, thus removing the scaling effect from all the variables, and unlike the Z-score and median-and-MAD methods, it retains the original distribution of the variables. We use min-max normalization to scale the data. The R script for scaling the data is as follows.

```r
# Scale data for neural network
max = apply(data, 2, max)
min = apply(data, 2, min)
scaled = as.data.frame(scale(data, center = min, scale = max - min))
```

The scaled data is used to fit the neural network. We visualize the neural network with weights for each of the variable. The R script is as follows.

```r
# Install and load the neuralnet library
install.packages("neuralnet")
library(neuralnet)

# Create training and test set
trainNN = scaled[index, ]
testNN = scaled[-index, ]

# Fit neural network
set.seed(2)
NN = neuralnet(rating ~ calories + protein + fat + sodium + fiber,
               trainNN, hidden = 3, linear.output = T)

# Plot neural network
plot(NN)
```

Figure 2 visualizes the computed neural network. Our model has 3 neurons in its hidden layer. The black lines show the connections with their weights. The weights are calculated using the backpropagation algorithm explained earlier. The blue lines display the bias terms.

Figure 2 Neural Network

We predict the rating using the neural network model. The reader must remember that the predicted rating will be scaled, and it must be transformed back in order to make a comparison with the real rating. We also compare the predicted rating with the real rating using a visualization. The RMSE for the neural network model is 6.05. The reader can learn more about RMSE in another article, which can be accessed by clicking __here__. The R script is as follows:

```r
# Prediction using neural network
predict_testNN = compute(NN, testNN[, c(1:5)])
predict_testNN = (predict_testNN$net.result * (max(data$rating) - min(data$rating))) + min(data$rating)

plot(datatest$rating, predict_testNN, col = 'blue', pch = 16,
     ylab = "predicted rating NN", xlab = "real rating")
abline(0, 1)

# Calculate Root Mean Square Error (RMSE)
RMSE.NN = (sum((datatest$rating - predict_testNN)^2) / nrow(datatest))^0.5
```

Figure 3: Predicted rating vs. real rating using neural network

We have evaluated our neural network using RMSE, which is a residual method of evaluation. The major problem with residual evaluation methods is that they do not tell us how the model will behave when new data is introduced. We tried to deal with this “new data” problem by splitting our data into training and test sets, constructing the model on the training set and evaluating it by calculating RMSE on the test set. The training-test split was the simplest form of cross-validation, known as the *holdout method*. A limitation of the *holdout method* is that the variance of the performance metric (in our case, RMSE) can be high depending on which elements are assigned to the training and test sets.

The second commonly used cross-validation technique is *k-fold cross-validation*. This method can be viewed as a recurring *holdout method*. The complete data is partitioned into k equal subsets, and each time one subset is assigned as the test set while the others are used for training the model. Every data point gets a chance to be in the test set and the training set, so this method reduces the dependence of performance on the test-training split and reduces the variance of the performance metrics. The extreme case of *k-fold cross-validation* occurs when k is equal to the number of data points: the predictive model is trained on all the data points except one, which takes the role of the test set. This method of leaving one data point out as the test set is known as *leave-one-out cross-validation*.
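The fold-partitioning idea can be sketched in a few lines of base R (toy values for n and k; the model fitting itself is elided):

```r
# Sketch of k-fold partitioning: every point lands in exactly one test fold
set.seed(1)
n <- 20; k <- 5
folds <- sample(rep(1:k, length.out = n))  # random fold assignment

for (f in 1:k) {
  test_idx  <- which(folds == f)
  train_idx <- setdiff(seq_len(n), test_idx)
  # fit the model on train_idx and compute RMSE on test_idx here
}
```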

Now we will perform *k-fold cross-validation* on the neural network model we built in the previous section. The number of elements in the training set, *j*, is varied from 10 to 65, and for each *j*, 100 samples are drawn from the dataset. The rest of the elements in each case are assigned to the test set. The model is trained on each of the 5600 training datasets and then tested on the corresponding test sets. We compute the RMSE of each test set and store the values in a 100 × 56 matrix. This method ensures that our results are free of any sample bias and checks the robustness of our model. We employ a nested for loop.

```r
# Cross validation of neural network model
# Install relevant libraries
install.packages("boot")
install.packages("plyr")

# Load libraries
library(boot)
library(plyr)

# Initialize variables
set.seed(50)
k = 100
RMSE.NN = NULL
List = list()

# Fit neural network model within nested for loop
for (j in 10:65) {
  for (i in 1:k) {
    index = sample(1:nrow(data), j)
    trainNN = scaled[index, ]
    testNN = scaled[-index, ]
    datatest = data[-index, ]

    NN = neuralnet(rating ~ calories + protein + fat + sodium + fiber,
                   trainNN, hidden = 3, linear.output = T)
    predict_testNN = compute(NN, testNN[, c(1:5)])
    predict_testNN = (predict_testNN$net.result * (max(data$rating) - min(data$rating))) + min(data$rating)

    RMSE.NN[i] <- (sum((datatest$rating - predict_testNN)^2) / nrow(datatest))^0.5
  }
  List[[j]] = RMSE.NN
}

Matrix.RMSE = do.call(cbind, List)
```

The RMSE values can be accessed using the variable Matrix.RMSE. The size of the matrix is large; therefore we will try to make sense of the data through visualizations. First, we will prepare a boxplot for one of the columns in Matrix.RMSE, where training set has length equal to 65. One can prepare these box plots for each of the training set lengths (10 to 65).

```r
# Prepare boxplot
boxplot(Matrix.RMSE[, 56], ylab = "RMSE",
        main = "RMSE BoxPlot (length of training set = 65)")
```

Figure 4 Boxplot

The boxplot in Fig. 4 shows that the median RMSE across 100 samples when length of training set is fixed to 65 is 5.70. In the next visualization we study the variation of RMSE with the length of training set. We calculate the median RMSE for each of the training set length and plot them using the following R script.

```r
# Variation of median RMSE
install.packages("matrixStats")
library(matrixStats)

med = colMedians(Matrix.RMSE)
X = seq(10, 65)

plot(med ~ X, type = "l", xlab = "length of training set",
     ylab = "median RMSE", main = "Variation of RMSE with length of training set")
```

Figure 5 Variation of RMSE

Figure 5 shows that the median RMSE of our model decreases as the length of the training set increases. This is an important result: model accuracy depends on the length of the training set, and the performance of a neural network model is sensitive to the training-test split.

The article discusses the theoretical aspects of a neural network, its implementation in R, and post-training evaluation. Neural networks are inspired by the biological nervous system: similar to the nervous system, information is passed through layers of processors, and the significance of variables is represented by the weights of each connection. The article provides a basic understanding of the backpropagation algorithm, which is used to assign these weights. We also implement a neural network in R, using a publicly available dataset shared by CMU. The aim is to predict the rating of cereals using information such as calories, fat, protein, etc. After constructing the neural network, we evaluate the model for accuracy and robustness: we compute RMSE and perform a cross-validation analysis, checking the variation in model accuracy as the length of the training set is changed. We consider training sets of length 10 to 65; for each length, 100 samples are randomly picked and the median RMSE is calculated. We show that model accuracy increases as the training set grows. Before using a model for prediction, it is important to check the robustness of its performance through cross-validation.


The post Modelling Time Series Processes using GARCH appeared first on Perceptive Analytics.


While techniques like linear regression or time series analysis aim at modelling the general trend exhibited by a set or series of data points, data scientists faced another question: these models can capture the overall trend, but how can one model the volatility in the data? In real life, the initial stages of a business or a new market are volatile and change at high velocity until things calm down and become saturated. Only then can one apply statistical techniques such as time series analysis or regression, as the case may be. To venture into the turbulent seas of volatile data and analyze it in a time-changing setting, ARCH models were developed.

As already mentioned, ARCH is a statistical model for time series data. The proxy for volatility used by ARCH is variance (or standard deviation): the approach measures the variance of the error term over time. In ARCH modelling, the error term is assumed to follow an AR (Autoregressive) model, which means the error terms cannot have a Moving Average (MA) component. If they possess both AR and MA components, that is, if they follow an ARMA model, we use a GARCH (Generalized ARCH) model instead. GARCH models are useful for modelling market data such as stock markets and other financial instruments. Let’s learn a few more interesting peculiarities about volatility.

When we look at the variance of error terms, there can be many patterns, and one of the most common is a repetitive one. Volatility clustering, as it is called, is a repeating pattern of high- and low-volatility periods. GARCH models quite suitably capture volatility clustering trends in data. One needs to remember that whether ARCH or GARCH is applied, they do not explain trends in the error terms but only capture them. This also means that GARCH is more focused on the occurrence of spikes and troughs than on their level: you can know when to expect a possible decline or steep rise, but should not rely on how large that change will be. Naturally, such a problem requires a lot of data; we’re talking about tens of thousands of observations just to model the peaks.

Since GARCH is based on ARMA modelling, we use the GARCH(p, q) notation to indicate the AR and MA components. One of the most popular GARCH models is the GARCH(1,1) model. The model parameters are then estimated using maximum likelihood. However, we generally do not depend on the assumption of normality of the data; rather, we use the t-distribution, which fits long-tailed distributions. Other long-tailed distributions are also suitable and can be used.

To test the goodness of fit, we usually check for autocorrelation in the squared standardized residuals. A robust test for this is the Ljung-Box test, which calculates the Ljung-Box statistic and its p-value. Another thing of interest in GARCH models is persistence, which indicates how fast volatile spikes decay and stabilize after a shock. In the typical GARCH(1,1) model, the key statistic is the sum of the two parameters, commonly denoted alpha1 and beta1. If the sum is greater than 1, it means volatility will increase and explode instead of decaying, which is hardly ever the situation. A sum exactly equal to 1 means shocks persist indefinitely rather than decaying. In real life, most GARCH models have a sum less than 1.
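The Ljung-Box test is available in base R as `Box.test`. A minimal sketch, applied here to simulated white noise as a stand-in for the squared standardized residuals of a fitted model:

```r
# Ljung-Box test on squared values; the simulated series below is a
# stand-in for squared standardized residuals of a fitted GARCH model
set.seed(42)
res2 <- rnorm(500)^2
lb <- Box.test(res2, lag = 10, type = "Ljung-Box")
lb$p.value  # a large p-value suggests no remaining autocorrelation
```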

We can also transform the persistence in terms of half-life. We know the half-life is the time in which half of the volatility decays. Hence, we use the log notation:

*half life = log(0.5)/log(alpha1 + beta1)*

Since log(1) = 0, if the sum of alpha1 and beta1 is exactly equal to 1, the half-life becomes infinite. What does this mean? Persistence and half-life are derived from the training data. If there is a trend in the volatility of the training data, the estimator may mistakenly calculate a near-infinite half-life depending on where the sample ends. This is another reason why we need tens of thousands of data points when modelling GARCH; a smaller sample results in a higher possibility of errors. These parameter estimates are very important, as they are used to make predictions on test data, and they need to be checked after model fitting.
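The half-life arithmetic can be checked directly. A small sketch with assumed (hypothetical, not fitted) GARCH(1,1) estimates:

```r
# Persistence and half-life for assumed GARCH(1,1) estimates
alpha1 <- 0.08   # hypothetical fitted value
beta1  <- 0.85   # hypothetical fitted value

persistence <- alpha1 + beta1              # 0.93 < 1, so shocks decay
half_life   <- log(0.5) / log(persistence) # periods until half the volatility decays
half_life
```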

All these may be a bit hard to digest. Let’s understand more concepts using a practical implementation in R.

There are a lot of GARCH packages, since GARCH models have been specialized into many variations; it is difficult to understand and explain all of them. However, we will go through one of the most popular GARCH packages, fGarch. We will also use the Ecdat package for the Garch dataset, which contains daily observations on exchange rates of the US dollar against other currencies from 1 January 1980 to 21 May 1987, a total of 1,867 observations.

#Install the Ecdat package

install.packages("Ecdat")

#Loading the library and the Garch dataset

library(Ecdat)

mydata=Garch

#Look at the dataset

str(mydata)

‘data.frame’: 1867 obs. of 8 variables:

$ date: int 800102 800103 800104 800107 800108 800109 800110 800111 800114 800115 …

$ day : chr “wednesday” “thursday” “friday” “monday” …

$ dm : num 0.586 0.584 0.584 0.585 0.582 …

$ ddm : num NA -0.004103 0.000856 0.001881 -0.004967 …

$ bp : num 2.25 2.24 2.24 2.26 2.26 …

$ cd : num 0.855 0.855 0.857 0.854 0.855 …

$ dy : num 0.00421 0.00419 0.00427 0.00432 0.00426 …

$ sf : num 0.636 0.636 0.635 0.637 0.633 …

We notice that the data types are a bit mismatched: we need to convert date to Date format and day to a factor before proceeding. The remaining features are exchange rates and are already in the correct format.

#Correct the data types of date and day

#Correcting date maps it to arbitrary calendar dates: the trend is preserved even though the mapping is not the true date

mydata$date=as.Date(mydata$date, origin = "1980-01-02")

mydata$day=as.factor(mydata$day)

Let's include the other packages. We will use the garchFit function from fGarch to perform our analysis.

#Install packages and load them

install.packages("tseries")

install.packages("urca")

install.packages("fUnitRoots")

install.packages("forecast")

install.packages("fGarch")

library(fGarch) # estimate GARCH and Forecast

library(tseries) #used for time series data

library(urca) #Used for checking Unit root Cointegration

library(fUnitRoots) #Used for conducting unit root test

library(forecast) #Used for forecasting ARIMA model

#Converting Dollar – Deutsche mark exchange rate to time series

exchange_rate_dollar_deutsch_mark <- ts(mydata$dm, start=c(1980, 1), end=c(1987, 5), frequency=266)

#Plot the time series
plot.ts(exchange_rate_dollar_deutsch_mark, main="exchange_rate_dollar_deutsch_mark")

The plot shows many small variations across the years. The next step is to process the data: we take the difference of the values. Although the ddm column already provides the difference, I calculate it separately as the difference of the log of the exchange rate, multiplied by 100, since this gives a better representation of the variation. Remember that in economic terms, the difference of (log) exchange rates also represents inflation or deflation, as the case may be.

#Calculate inflation as difference of log of exchange rate and then multiplied by 100

inflation_series <- (diff(log(exchange_rate_dollar_deutsch_mark)))*100

#Plot the inflation series
plot.ts(inflation_series, main="Inflation of exchange rate")

This inflation series represents the variability in the original time series. There is continuous variation without a definite trend or pattern, and even some spikes, such as the one of about 5.5 between 1985 and 1986. This is exactly the kind of series that a GARCH model can capture adequately. To make things clearer, let's also look at the summary statistics of the inflation series.

summary(inflation_series)

Min. 1st Qu. Median Mean 3rd Qu. Max.

-2.822000 -0.451700 -0.026770 -0.002183 0.428900 5.502000

The range goes from -2.8 to 5.5 with a mean of -0.002. The series oscillates around 0 but has a lot of variability, making it an ideal candidate for GARCH modelling. To fit a GARCH model, we first need to identify the ARIMA model on which the GARCH component will be added. This calls for ACF and PACF plots.

#ACF and PACF Plots

acf(inflation_series)

pacf(inflation_series)

Looking at the plots, the third, fourth and fifth lags appear fairly significant, so an ARIMA(5,0,0) model should be a good fit for the series. Let's fit it and then test the residuals with the Ljung-Box test.

#Model fitting. We have selected the ARIMA(5,0,0) model

Arima_5_0_0 <- arima(inflation_series[1:499], order = c(5,0,0))

#Check out the residuals
residual <- Arima_5_0_0$resid
acf(residual)
pacf(residual)

The first five lags almost vanish, with no lag showing a significant impact in the ACF plot. Let's double-check with the Ljung-Box test, a statistical test of whether any of the autocorrelations are non-zero. A high p-value means there is no evidence against the hypothesis that all autocorrelations are zero. If the p-value is below 0.05, we conclude with 95% confidence that some autocorrelations are non-zero and our ARIMA fit was inadequate.

#Perform the Ljung Box test

Box.test(residual, c(20), "Ljung-Box")

Box-Ljung test

data: residual
X-squared = 15.321, df = 20, p-value = 0.7578

The p-value is quite high, indicating a good fit. Now it is time to fit a GARCH model over the ARIMA(5,0,0). Let's try the popular GARCH(1,1) specification. To keep things simple, we will only use the first 500 observations. The garchFit function helps us achieve this.

#Fitting GARCH over the first 500 data points

garch.fit <- garchFit(formula = ~arma(5,0)+garch(1,1), data = inflation_series[1:500])

An advantage of the fGarch package is that the garchFit function is very rich in functionality. Calling summary on the fit provides a range of tests and statistics, and the plot function offers a number of diagnostic graphs. Let's try the plot function on the model.

#Plot the model

plot(garch.fit)

Make a plot selection (or 0 to exit):

1: Time Series
2: Conditional SD
3: Series with 2 Conditional SD Superimposed
4: ACF of Observations
5: ACF of Squared Observations
6: Cross Correlation
7: Residuals
8: Conditional SDs
9: Standardized Residuals
10: ACF of Standardized Residuals
11: ACF of Squared Standardized Residuals
12: Cross Correlation between r^2 and r
13: QQ-Plot of Standardized Residuals

As we can see, we are presented with a choice of plots. I recommend trying all the options from 1 to 13 and noting the analysis each provides. For simplicity, I will show two plots: the time series, and the series with 2 conditional SDs superimposed.
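The persistence and half-life discussed earlier can also be checked directly on the fitted estimates. A minimal sketch, assuming the garch.fit object created above (the coefficient names alpha1 and beta1 follow fGarch's naming convention):

```r
# Named coefficient vector of the fitted ARMA(5,0)+GARCH(1,1) model
params <- coef(garch.fit)

# Persistence is the sum of the GARCH parameters
persistence <- params[["alpha1"]] + params[["beta1"]]

# Half-life of a volatility shock, in observations
half_life <- log(0.5) / log(persistence)

persistence
half_life
```

If the reported persistence is close to 1, revisit the sample-size caveat from earlier before trusting the half-life estimate.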

This is just the beginning: there are many GARCH packages, and different packages have different applications. The fGarch package used in this article stands for "financial GARCH" and is suited to modelling heteroskedasticity in financial time series, such as the exchange rate from the Garch dataset. Try working with other packages and keep learning. This article provides the background required for GARCH modelling, with an implementation on a financial series.

**Here is the entire code** used for reference.

The post Modelling Time Series Processes using GARCH appeared first on Perceptive Analytics.

]]>The post Choropleth Maps in R appeared first on Perceptive Analytics.

]]>Choropleth maps provide a very simple and easy-to-understand way to visualize a measurement across different geographical areas, be it states or countries.

If you were to compare the growth rates of Indian states and present them to a group of people who have 15-20 seconds to look at the data and draw insights from it, what would be the right way? The best way? Would presenting the data in the traditional tabular format make sense, or would bar graphs look better?

Bar graphs would indeed look better: they present the data in a visually appealing manner and allow good comparison. But will they make an impact in 15 seconds? I doubt it; moreover, data for 36 states and union territories in 36 bars makes it cumbersome to scroll up and down. We have a much better alternative to tables and bar charts: choropleth maps.

Choropleth maps are thematic maps in which areas are colored or shaded according to the value of the statistical variable being represented. For example, if we wanted to compare population density across the states of the United States of America in a colorful manner, a choropleth map would be our best bet. To sum up, choropleth maps provide a very simple, easy-to-understand way to visualize a measurement across geographical areas, be it states or countries.

Let’s take some examples of choropleth maps and where they come handy in presenting data.

- Choropleth maps are widely used to represent macroeconomic variables such as GDP growth rate, population density, per-capita income, etc. on a world map and provide a proportional comparison among countries. This can also be done for states within a country.
- These maps can also be used to present nominal data such as gain/loss/no change in number of seats by an election party in a country.

One limitation of choropleth maps is that they do not convey totals or absolute values. They are among the best choices for proportional comparison, but when it comes to presenting absolute values they are not the right fit.

Now, let us try to see the practical implementation of choropleth maps in R. In the following code, we will try to achieve the following objectives as part of the overall implementation of the maps.

- Download and import the map shapefile into R
- Create our own dataset and represent it on the map of India
- Merge the dataset and prepare it for visual representation
- Improve the visualization
- Display external data on choropleth maps
- Present multiple maps at once

There are multiple sites from which you can download shapefiles for free. I used this site (http://www.diva-gis.org/gdata) to download the administrative map of India for further processing. Once you download the file, unzip it and set your R working directory to the unzipped folder.

We will install all the necessary libraries at once and discuss them one by one as we proceed.

# Install all necessary packages and load the libraries into R
library(ggplot2)
library(RColorBrewer)
library(ggmap)
library(maps)
library(rgdal)
library(scales)
library(maptools)
library(gridExtra)
library(rgeos)

Set the working directory to the unzipped folder and use the following code to import the shape into R.

# Set working directory
states_shape = readShapeSpatial("IND_adm1.shp")
class(states_shape)
names(states_shape)
print(states_shape$ID_1)
print(states_shape$NAME_1)
plot(states_shape, main = "Administrative Map of India")

> class(states_shape)
[1] "SpatialPolygonsDataFrame"
attr(,"package")
[1] "sp"
> names(states_shape)
[1] "ID_0"      "ISO"       "NAME_0"    "ID_1"      "NAME_1"    "TYPE_1"    "ENGTYPE_1" "NL_NAME_1" "VARNAME_1"
> print(states_shape$ID_1)
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
> print(states_shape$NAME_1)
 [1] Andaman and Nicobar    Andhra Pradesh         Arunachal Pradesh      Assam                  Bihar
 [6] Chandigarh             Chhattisgarh           Dadra and Nagar Haveli Daman and Diu          Delhi
[11] Goa                    Gujarat                Haryana                Himachal Pradesh       Jammu and Kashmir
[16] Jharkhand              Karnataka              Kerala                 Lakshadweep            Madhya Pradesh
[21] Maharashtra            Manipur                Meghalaya              Mizoram                Nagaland
[26] Orissa                 Puducherry             Punjab                 Rajasthan              Sikkim
[31] Tamil Nadu             Telangana              Tripura                Uttar Pradesh          Uttaranchal
[36] West Bengal
36 Levels: Andaman and Nicobar Andhra Pradesh Arunachal Pradesh Assam Bihar Chandigarh Chhattisgarh ... West Bengal
> plot(states_shape, main = "Administrative Map of India")

ID_1 provides a unique id for each of the 36 states and union territories, while NAME_1 provides their names. We will mainly use these two fields; the other fields provide the country name, country code and other information that separates one country's data from another's.

Alternatively, there is another function, from a different package, that we can use to import the shapefile into R.

States_shape2 = readOGR(".","IND_adm1")
class(States_shape2)
names(States_shape2)
plot(States_shape2)

> States_shape2 <- readOGR(".","IND_adm1")
OGR data source with driver: ESRI Shapefile
Source: ".", layer: "IND_adm1"
with 36 features
It has 9 fields
Integer64 fields read as strings: ID_0 ID_1
> class(States_shape2)
[1] "SpatialPolygonsDataFrame"
attr(,"package")
[1] "sp"
> names(States_shape2)
[1] "ID_0"      "ISO"       "NAME_0"    "ID_1"      "NAME_1"    "TYPE_1"    "ENGTYPE_1" "NL_NAME_1" "VARNAME_1"
> plot(States_shape2)

In the code above, readOGR(".", "IND_adm1") uses "." to indicate that the shapefile we want to read is in our working directory; otherwise we would have to provide the full path. Also, we need to give the shapefile name without the extension, or the call will throw an error.

To begin with, we will create our own data for each of the 36 IDs and call it score, a parameter representing the dancing talent of each state. (Please note that this score is randomly generated and does not reflect true dancing talent :P)

# Creating our own dataset
set.seed(100)
State_count = length(states_shape$NAME_1)
score_1 = sample(100:1000, State_count, replace = T)
score_2 = runif(State_count, 1, 1000)
score = score_1 + score_2
State_data = data.frame(id=states_shape$ID_1, NAME_1=states_shape$NAME_1, score)
State_data

> State_data
   id                 NAME_1     score
1   1    Andaman and Nicobar  558.2268
2   2         Andhra Pradesh  961.7615
3   3      Arunachal Pradesh 1586.5746
4   4                  Assam  281.1586
5   5                  Bihar  853.3299
6   6             Chandigarh 1400.2554
7   7           Chhattisgarh 1608.8069
8   8 Dadra and Nagar Haveli 1260.4761
9   9          Daman and Diu 1195.7210
10 10                  Delhi  744.7406
11 11                    Goa 1443.5782
12 12                Gujarat 1778.3428
13 13                Haryana  560.5062
14 14       Himachal Pradesh  766.7788
15 15      Jammu and Kashmir 1118.1993
16 16              Jharkhand  901.4804
17 17              Karnataka  520.4586
18 18                 Kerala  697.6118
19 19            Lakshadweep 1014.7297
20 20         Madhya Pradesh  975.1373
21 21            Maharashtra  706.3637
22 22                Manipur  970.6760
23 23              Meghalaya 1182.9777
24 24                Mizoram  986.1971
25 25               Nagaland  942.2375
26 26                 Orissa  901.4541
27 27             Puducherry 1754.6125
28 28                 Punjab 1570.7218
29 29              Rajasthan 1039.7029
30 30                 Sikkim  708.4160
31 31             Tamil Nadu  995.2757
32 32              Telangana 1381.9686
33 33                Tripura  659.8475
34 34          Uttar Pradesh 1653.6564
35 35            Uttaranchal 1138.8248
36 36            West Bengal 1229.3981

We will use the fortify() function of the ggplot2 package to convert the shapefile into a data frame, and then merge that data frame with our dataset.

# Fortify file
fortify_shape = fortify(states_shape, region = "ID_1")
class(fortify_shape)

> fortify_shape = fortify(states_shape, region = "ID_1")
> class(fortify_shape)
[1] "data.frame"

#Merge with coefficients and reorder
Merged_data = merge(fortify_shape, State_data, by="id", all.x=TRUE)
Map_plot = Merged_data[order(Merged_data$order), ]

Now, let's create a basic visualization and see how our map looks.

ggplot() +
  geom_polygon(data = Map_plot,
               aes(x = long, y = lat, group = group, fill = score),
               color = "black", size = 0.5) +
  coord_map()

We will use some of the functions of packages ‘ggplot2’ and ‘ggmap’ to improve the visual appeal of maps that we have created.

Let’s begin by creating our first plot and then subsequently improve in the next plots by adding more features.

#Plot 1
ggplot() +
  geom_polygon(data = Map_plot,
               aes(x = long, y = lat, group = group, fill = score),
               color = "black", size = 0.5) +
  coord_map() +
  scale_fill_distiller(name="Score") +
  theme_nothing(legend = TRUE) +
  labs(title="Score in India - Distribution by State")

Let's make our map a little more colorful so that it shows the distribution clearly. The display.brewer.all() function from the 'RColorBrewer' package shows all the color palettes available in R; we can choose the one we like.

# Check Color palettes display.brewer.all()

Now, change the color palette and change the legend by adding more breaks.

#Plot 2
ggplot() +
  geom_polygon(data = Map_plot,
               aes(x = long, y = lat, group = group, fill = score),
               color = "darkblue", size = 1) + #"darkblue" is a valid R color name
  coord_map() +
  scale_fill_distiller(name="Score", palette = "Set3", breaks = pretty_breaks(n = 7)) +
  theme_nothing(legend = TRUE) +
  labs(title="Score in India - Distribution by State")

pretty_breaks() is a function in the 'scales' package that lets us define the number of breaks shown in the legend. In the map above we have 7 breaks, from 400 to 1600 at intervals of 200, while the preceding map had only 3 breaks.

Now, add the state names to the graph to make it more appealing and illustrative.
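The plot that follows passes a data frame called name_lat_lon to geom_text(), but it is never constructed in the snippets above. A minimal sketch of one way to build it, assuming the fortified Map_plot data frame from the merge step (each state's label position is approximated by the mean of its polygon coordinates):

```r
#Approximate label positions: mean of each state's polygon coordinates
#(assumes Map_plot from the merge step above; crude centroids, not exact)
name_lat_lon <- aggregate(cbind(long, lat) ~ NAME_1, data = Map_plot, FUN = mean)
```

For irregular shapes a true polygon centroid (e.g., via rgeos::gCentroid) would place labels better, but the mean of coordinates is usually close enough for a quick map.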

#Plot 3
ggplot() +
  geom_polygon(data = Map_plot,
               aes(x = long, y = lat, group = group, fill = score),
               color = "darkblue", size = 1) +
  coord_map() +
  scale_fill_distiller(name="Score", palette = "Set3", breaks = pretty_breaks(n = 7)) +
  theme_nothing(legend = TRUE) +
  labs(title="Score in India - Distribution by State") +
  geom_text(data=name_lat_lon, aes(long, lat, label = NAME_1), size=2) #name_lat_lon holds a label position for each state

We will now import external data and create choropleth maps for those data points. The dataset we are using provides the following information for all 36 states and union territories of India:

- ID
- State or union territory
- Population (2011 Census)
- Decadal growth (2001–2011)
- Area (km sq)
- Density (population per sq km)
- Sex ratio

The ID of each state is the same as the ID assigned in the Merged_data created earlier.

> d1 = read.csv(file.choose(), header = T)
> head(d1)
  ID    State.or.union.territory Population..2011.Census. Decadal.growth..2001.2011. Area..km.sq. Density..population.per.sq.km. Sex.ratio
1  1 Andaman and Nicobar Islands                   379944                      0.067         8249                       827.1412       908
2  2              Andhra Pradesh                 49386799                      0.111       162968                       365.1876       946
3  3           Arunachal Pradesh                  1382611                      0.259        83743                      1102.3931       916
4  4                       Assam                 31169272                      0.169        78438                      1029.2471       947
5  5                       Bihar                103804637                      0.251        94163                       235.5190       931
6  6                  Chandigarh                  1055450                      0.171          114                       554.6676       995

> #Merging with external source
> state_data2 <- data.frame(id=d1$ID, NAME_1=d1$State.or.union.territory, pop = d1$Population..2011.Census., growth=d1$Decadal.growth..2001.2011., area = d1$Area..km.sq., pop_density = d1$Density..population.per.sq.km., sex_ratio = d1$Sex.ratio)
> head(state_data2)
  id                      NAME_1       pop growth   area pop_density sex_ratio
1  1 Andaman and Nicobar Islands    379944  0.067   8249    827.1412       908
2  2              Andhra Pradesh  49386799  0.111 162968    365.1876       946
3  3           Arunachal Pradesh   1382611  0.259  83743   1102.3931       916
4  4                       Assam  31169272  0.169  78438   1029.2471       947
5  5                       Bihar 103804637  0.251  94163    235.5190       931
6  6                  Chandigarh   1055450  0.171    114    554.6676       995

#Merge the fortified file with the external data and reorder
merged_data2 <- merge(fortify_shape, state_data2, by="id", all.x=TRUE)
map_plot2 <- merged_data2[order(merged_data2$order), ]

ggplot() +
  geom_polygon(data = map_plot2,
               aes(x = long, y = lat, group = group, fill = pop/1000),
               color = "darkblue", size = 0.5) +
  coord_map() +
  scale_fill_distiller(name="Population", palette = "Set3") +
  theme_nothing(legend = TRUE) +
  labs(title="Population in India") +
  geom_text(data=name_lat_lon, aes(long, lat, label = NAME_1), size=2)

If we want to represent all 5 measures and see all the maps at once in a single chart, we can use the grid.arrange() function from the 'gridExtra' package, which lets us present multiple maps together. First we create the five maps we want to show, then call the function.

#Plotting multiple maps at once
plot1 = ggplot() +
  geom_polygon(data = map_plot2,
               aes(x = long, y = lat, group = group, fill = pop/1000),
               color = "darkblue", size = 0.5) +
  coord_map() +
  scale_fill_distiller(name="Population (in '000)", palette = "Set3") +
  theme_nothing(legend = TRUE) +
  labs(title="Population in India")

plot2 = ggplot() +
  geom_polygon(data = map_plot2,
               aes(x = long, y = lat, group = group, fill = growth*100),
               color = "darkblue", size = 0.5) +
  coord_map() +
  scale_fill_distiller(name="Decadal Growth (in %)", palette = "Set3") +
  theme_nothing(legend = TRUE) +
  labs(title="Decadal growth (in %) in India")

plot3 = ggplot() +
  geom_polygon(data = map_plot2,
               aes(x = long, y = lat, group = group, fill = area/1000),
               color = "darkblue", size = 0.25) +
  coord_map() +
  scale_fill_distiller(name="Area (in '000 Sq Km)", palette = "Set3") +
  theme_nothing(legend = TRUE) +
  labs(title="Area (in '000 sq km) in India")

plot4 = ggplot() +
  geom_polygon(data = map_plot2,
               aes(x = long, y = lat, group = group, fill = pop_density),
               color = "darkblue", size = 0.25) +
  coord_map() +
  scale_fill_distiller(name="Population Density", palette = "Set3") +
  theme_nothing(legend = TRUE) +
  labs(title="Population Density in India")

plot5 = ggplot() +
  geom_polygon(data = map_plot2,
               aes(x = long, y = lat, group = group, fill = sex_ratio),
               color = "darkblue", size = 0.25) +
  coord_map() +
  scale_fill_distiller(name="Sex Ratio", palette = "Set3") +
  theme_nothing(legend = TRUE) +
  labs(title="Sex Ratio (per '000 males) in India")

library(gridExtra)
grid.arrange(plot1, plot2, plot3, plot4, plot5)

The examples above show the flexibility and convenience that choropleth maps offer for presenting a measurement on a geographical base. I used the map of India as the base region, but the same process can be applied to any geography and dataset.

After going through the article, I am sure you will agree with the point I started with: choropleth maps are the best bet when we want to leave a strong impression on the audience in 15 seconds. Don't you?

The post Choropleth Maps in R appeared first on Perceptive Analytics.

]]>The post A Primer on Web Scraping in R appeared first on Perceptive Analytics.

]]>If you are a data scientist who wants to capture data from such web pages then you wouldn’t want to be the one to open all these pages manually and scrape the web pages one by one. To push away the boundaries limiting data scientists from accessing such data from web pages, there are packages available in R.

The more data you collect, the better your models. But what if the data you want resides on a website? This is a common problem in social media analysis, where the data comes from users posting content online and can be very unstructured. While some websites support data collection from their pages and even expose packages and APIs (such as Twitter), most web pages lack this capability and infrastructure. If you are a data scientist who wants to capture data from such pages, you wouldn't want to open them all manually and scrape them one by one. To push away the boundaries limiting data scientists from accessing such data, R provides packages based on a technique known as 'web scraping': a method to convert data, whether structured or unstructured, from HTML into a form on which analysis can be performed. Let us look into the web scraping technique using R.

Before diving into web scraping with R, be aware that, in my opinion, this is an advanced topic, and a working knowledge of R is absolutely necessary. Hadley Wickham authored the rvest package for web scraping in R, which I will demonstrate in this article. The package also requires the 'selectr' and 'xml2' packages to be installed. Let's install the package and load it first.

#Installing the web scraping package rvest
install.packages("rvest")
library(rvest)

The way rvest works is straightforward and simple. Much like the way you and I scrape web pages manually, rvest requires identifying the webpage link as the first step. The page is then read, and the appropriate tags need to be identified. HTML organizes its content using various tags and selectors; these selectors need to be identified and marked so that their content is stored by the rvest package. We can then convert all the scraped data into a data frame and perform our analysis. Let's take the example of capturing content from a blog page: the PGDBA wordpress blog for analytics. We will look at one of the pages from their experiences section. The link to the page is: http://pgdbablog.wordpress.com/2015/12/10/pre-semester-at-iim-calcutta/

As the first step mentioned earlier, I store the web address in a variable url and pass it to the read_html() function. The url is read into memory, similar to the way we read csv files with the read.csv() function.

#Specifying the url for the desired website to be scraped
url <- 'http://pgdbablog.wordpress.com/2015/12/10/pre-semester-at-iim-calcutta/'

#Reading the HTML code from the website
webpage <- read_html(url)

Web scraping starts after the url has been read. However, a web page can contain a lot of content and we may not need all of it; this is why web scraping targets specific content. For this, we use the selector gadget. The selector gadget has a Chrome extension and is used to pinpoint the names of the tags we want to capture. If you haven't used the selector gadget before, you can read about it with the command below, or install the gadget from http://selectorgadget.com/

#Know about the selector gadget
vignette("selectorgadget")

After installing the selector gadget, open the webpage and click on the content you want to capture. Based on the selection, the selector gadget generates the tag under which that content is stored in HTML. The content can then be scraped by passing this tag (also known as a CSS selector) to the html_nodes() function and converting the result to text with html_text(). The sample code in R looks like this:

#Using the CSS selector (using 'www.imdb.com' in this example; assumes webpage was read from an IMDb title page)
rating_html = html_nodes(webpage,'.imdb-rating') #'.imdb-rating' is taken from the selector gadget

#Converting the rating data to text
rating <- html_text(rating_html)

#Check the rating captured
rating

Simple! Isn’t it? Let’s take a step further and capture the content our target webpage!

I chose a blog page because it is all text and serves as a good starting example. Let's begin by capturing the date on which the article was posted. Using the selector gadget, clicking on the date revealed that the tag required to get this data was .entry-date

#Using CSS selectors to scrape the post date
post_date_html <- html_nodes(webpage,'.entry-date')

#Converting the post date to text
post_date <- html_text(post_date_html)

#Verify the date captured
post_date
"December 10, 2015"

It's an old post! The next step is to capture the headings. There are two of them here: the title of the article and the summary. Interestingly, both can be identified using the same tag, and the beauty of the rvest package is that it captures both headings in one go. Let's perform this step.

#Using CSS selectors to scrape the title and title summary sections
title_summary_html <- html_nodes(webpage,'em')

#Converting the title data to text
title_summary <- html_text(title_summary_html)

#Check the title of the article
title_summary[2]

#Read the title summary of the article
title_summary[1]

The main title is stored as the second value in the title_summary vector, while the first value contains the summary. With this, the only section remaining is the main content, which is most likely organized under paragraph tags. We will use the 'p' tag to capture all of it.

#Using CSS selectors to scrape the blog content
content_data_html <- html_nodes(webpage,'p')

#Converting the blog content data to text
content_data <- html_text(content_data_html)

#Let's see how much content we have captured
length(content_data) #the output is 38

We see that content_data has a length of 38, although the website shows only 11 paragraphs in the main content. The additional paragraphs captured are actually the comments, likes and other content after the main blog post. For our purposes, we will only read the first 11 values and not use the remaining text in our data frame.

#Reading the content of the article
content_data[1]
content_data[2]
content_data[3]
content_data[4]
content_data[5]
content_data[6]
content_data[7]
content_data[8]
content_data[9]
content_data[10]
content_data[11]

Since we have captured the comments section, let us see how many comments were made. The selector gadget helps us to know that the .fn tag can be used to note the names of people who commented on the article.

#Using CSS selectors to scrape the names of people who commented
comments_html <- html_nodes(webpage,'.fn')

#Converting the commenters to text
comments <- html_text(comments_html)

#Let's have a look at all the names
comments

#What is the total number of comments made?
length(comments) #8 comments

#How many different people made comments?
length(unique(comments)) #6 people

This is consistent with the article, where Gautam Kumar (the author of the article) and pgdbaunofficial (the page owner) made multiple comments. We will now convert our data into a data frame.

#Convert all the data into a data frame
first_blog <- data.frame(Date = post_date,
                         Title = title_summary[2],
                         Description = title_summary[1],
                         content = paste(content_data[1:11], collapse = ''),
                         commenters = length(comments))

#Checking the structure of the data frame
str(first_blog) #all the features are factors and need to be converted into appropriate types

This is a simple data frame with only five columns: date, title, description, content and number of commenters. As long as we remain on the same website, the same code can be reused for other articles; for a different website, we may need a different piece of code. Let's first try another post from the same blog. The link is https://pgdbablog.wordpress.com/2015/12/18/pgdba-chronicles-first-semester/

#Specifying the url for the desired website to be scraped
url <- 'http://pgdbablog.wordpress.com/2015/12/18/pgdba-chronicles-first-semester/'

#Reading the HTML code from the website
webpage <- read_html(url)

#Using CSS selectors to scrape the post date
post_date_html <- html_nodes(webpage,'.entry-date')

#Converting the post date to text
post_date <- html_text(post_date_html)

#Let's have a look at the date
post_date

#Using CSS selectors to scrape the title section
title_summary_html <- html_nodes(webpage,'em')

#Converting the title data to text
title_summary <- html_text(title_summary_html)

#Let's have a look at the titles
title_summary[1]
title_summary[2]
title_summary[3]
title_summary[4]
title_summary[5]
title_summary[6]

This one has six titles: the first is the summary, the next four are captions to the images, and the last is the title heading. We can capture the content and the comments similarly. Let's instead capture the images, which are new in this post. The selector gadget shows the images have tags from '.wp-image-51' to '.wp-image-54'. Let's download the last image. I am going to use an alternative approach, where I set up an html_session using the url.

#Setting an html_session
webpage <- html_session(url)

#Getting the image using the tag
Image_link <- webpage %>% html_nodes(".wp-image-54")

#Fetch the url to the image
img.url <- Image_link[1] %>% html_attr("src")

#Save the image as a jpeg file in the working directory
download.file(img.url, "test.jpg", mode = "wb")

For the last webpage, we will move out of the blog and use a more content-rich page. This time, we will capture data from the Moneyball page on the IMDb website. The link we need is: http://www.imdb.com/title/tt1210166/

IMDb stores its content in well-organized tags such as #titleDetails, #titleDidYouKnow, #titleCast, etc. This makes it easy to scrape the page by specifying whichever content we need. The cast is also displayed in the form of a table, so we can use a table tag to capture it. Let's capture the cast first using the titleCast tag and then using the table tag.

#Read the Moneyball page first (webpage currently points at the blog)
webpage <- read_html('http://www.imdb.com/title/tt1210166/')

#Scraping the cast using the titleCast tag
cast_html = html_nodes(webpage,"#titleCast .itemprop span")

#Convert to text
cast = html_text(cast_html)

#Let's see our cast for the Moneyball movie
cast

#Scraping the cast using the table tag
cast_table_html = html_nodes(webpage,"table")

#Converting the cast to a table
cast_table = html_table(cast_table_html)

#Checking the first table on the website, which represents the cast
cast_table[[1]]

We see that there are no major differences, the only one being that cast_table is formatted as a table because we used the html_table function instead of the html_text function.

We can do a lot with web scraping if we know the right way to do it. The rvest package makes it very easy to scrape pages and capture content in the form of data frames or files. Besides scraping blogs and rating websites, we can also automate mundane tasks such as collecting jobs from job websites or content from LinkedIn. Most web scraping tasks are focused on getting data from web pages and then using it for analysis. As an alternative, we can also scrape pages using XPath instead of CSS selectors; in that case the 'table' tag becomes the //table XPath expression, and data can be scraped in a similar fashion. Once converted to data frames, the captured data can be used for analysis to learn more about what is happening on social media today. In the end, the process remains the same: find the web page, identify the tags to be captured, convert them to text, and store them in a data frame. I'm sure this article made web scraping easier than when you first started reading it.
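The find-identify-extract loop just described is language-agnostic. Here is a minimal sketch of the same idea using Python's standard-library `html.parser` on a made-up page string (the class name and toy HTML are illustrative only, not part of rvest), mirroring what `html_nodes` followed by `html_text` does:

```python
from html.parser import HTMLParser

class TagTextExtractor(HTMLParser):
    """Collects the text content of every occurrence of a target tag."""
    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.inside = False
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.inside = True
            self.texts.append("")   # start a fresh text bucket for this node

    def handle_endtag(self, tag):
        if tag == self.tag:
            self.inside = False

    def handle_data(self, data):
        if self.inside:
            self.texts[-1] += data  # accumulate text inside the current node

# A toy page standing in for a downloaded blog post
page = "<html><body><p>Intro</p><em>First title</em><em>Second title</em></body></html>"
parser = TagTextExtractor("em")
parser.feed(page)
print(parser.texts)  # ['First title', 'Second title']
```

Swapping the tag name ("em", "table", a class selector in a real scraper) changes what gets collected, exactly as with `html_nodes`.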

**Here is the code for the web scraping tasks used in this article**

The post A Primer on Web Scraping in R appeared first on Perceptive Analytics.

The post Propensity Score Matching in R appeared first on Perceptive Analytics.


The concept of propensity score matching (PSM) was first introduced by Rosenbaum and Rubin (1983) in a paper entitled “The Central Role of the Propensity Score in Observational Studies for Causal Effects.”

Propensity scores are an alternative method to estimate the effect of receiving treatment when random assignment of treatments to subjects is not feasible. PSM refers to pairing treatment and control units with similar values of the propensity score, and possibly other covariates (the characteristics of participants), and discarding all unmatched units.

PSM is used in observational studies to remove the selection bias between the treatment and the control groups.

For example, say a researcher wants to test the effect of a drug on lab rats. He divides the rats into two groups and administers the drug to one of them, the treatment group; the other group is the control group. As it is an experiment, everything is controlled by the experimenter: the rats are genetically identical and grow up in the same environment, so he knows that any differences between the groups will be solely due to the drug.

However, this cannot be done with people, because people differ from each other: they come in different shapes, sizes, ages, ethnicities, and so on. At the same time, people are not so different that we cannot find any similarities between them. This is where propensity score matching comes in.

To give an example, suppose a marketer wants to measure the effect of a marketing campaign on buyers. He cannot judge whether the campaign alone influenced them to buy, because he does not know whether the people who participated in the campaign are equivalent to the people who did not. There might be other influential factors that led the buyers to buy, and the purchases might not be the effect of the campaign at all. In that case, he can use propensity score matching to estimate how much impact the campaign had on buyers and non-buyers.
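The pairing-and-discarding idea can be sketched in a few lines. Below is a toy greedy 1:1 nearest-score matcher in Python; the subject IDs and propensity scores are made up for illustration, and this is only a simplified stand-in for what a package like MatchIt does:

```python
# Hypothetical propensity scores (probability of responding to a campaign)
treated = {"T1": 0.72, "T2": 0.35, "T3": 0.58}
controls = {"C1": 0.70, "C2": 0.40, "C3": 0.55, "C4": 0.10}

pairs = {}
available = dict(controls)
# Match treated units from highest score down, greedily
for t_id, t_score in sorted(treated.items(), key=lambda kv: kv[1], reverse=True):
    # Pick the still-unmatched control whose score is closest
    c_id = min(available, key=lambda c: abs(available[c] - t_score))
    pairs[t_id] = c_id
    del available[c_id]  # discard: each control is used at most once

print(pairs)   # {'T1': 'C1', 'T3': 'C3', 'T2': 'C2'}
print(available)  # C4 stays unmatched and would be discarded
```

Unmatched units (here the hypothetical C4, whose score is far from every treated unit) are dropped, which is exactly the "discarding of all unmatched units" described above.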

I will now demonstrate a simple program showing how to do propensity score matching in R, using two packages: tableone and MatchIt.

```r
# Reading the raw data
> Data <- read.csv("Campaign_Data.csv", header = TRUE)
> dim(Data)
[1] 1000    4
```

The dataset contains 1,000 randomly simulated records of people with their demographic profile (age and income), Ad_Campaign_Response (whether they responded to the campaign: 1 = Responded, 0 = Not Responded) and the Bought column (1 = Purchased, 0 = Not Purchased).

```r
# To view the first few records in the dataset
> head(Data)
  Age Income Ad_Campaign_Response Bought
1  41    107                    1      1
2  56     75                    0      0
3  50     88                    0      0
4  22     94                    1      0
5  51     74                    1      0
6  45     51                    1      1
```

```r
# The treatment population
> Treats <- subset(Data, Ad_Campaign_Response == 1)
> colMeans(Treats)
                 Age               Income Ad_Campaign_Response               Bought
          45.7500000           79.2772277            1.0000000            0.8391089

# The control population
> Control <- subset(Data, Ad_Campaign_Response == 0)
> colMeans(Control)
                 Age               Income Ad_Campaign_Response               Bought
          45.0755034           79.4832215            0.0000000            0.1107383

> dim(Treats)
[1] 404   4
> dim(Control)
[1] 596   4

# Here we see we have 404 treated records and 596 control records.
```

To get a first estimate of the impact of the campaign on purchase, we can fit a linear model:

```r
> model_1 <- lm(Bought ~ Ad_Campaign_Response + Age + Income, data = Data)
> model_1

Call:
lm(formula = Bought ~ Ad_Campaign_Response + Age + Income, data = Data)

Coefficients:
         (Intercept)  Ad_Campaign_Response                   Age                Income
           0.2201474             0.7291988            -0.0014048            -0.0005798

> Effect <- model_1$coeff[2]
> Effect
Ad_Campaign_Response
           0.7291988
```

The model above suggests that responding to the ad campaign is associated with a 72.9 percentage-point increase in the probability of purchase.

Now let’s prepare a Logistic Regression model to estimate the propensity scores. That is, the probability of responding to the ad campaign.

```r
> pscores.model <- glm(Ad_Campaign_Response ~ Age + Income, family = binomial("logit"), data = Data)
> summary(pscores.model)

Call:
glm(formula = Ad_Campaign_Response ~ Age + Income, family = binomial("logit"),
    data = Data)

Deviance Residuals:
    Min      1Q  Median      3Q     Max
-1.1095 -1.0208 -0.9926  1.3373  1.4521

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.6332998  0.4505631  -1.406    0.160
Age          0.0065209  0.0063705   1.024    0.306
Income      -0.0006507  0.0041133  -0.158    0.874

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1349.2 on 999 degrees of freedom
Residual deviance: 1348.1 on 997 degrees of freedom
AIC: 1354.1

Number of Fisher Scoring iterations: 4

> Propensity_scores <- pscores.model
> Data$PScores <- pscores.model$fitted.values
> hist(Data$PScores[Data$Ad_Campaign_Response == 1], main = "PScores of Response = 1")
> hist(Data$PScores[Data$Ad_Campaign_Response == 0], main = "PScores of Response = 0")
```

```r
# Covariates we are using
> xvars <- c("Age", "Income")
> library(tableone)
> table1 <- CreateTableOne(vars = xvars, strata = "Ad_Campaign_Response", data = Data, test = FALSE)
```

We use the tableone package to summarize the data using the covariates stored in xvars. We stratify by the response variable, i.e., we check the balance in the dataset between the people who responded (treatment group) and those who did not respond (control group) to the campaign. `test = FALSE` states that we don't require a significance test; instead we just want to see the mean and standard deviation of the covariates in the results.

Then we are going to print the statistics and also see the Standardized Mean Differences (SMD) in the variables.

```r
> print(table1, smd = TRUE)
                    Stratified by Ad_Campaign_Response
                     0             1             SMD
  n                    596           404
  Age (mean (sd))    45.08 (10.02) 45.75 (10.33) 0.066
  Income (mean (sd)) 79.48 (15.80) 79.28 (15.55) 0.013
```

Here we see the summary statistics of the covariates: there are 596 subjects in the control group and 404 in the treatment group, along with the means and standard deviations of the variables.

In the last column we see the SMD. We should be careful about SMDs greater than 0.1, because those are the variables that show imbalance in the dataset, and that is where we actually need propensity score matching.
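For a continuous covariate, the SMD reported by tableone is conventionally the absolute difference in means divided by the pooled standard deviation. A quick check in Python against the numbers in the table above (a sketch, assuming this standard formula) reproduces both reported values:

```python
import math

def smd(mean_t, sd_t, mean_c, sd_c):
    """Standardized mean difference: |mean difference| over the pooled SD."""
    pooled_sd = math.sqrt((sd_t ** 2 + sd_c ** 2) / 2)
    return abs(mean_t - mean_c) / pooled_sd

# Figures from the stratified table (treatment vs control)
print(round(smd(45.75, 10.33, 45.08, 10.02), 3))  # Age    -> 0.066
print(round(smd(79.28, 15.55, 79.48, 15.80), 3))  # Income -> 0.013
```

Dividing by the pooled SD is what makes the measure scale-free, so a 0.1 threshold can be applied uniformly across covariates measured in different units.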

In our example, as it is a simulated dataset of just 1,000 records with only 2 covariates, we don't see any imbalance, but we will still go ahead with the matching and see the difference in the results.

We can now proceed with the matching algorithms with our ‘pscores.model’ and the estimated propensity scores. The matching algorithms create sets of participants for treatment and control groups. A matched set will consist of at least one person from the treatment group (i.e., people who responded to the ad campaign) and one from the control group (i.e., people who did not respond to the ad campaign) with similar propensity scores. The basic goal is to approximate a random experiment, eliminating many of the problems that come with observational data analysis.

There are various matching algorithms in R, namely *exact matching*, *nearest neighbor*, *optimal matching*, *full matching* and *caliper matching*. Let's try a couple of them:

1. *Exact matching* is a technique used to match individuals based on the exact values of covariates.

```r
> library(MatchIt)
> match1 <- matchit(pscores.model, method = "exact", data = Data)
> summary(match1, covariates = T)

Call:
matchit(formula = pscores.model, data = Data, method = "exact")

Sample sizes:
          Control Treated
All           596     404
Matched        99     104
Discarded     497     300

Matched sample sizes by subclass:
  Treated Control Total Age Income
1       1       1     2  47     95
2       2       1     3  41     89
3       1       1     2  35     68
4       1       1     2  38     78
5       1       1     2  59     69

> match1.data <- match.data(match1)
> View(match1.data)
```

Here we can see that at least one person in the treatment group (Ad_Campaign_Response = 1) has been matched with one person in the control group (Ad_Campaign_Response = 0).

```r
> table_match1 <- CreateTableOne(vars = xvars, strata = "Ad_Campaign_Response", data = match1.data, test = FALSE)
> print(table_match1, smd = TRUE)
                    Stratified by Ad_Campaign_Response
                     0             1             SMD
  n                     99           104
  Age (mean (sd))    46.04 (7.57)  45.51 (7.33)  0.071
  Income (mean (sd)) 79.45 (10.43) 79.82 (10.23) 0.035
```

We can see in the above results that 99 control subjects have been matched with 104 treatment subjects. As our sample dataset is fairly balanced, exact matching does not make much of a difference; in fact, the SMD numbers are slightly higher than the pre-matching numbers. We can try nearest neighbour matching and see the effect of that approach.

2. *Nearest Neighbour Matching* is an algorithm that matches individuals with controls (it could be two or more controls per treated unit) based on a distance.

```r
> match2 <- matchit(pscores.model, method = "nearest", ratio = 1, data = Data)
> plot(match2, type = "jitter")
```

> plot(match2, type="hist")

```r
# Extract the matched data first
> match2.data <- match.data(match2)
> table_match2 <- CreateTableOne(vars = xvars, strata = "Ad_Campaign_Response", data = match2.data, test = FALSE)
> print(table_match2, smd = TRUE)
                    Stratified by Ad_Campaign_Response
                     0             1             SMD
  n                    404           404
  Age (mean (sd))    45.63 (10.26) 45.75 (10.33)  0.012
  Income (mean (sd)) 79.27 (16.04) 79.28 (15.55) <0.001
```

The first thing we see above is that 404 controls have been matched to 404 treated subjects, and the SMDs of the covariates are now very small, so we can conclude that nearest neighbour matching did a good job. We now have a well balanced dataset to proceed with our further analysis.

We will now perform a t-test of our hypothesis that there is a higher chance of purchasing the product when people respond to the ad campaign.

Before performing a paired t-test, we have to create two subsets from the matched data, one for the treatment group and one for the control group.

```r
> y_trt <- match2.data$Bought[match2.data$Ad_Campaign_Response == 1]
> y_con <- match2.data$Bought[match2.data$Ad_Campaign_Response == 0]
```

Next we calculate the pairwise differences between the two subsets:

> difference <- y_trt - y_con

Then we perform a paired t-test on the differences. A paired t-test is just a regular t-test on the differences in the outcome of the matched pairs.

```r
> t.test(difference)

	One Sample t-test

data:  difference
t = 30.786, df = 403, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 0.6835709 0.7768252
sample estimates:
mean of x
 0.730198
```

From the results above we see a very small p-value, which makes the difference highly significant.

The point estimate (mean) is 0.73. That means the difference in the probability of buying the product when everyone responds to the ad campaign versus when no one responds is 0.73; in other words, the chance of buying is 73 percentage points higher when a person responds to the ad campaign.
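Numerically, a paired t-test is a one-sample t-test on the vector of pair differences, with t = mean / (sd / √n). A small Python sketch of that statistic on made-up differences (not the article's data):

```python
import math

def one_sample_t(diffs):
    """t statistic for H0: mean difference = 0."""
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    se = math.sqrt(var / n)                              # standard error of the mean
    return mean / se

# Toy matched-pair differences in the Bought outcome (hypothetical)
diffs = [1, 1, 0, 1, 1, 0, 1, 1]
print(round(one_sample_t(diffs), 3))  # ≈ 4.583
```

A large t (relative to the t distribution with n − 1 degrees of freedom) is what produces the tiny p-value seen in the R output above.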

Sources –

- www.statisticshowto.com/propensity-score-matching/
- pareonline.net/getvn.asp?v=19&n=18
- rstudio-pubs-static.s3.amazonaws.com/284461_5fabe52157594320921fc9e4d539ebc2.html
- Research paper on “Propensity Score Matching in Observational Studies”.
- Inferring Causal effects from Observational data – Jason Roy


The post Optimization Using R appeared first on Perceptive Analytics.


Optimization is a technique for finding the best possible solution to a given problem from all the feasible solutions, using a rigorous mathematical model. To start with an optimization problem, it is important to first identify an objective. An objective is a quantitative measure of performance, for example: to maximize profits, minimize time, minimize costs, or maximize sales.

**Unconstrained optimization**

In certain cases the variables can be freely selected within their full ranges. The optim() function in R can be used for 1-dimensional or n-dimensional problems. The general format of the optim() function is –

optim(par, fn, gr = NULL, ..., method = c("Nelder-Mead", "BFGS", "CG", "L-BFGS-B", "SANN", "Brent"), lower = -Inf, upper = Inf, control = list(), hessian = FALSE)

where par is the vector of starting values and fn is the function to be minimized.

We start off with an example. Let's define the objective function that we are looking to minimize –

```r
> f <- function(x) 4 * (x[1]-1)^2 + 7 * (x[2]-3)^2 + 30
> f
function(x) 4 * (x[1]-1)^2 + 7 * (x[2]-3)^2 + 30
```

Next we set the starting values for the parameters

```r
> c <- c(1, 1)
> c
[1] 1 1
```

The optimization function is invoked

```r
> r <- optim(c, f)
> r
$par
[1] 0.9999207 3.0001660

$value
[1] 30

$counts
function gradient
      69       NA

$convergence
[1] 0

$message
NULL
```

Next we check whether the optimization converged to a minimum. The easy way to do this is to check:

```r
> r$convergence == 0
[1] TRUE
```

The optimization has converged to a minimum. The optimal values of the input arguments can be obtained by:

```r
> r$par
[1] 0.9999207 3.0001660
```

The value of the objective function at the minimum is obtained by:

```r
> r$value
[1] 30
```
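optim()'s default Nelder-Mead method is derivative-free, but the idea of iteratively descending toward the minimum can be illustrated with plain gradient descent on the same objective. The Python sketch below (the step size is an arbitrary choice for illustration) recovers the same minimum, (1, 3) with value 30:

```python
# f(x) = 4*(x1-1)^2 + 7*(x2-3)^2 + 30, minimized by simple gradient descent
def grad(x1, x2):
    # Partial derivatives of f with respect to x1 and x2
    return 8 * (x1 - 1), 14 * (x2 - 3)

x1, x2 = 1.0, 1.0   # same starting point as the R example: c(1, 1)
step = 0.05
for _ in range(500):
    g1, g2 = grad(x1, x2)
    x1 -= step * g1
    x2 -= step * g2

value = 4 * (x1 - 1) ** 2 + 7 * (x2 - 3) ** 2 + 30
print(round(x1, 4), round(x2, 4), round(value, 4))  # 1.0 3.0 30.0
```

The added constant 30 does not move the minimizer; it only shifts the objective value, which is why the minimum value is exactly 30 at (1, 3).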

Here is a good definition from Techopedia: *“Linear programming is a mathematical method that is used to determine the best possible outcome or solution from a given set of parameters or list of requirements, which are represented in the form of linear relationships. It is most often used in computer modeling or simulation in order to find the best solution in allocating finite resources such as money, energy, manpower, machine resources, time, space and many other variables. In most cases, the ‘best outcome’ needed from linear programming is maximum profit or lowest cost.”*

An example of an LP problem is –

Maximize or Minimize objective function: *f(y1, y2) = g1.y1 + g2.y2*

Subject to inequality constraints:

*g11.y1 + g12.y2 <= p1*

*g21.y1 + g22.y2 <= p2*

*g31.y1 + g32.y2 <= p3*

*y1 >= 0, y2 >= 0*

A company wants to maximize the profit for two products A and B which are sold at $ 25 and $ 20 respectively. There are 1800 resource units available every day and product A requires 20 units while B requires 12 units. Both of these products require a production time of 4 minutes and total available working hours are 8 in a day. What should be the production quantity for each of the products to maximize profits?

An LP problem can either be a maximization problem or a minimization problem; the above problem is a maximization problem. Some of the steps that should be followed while defining an LP problem are –

- Identify the decision variables
- Write the objective function
- Mention the constraints
- Explicitly state the non-negativity restriction

Lets walk through the above example –

As already defined this is a maximization problem, first we define the objective function.

*max(Sales) = max(25y_{1} + 20y_{2})*

*where,*

- y_{1} is the number of units of Product A produced
- y_{2} is the number of units of Product B produced
- y_{1} and y_{2} are called the decision variables
- 25 and 20 are the selling prices of the products

We are trying to maximize the sales while finding out the optimum number of products to manufacture. Now we set the constraints for this particular LP problem. We are dealing with both resource and time constraints.

*20y_{1} + 12y_{2} <= 1800 (Resource constraint)*

*4y_{1} + 4y_{2} <= 480 (Time constraint: 8 hours × 60 minutes)*

*y_{1} >= 0, y_{2} >= 0 (Non-negativity restriction)*

There are two methods to solve an LP problem:

- Graphical Method
- Simplex Method

We will be solving this problem using the simplex method, but in R. We shall also explain another example with Excel's Solver. There are a couple of packages in R to solve LP problems. Some of the popular ones are –

- lpSolve
- lpSolveAPI

Let's use lpSolve for this problem. First we need to set the objective function, which has already been defined.

```r
> require(lpSolve)
Loading required package: lpSolve
> objective.in <- c(25, 20)
> objective.in
[1] 25 20
```

Next we create a 2-by-2 matrix for the constraint coefficients and set the constraint values.

```r
> const <- matrix(c(20, 12, 4, 4), nrow = 2, byrow = TRUE)
> const
     [,1] [,2]
[1,]   20   12
[2,]    4    4
> time_constraints <- (8*60)
> resource_constraints <- 1800
> time_constraints
[1] 480
> resource_constraints
[1] 1800
```

Now we create the equations that we have already defined by setting the right-hand side (rhs) and the direction of the constraints.

```r
> rhs <- c(resource_constraints, time_constraints)
> rhs
[1] 1800  480
> direction <- c("<=", "<=")
> direction
[1] "<=" "<="
```

The final step is to find the optimal solution. The syntax for the lpSolve package is –

*lp(direction, objective.in, const.mat, const.dir, const.rhs)*

```r
> optimum <- lp(direction = "max", objective.in, const, direction, rhs)
> optimum
Success: the objective function is 2625
> summary(optimum)
                 Length Class  Mode
direction        1      -none- numeric
x.count          1      -none- numeric
objective        2      -none- numeric
const.count      1      -none- numeric
constraints      8      -none- numeric
int.count        1      -none- numeric
int.vec          1      -none- numeric
bin.count        1      -none- numeric
binary.vec       1      -none- numeric
num.bin.solns    1      -none- numeric
objval           1      -none- numeric
solution         2      -none- numeric
presolve         1      -none- numeric
compute.sens     1      -none- numeric
sens.coef.from   1      -none- numeric
sens.coef.to     1      -none- numeric
duals            1      -none- numeric
duals.from       1      -none- numeric
duals.to         1      -none- numeric
scale            1      -none- numeric
use.dense        1      -none- numeric
dense.col        1      -none- numeric
dense.val        1      -none- numeric
dense.const.nrow 1      -none- numeric
dense.ctr        1      -none- numeric
use.rw           1      -none- numeric
tmp              1      -none- character
status           1      -none- numeric
```

Now we get the optimum values for y1 and y2, i.e., the number of units of product A and product B that should be manufactured.

```r
> optimum$solution
[1] 45 75
```

The maximum sales figure is –

```r
> optimum$objval
[1] 2625
```
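For a two-variable LP, the simplex answer can be sanity-checked by hand: the optimum of a linear objective over a convex feasible region lies at a corner point. A short Python check over the corner points of this problem reproduces lpSolve's answer:

```python
# Corner points of the feasible region for:
#   max 25*y1 + 20*y2
#   s.t. 20*y1 + 12*y2 <= 1800  (resource)
#        4*y1  + 4*y2  <= 480   (time)
#        y1, y2 >= 0
corners = [
    (0, 0),
    (90, 0),    # resource binding on the y1 axis: 20*90 = 1800
    (0, 120),   # time binding on the y2 axis: 4*120 = 480
    (45, 75),   # intersection of the two constraint lines
]

def feasible(y1, y2):
    return 20 * y1 + 12 * y2 <= 1800 and 4 * y1 + 4 * y2 <= 480

best = max((25 * y1 + 20 * y2, y1, y2) for y1, y2 in corners if feasible(y1, y2))
print(best)  # (2625, 45, 75)
```

Evaluating the objective at the four corners gives 0, 2250, 2400 and 2625, so the maximum sales of 2625 at (45, 75) matches the simplex solution.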

Below is a good example taken from lpSolve's SourceForge website –

“Suppose a farmer has 75 acres on which to plant two crops: wheat and barley. To produce these crops, it costs the farmer (for seed, fertilizer, etc.) $120 per acre for the wheat and $210 per acre for the barley. The farmer has $15000 available for expenses. But after the harvest, the farmer must store the crops while awaiting favourable market conditions. The farmer has storage space for 4000 bushels. Each acre yields an average of 110 bushels of wheat or 30 bushels of barley. If the net profit per bushel of wheat (after all expenses have been subtracted) is $1.30 and for barley is $2.00, how should the farmer plant the 75 acres to maximize profit?”

As we did in the previous example, let’s define the optimization function and the constraints.

*maximize*

*g = (110)(1.30)x + (30)(2.00)y = 143x + 60y*

*subject to*

*120x + 210y <= 15000*

*110x + 30y <= 4000*

*x + y <= 75*

*x >= 0, y >= 0*

We will be solving this problem in both Excel and R.

There is also a way to solve LP problems using Excel. First we have to define the number of acres to plant for each of these crops.

Initially they are zero; the optimum values can be obtained using the Solver plugin in Excel. Next we need to define the constraints, whose equations have already been defined.

Subject to,

The initial values for the number of used resources are set at zero. The main workhorse behind these LP problems is the SUMPRODUCT formula.

The sumproduct equation is basically a product of cost multiplied by the number of acres.

Now we define the maximization equation. In this case the profit needs to be maximized: the total profit cell seen in the image below.

The same sumproduct formula that was previously used is repeated again.

Now try changing the values of the wheat and barley cells: just put some random numbers in both cells and observe the total profit cell and the used cells for all the constraints.

As seen, we get a total profit figure; however, have a look at the used resources. Clearly the values in the used cells exceed the values in the available cells. We can try decreasing the number of acres and see what the resulting values will be.

The values are lower indeed, but this is still not the optimum solution. To get the optimum value for total profit, we use the Solver plugin. Navigate to the Data tab and see if the Solver option appears under the Analysis group.

If it is not present, follow these steps to install it –

- Navigate to the File tab and then click on Options
- Click on Add-Ins, then in the Manage box select Excel Add-ins and click Go

Once this is done, a popup should appear.

Select the Solver add-in and press OK.

There you go, now the Solver option should show up. Before we proceed further, let's set the number of acres to zero.

Now, set the objective to total profit

Set the –

- Solving method to simplex LP.
- The objective to max

Next add the three constraints,

Once done press solve, the following popup should show up along with the optimized values for the profit cell.

Press OK; now we have the number of acres to plant for wheat and barley, along with the profit value.

**Implementation in R using lpSolveAPI**

We previously used the lpSolve package; in this example we illustrate how to use the lpSolveAPI package. We start off by loading lpSolveAPI and creating a model object with zero constraints and two decision variables.

```r
> require(lpSolveAPI)
Loading required package: lpSolveAPI
> lprec <- make.lp(0, 2)
> lprec
Model name: 
            C1   C2
Minimize     0    0
Kind       Std  Std
Type      Real Real
Upper      Inf  Inf
Lower        0    0
```

Next we set the objective function to maximization.

```r
> lp.control(lprec, sense = "max")
$anti.degen
[1] "none"

$basis.crash
[1] "none"

$bb.depthlimit
[1] -50

$bb.floorfirst
[1] "automatic"

$bb.rule
[1] "pseudononint" "greedy"       "dynamic"      "rcostfixing"

$break.at.first
[1] FALSE

$break.at.value
[1] 1e+30

$epsilon
      epsb       epsd      epsel     epsint epsperturb   epspivot
     1e-10      1e-09      1e-12      1e-07      1e-05      2e-07

$improve
[1] "dualfeas" "thetagap"

$infinite
[1] 1e+30

$maxpivot
[1] 250

$mip.gap
absolute relative
   1e-11    1e-11

$negrange
[1] -1e+06

$obj.in.basis
[1] TRUE

$pivoting
[1] "devex"    "adaptive"

$presolve
[1] "none"

$scalelimit
[1] 5

$scaling
[1] "geometric"   "equilibrate" "integers"

$sense
[1] "maximize"

$simplextype
[1] "dual"   "primal"

$timeout
[1] 0

$verbose
[1] "neutral"
```

We define the objective function and the constraints

```r
> set.objfn(lprec, c(143, 60))
> add.constraint(lprec, c(120, 210), "<=", 15000)
> add.constraint(lprec, c(110, 30), "<=", 4000)
> add.constraint(lprec, c(1, 1), "<=", 75)
```

Let’s have a look at the LP problem that we have defined.

```r
> lprec
Model name: 
            C1   C2
Maximize   143   60
R1         120  210  <=  15000
R2         110   30  <=   4000
R3           1    1  <=     75
Kind       Std  Std
Type      Real Real
Upper      Inf  Inf
Lower        0    0
```

Solving

```r
> solve(lprec)
[1] 0
```

Getting the optimized profit value

```r
> get.objective(lprec)
[1] 6315.625
```

Finally, let's get the number of acres of wheat and barley to be planted.

```r
> get.variables(lprec)
[1] 21.875 53.125
```

Thus, to achieve the maximum profit ($6315.625), the farmer should plant 21.875 acres of wheat and 53.125 acres of barley.
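The same answer can be recovered by brute-force vertex enumeration, which works for any two-variable LP: intersect every pair of constraint boundaries, keep the feasible intersection points, and evaluate the objective at each. A Python sketch (requires Python 3.8+ for the `:=` operator):

```python
from itertools import combinations

# max 143x + 60y subject to the farmer's constraints,
# each written as (a, b, c) meaning a*x + b*y <= c
cons = [(120, 210, 15000), (110, 30, 4000), (1, 1, 75), (-1, 0, 0), (0, -1, 0)]

def intersect(c1, c2):
    """Solve a1*x + b1*y = r1, a2*x + b2*y = r2 by Cramer's rule (None if parallel)."""
    (a1, b1, r1), (a2, b2, r2) = c1, c2
    det = a1 * b2 - a2 * b1
    if det == 0:
        return None
    return ((r1 * b2 - r2 * b1) / det, (a1 * r2 - a2 * r1) / det)

def feasible(p):
    return all(a * p[0] + b * p[1] <= c + 1e-9 for a, b, c in cons)

vertices = [p for c1, c2 in combinations(cons, 2)
            if (p := intersect(c1, c2)) and feasible(p)]
best = max(vertices, key=lambda p: 143 * p[0] + 60 * p[1])
print(best, 143 * best[0] + 60 * best[1])  # (21.875, 53.125) 6315.625
```

Note that the binding constraints at the optimum are storage (110x + 30y = 4000) and acreage (x + y = 75); the budget constraint is slack.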

Now that we have figured out how to solve LP problems using Excel's Solver as well as the R packages lpSolve and lpSolveAPI, go ahead and start solving your own linear programming problems.

LP Problem sources –


The post Building Regression Models in R using Support Vector Regression appeared first on Perceptive Analytics.


The article studies the advantage of Support Vector Regression (SVR) over Simple Linear Regression (SLR) models. SVR uses the same basic idea as Support Vector Machine (SVM), a classification algorithm, but applies it to predict real values rather than a class. SVR acknowledges the presence of non-linearity in the data and provides a proficient prediction model. Along with a thorough understanding of SVR, we also provide the reader with hands-on experience of preparing the model in R. We perform SLR and SVR on the same dataset and make a comparison. The article is organized as follows: Section 1 provides a quick review of SLR and its implementation in R; Section 2 discusses the theoretical aspects of SVR, the steps to fit SVR in R and the basics of tuning an SVR model; Section 3 is the conclusion.

Simple Linear Regression (SLR) is a statistical method that examines the linear relationship between two continuous variables, X and Y. X is regarded as the independent variable while Y is regarded as the dependent variable. SLR discovers the best fitting line using the Ordinary Least Squares (OLS) criterion, which minimizes the sum of squared prediction errors. The prediction error is defined as the difference between the actual value (*Y*) and the predicted value (*Ŷ*) of the dependent variable. OLS minimizes the squared error function defined as follows:

*SE = Σ_{i=1}^{n} (Y_{i} – Ŷ_{i})^{2}*

SLR minimizes the *Squared Errors (SE)* to optimize the parameters of the linear model, α and β, thereby computing the best-fit line, which is represented as follows:

*Ŷ_{i} = α + βX_{i}*

Let us perform SLR on sample data with a single independent variable. We treat *X* as the independent variable and *Y* as the dependent variable. The data is in *.csv* format and can be downloaded by clicking here.

Now we use R to perform the analysis. The R script is provided side by side and is commented for better understanding of the reader. We start with the scatter plot shown in Figure 1. Please set the working directory in R using setwd( ) function and keep sample data in the working directory.

```r
## Prepare scatter plot
# Read data from .csv file
data = read.csv("SVM.csv", header = T)
head(data)

# Scatter Plot
plot(data, main = "Scatter Plot")
```

The first step is to visualize the data to obtain a basic understanding. The scatter plot suggests a negative relationship between *X* and *Y*. Let us try fitting a line on the scatter plot using the Ordinary Least Squares (OLS) method.

```r
## Add best-fit line to the scatter plot
# Fit linear model using OLS
model = lm(Y ~ X, data)

# Overlay best-fit line on scatter plot
abline(model)
```

We expect a negative relationship between X and Y. Equation (3) represents the linear model fitting our sample data. The values of *Y*, dependent variable, are obtained by plugging in the given values of *X*, independent variable.

Figure 2 shows the best-fit line for our data set. It can be observed that a linear fit is not able to capture the complete relationship between *X* and *Y*. In fact, no model can capture the complete relationship in a statistical relation; the idea is to strive for a reasonable prediction. The next step is to evaluate the fitted model. One of the widely used methods for assessing statistical models is Root Mean Square Error (RMSE), which quantifies the performance of a regression model. It measures the root of the mean of the squared errors and is calculated as shown in equation (4):

*RMSE = √( (1/n) Σ_{i=1}^{n} (Y_{i} – Ŷ_{i})^{2} )*   (4)

A lower value of RMSE implies that the predictions are close to the actual values, indicating better predictive accuracy.
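Equation (4) is straightforward to compute directly. A minimal Python sketch of the formula, with made-up predictions and actuals:

```python
import math

def rmse(predicted, actual):
    """Root of the mean of squared prediction errors."""
    errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(errors) / len(errors))

print(rmse([1.1, 1.9, 3.2], [1, 2, 3]))  # ≈ 0.1414
```

Because errors are squared before averaging, a single large miss raises RMSE more than several small ones, which is one reason it is a popular model-comparison metric.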

Before calculating RMSE for our example, let us look at the predicted values as estimated by the linear model. The actual values are shown in black while the predicted values are shown in blue in Figure 3. The R code is as follows:

```r
## Scatter plot displaying actual values and predicted values
# Scatter Plot
plot(data, pch = 16)

# Predict Y using Linear Model
predY <- predict(model, data)

# Overlay Predictions on Scatter Plot
points(data$X, predY, col = "blue", pch = 16)
```

Figure 3 provides a better understanding of RMSE. Ŷ_{i} and Y_{i} in RMSE (equation 4) are the blue and black dots respectively. The vertical distance between a black dot and the corresponding blue dot is the error term (ϵ_{i}). Let us now calculate RMSE for the linear model. In order to calculate RMSE in R, the "hydroGOF" package is required. The R code is as follows:

```r
## RMSE Calculation for linear model
# Install Package
install.packages("hydroGOF")

# Load Library
library(hydroGOF)

# Calculate RMSE
RMSE = rmse(predY, data$Y)
```

The computation using above R code shows RMSE to be 0.94 for the linear model. The absolute value of RMSE does not reveal much, but a comparison with alternate models adds immense value. We will try to improve RMSE using Support Vector Regression (SVR) but before that let us understand the theoretical aspects of SVR.

Support Vector Regression (SVR) works on principles similar to those of Support Vector Machine (SVM) classification. One can say that SVR is the adapted form of SVM for the case where the dependent variable is numerical rather than categorical. A major benefit of SVR is that it is a non-parametric technique: unlike SLR, whose results depend on the Gauss-Markov assumptions, the output model from SVR does not depend on the distributions of the underlying dependent and independent variables. Instead, the SVR technique depends on kernel functions. Another advantage of SVR is that it permits construction of a non-linear model without changing the explanatory variables, helping in better interpretation of the resultant model.

The basic idea behind SVR is not to care about the prediction as long as the error (ϵ_{i}) is less than a certain value. This is known as the *principle of maximal margin*, and it allows viewing SVR as a convex optimization problem. The regression can also be penalized using a cost parameter, which becomes handy to avoid over-fitting. In short, SVR is a useful technique that provides the user with high flexibility in terms of the distributions of the underlying variables, the relationship between the independent and dependent variables, and the control on the penalty term.

Now let us fit an SVR model on our sample data. The R package "e1071" is required to call the *svm* function. The R code is as follows:
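The "do not care about errors smaller than ε" idea corresponds to the ε-insensitive loss used by SVR. A minimal Python illustration (the ε value here is arbitrary; as noted later in the article, R's *svm* default is 0.1):

```python
# Epsilon-insensitive loss: errors inside the epsilon tube cost nothing,
# and only the excess beyond epsilon is penalized.
def eps_loss(actual, predicted, epsilon=0.25):
    return max(0.0, abs(actual - predicted) - epsilon)

print(eps_loss(3.1, 3.0))  # 0.0  (inside the tube: no penalty)
print(eps_loss(4.0, 3.0))  # 0.75 (1.0 error, minus the 0.25 tube)
```

Points with zero loss do not influence the fitted function at all; only the points outside the tube (the support vectors) do, which is what makes the margin "maximal".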

```r
## Fit SVR model and visualize using scatter plot
# Install Package
install.packages("e1071")

# Load Library
library(e1071)

# Scatter Plot
plot(data)

# Regression with SVM
modelsvm = svm(Y ~ X, data)

# Predict using SVM regression
predYsvm = predict(modelsvm, data)

# Overlay SVM Predictions on Scatter Plot
points(data$X, predYsvm, col = "red", pch = 16)
```

The white dots and the red dots represent the actual values and predicted values respectively. At first glance, the SVR model looks much better than the SLR model, as the predicted values are closer to the actual values. To gain a better understanding, let us try to understand and represent the constructed model.

The SVR technique relies on kernel functions to construct the model. Commonly used kernel functions are: a) Linear, b) Polynomial, c) Sigmoid and d) Radial Basis. While implementing SVR, the user needs to select the appropriate kernel function. Kernel selection is tricky and requires optimization techniques for the best choice; a discussion of kernel selection is outside the scope of this article. In the constructed SVR model, we used the automatic kernel selection provided by R, which chose the Radial Basis Function (RBF) kernel. Given a non-linear relation between the variables of interest and the difficulty of kernel selection, we would suggest that beginners use RBF as the default kernel. The kernel function transforms the data from a non-linear space to a linear space: the kernel trick allows the SVR to find a fit in that space, and the data is then mapped back to the original space. Now let us represent the constructed SVR model:

The values of the parameters W and b for our data are -4.47 and -0.06 respectively. The R code to calculate the parameters is as follows:

##Calculate parameters of the SVR model

#Find value of W
W = t(modelsvm$coefs) %*% modelsvm$SV
#Find value of b
b = modelsvm$rho

We have learnt that the real value of RMSE lies in the comparison of alternative models. In the SVR model, the predicted values are closer to the actual values, suggesting a lower RMSE value. Calculating RMSE allows us to compare the SVR model with the earlier constructed linear model: a lower RMSE for the SVR model would confirm that its performance is better than that of the SLR model. The R code for the RMSE calculation is as follows:

## RMSE for SVR Model

#Calculate RMSE
RMSEsvm = rmse(predYsvm, data$Y)
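Note that *rmse* is not a base-R function; the call above assumes a helper such as the one provided by a package like Metrics has been loaded. If it is not available, an equivalent can be defined in one line of base R (our own sketch):

```r
# Root Mean Square Error: square root of the mean squared residual
rmse <- function(predicted, actual) sqrt(mean((predicted - actual)^2))

rmse(c(1, 2, 3), c(1, 2, 5))  # only the last prediction is off, by 2
```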

RMSE for the SVR model is 0.433, much lower than the 0.94 computed earlier for the SLR model. By default, the *svm* function in R considers the maximum allowed error (ϵ_{i}) to be 0.1. In order to avoid over-fitting, the *svm* function also allows us to penalize the regression through a cost parameter. The SVR technique is thus flexible in terms of both the maximum allowed error and the penalty cost, and we can vary both parameters to perform a sensitivity analysis in an attempt to come up with a better model. We will now perform this sensitivity analysis by training many models with different values of the allowable error and the cost parameter. This process of searching for the best model is called tuning of the SVR model. The R code for tuning the SVR model is as follows:

## Tuning SVR model by varying values of maximum allowable error and cost parameter

#Tune the SVM model
OptModelsvm = tune(svm, Y~X, data = data,
                   ranges = list(epsilon = seq(0, 1, 0.1), cost = 1:100))
#Print optimum value of parameters
print(OptModelsvm)
#Plot the performance of SVM Regression model
plot(OptModelsvm)

The above R code tunes the SVR model by varying the maximum allowable error and the cost parameter. The *tune* function evaluates the performance of 1100 models (100*11), i.e. one for every combination of maximum allowable error (0, 0.1, …, 1) and cost parameter (1, 2, 3, …, 100). The OptModelsvm reports optimum values of epsilon and cost at 0 and 100 respectively. The plot below visualizes the performance of each of the models. The legend on the right displays the Mean Square Error (MSE); MSE is defined as (RMSE)^{2} and is also a performance indicator.
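The figure of 1100 models simply reflects the size of the search grid that *tune* sweeps over:

```r
# The grid tune() evaluates: 11 epsilon values x 100 cost values
grid <- expand.grid(epsilon = seq(0, 1, 0.1), cost = 1:100)
nrow(grid)  # 1100 candidate models
```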

The best model is the one with the lowest MSE. The darker the region, the lower the MSE and the better the model. In our sample data, MSE is lowest at epsilon = 0 and cost = 7. We do not have to do this step manually; R provides us with the best model from the set of trained models. The R code to select the best model and compute its RMSE is as follows:

## Select the best model out of 1100 trained models and compute RMSE

#Find out the best model
BstModel = OptModelsvm$best.model
#Predict Y using best model
PredYBst = predict(BstModel, data)
#Calculate RMSE of the best model
RMSEBst = rmse(PredYBst, data$Y)

The RMSE for the best model is 0.27, much lower than the 0.43 of the earlier fitted SVR model. We have successfully tuned the SVR model.

The next step is to represent the tuned SVR model. The values of the parameters W and b for the tuned model are -5.3 and -0.11 respectively. The R code to calculate the parameters is as follows:

##Calculate parameters of the Best SVR model

#Find value of W
W = t(BstModel$coefs) %*% BstModel$SV
#Find value of b
b = BstModel$rho

A comparison of RMSE for the two constructed SVR models, the initial SVR and the tuned SVR, helps us select the better model. Let us now visualize both models in a single plot to enhance our understanding. Figure 7 displays the combined plot. The R code is as follows:

## Plotting SVR Model and Tuned Model in same plot

plot(data, pch=16)
points(data$X, predYsvm, col = "blue", pch=3)
points(data$X, PredYBst, col = "red", pch=4)
points(data$X, predYsvm, col = "blue", pch=3, type="l")
points(data$X, PredYBst, col = "red", pch=4, type="l")

The tuned SVR model (red) fits the sample data much better than the earlier developed SVR model. We hope that this tutorial on Support Vector Regression (SVR) was useful. Please download the complete code by clicking here. We have provided code for each of the steps.

The article discusses the basic concepts of Simple Linear Regression (SLR) and Support Vector Regression (SVR). SVR is a useful and flexible technique that helps the user deal with the limitations pertaining to the distributional properties of the underlying variables, the geometry of the data, and the common problem of model overfitting. The choice of kernel function is critical for SVR modeling. We recommend that beginners use the linear and RBF kernels for linear and non-linear relationships respectively.

We observe that SVR is superior to SLR as a prediction method. SLR cannot capture the nonlinearity in a dataset, and SVR comes in handy in such situations. We find that SVR provides a good fit on nonlinear data. We also compute the Root Mean Square Error (RMSE) for both the SLR and SVR models to evaluate their performance.

Further, we explain the idea of tuning the SVR model, which is possible because the technique provides flexibility with respect to the maximum error and the penalty cost. Tuning the model is extremely important as it optimizes the parameters for the best prediction. In the end, we compare the performance of the SLR, SVR and tuned SVR models. As expected, the tuned SVR model provides the best prediction.

The post Building Regression Models in R using Support Vector Regression appeared first on Perceptive Analytics.

]]>The post Tips for Getting Started with Text Mining in R and Python appeared first on Perceptive Analytics.

]]>There is so much information lying in the text posts made by you, me and everyone else about all the trending topics today. In our respective firms, big or small, each of us collects data related to our business and stores it to analyze for various projects. At the same time, we all need this 'unstructured data' to know and understand more about our clients, our customers and the state of our company in the world today. However, working with this data is not easy: it is not structured, every piece does not carry all the information, and each part is unique. That is the nature of textual data. It needs to be processed first and converted into a form that is suitable for analysis. This is very similar to the databases we normally create, except that the data cannot be used directly and its volume is very large. This article opens up the world of text mining in a simple and intuitive way and provides great tips to get started with text mining.

The mammoth task of text mining becomes simple if you approach it with a plan in mind. Think about what you need to do with the text before going all out on it. What is your objective behind text mining? What sources of data do you want to use? How much data do you need for it to be sufficient? How do you plan to present your results? It is all about getting curious about your problem and breaking it into small fragments. Thinking through the problem also opens up your mind to the various situations you may encounter and ways to tackle them. You can then chart out a workflow and start pursuing the task.

There is no gold-standard procedure for text mining. You have to choose the method which is most convenient for your task. This is where factors such as efficiency, effectiveness, the type of problem and others come into play and help you decide the best candidate for your problem. Having decided on your path, you need to build your knowledge and skills in that language. I find the text mining techniques more intuitive in Python than in R, but R has some handy functions for tasks such as word counting and is richer in terms of packages available for text mining.

The usual process of text mining involves the following steps:

- Collect data, either from social media such as Twitter or from other websites; write code that can adjust to the specific type of text you collect and store
- Convert your data into readable text
- Remove special characters from the text (such as hashtags); you can add a hashtag-count feature if that is required
- Remove numbers from the text data (unless the problem requires numbers)
- Decide whether to keep all the data or remove some of it, such as all non-English text
- Convert all the text to lowercase (or uppercase) only, to ease analysis
- Remove stop words, i.e., words that carry no information for your analysis, such as articles and conjunctions
- Use word stemming to group similar words, such as 'keep' and 'keeping', which are the same word in different tenses
- Analyze the processed, stemmed words and visualize the results
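The cleaning steps above can be sketched in a few lines of base R. This is a toy example with made-up text; a real project would typically use the tm package's transformations instead:

```r
# Toy documents (hypothetical text)
text <- c("Loving #rstats for #textmining in 2017!!",
          "Text mining is fun")

clean <- tolower(text)                     # lowercase everything
clean <- gsub("#\\w+", "", clean)          # strip hashtags
clean <- gsub("[[:digit:]]+", "", clean)   # strip numbers
clean <- gsub("[[:punct:]]+", "", clean)   # strip remaining special characters
clean <- trimws(gsub("\\s+", " ", clean))  # squeeze extra whitespace
clean
```

Each step corresponds to one bullet above; stop-word removal and stemming would follow the same pattern with a word list and a stemmer.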

The steps are short and simple, but they all depend on the first step being executed well. You need to collect your data so that text mining can be performed on it. There are many ways to collect data, and one of the most popular sources is Twitter, which exposes APIs so that tweets can be mined using both R and Python. Besides Twitter, one can capture data from almost any website today, including e-commerce, movie and song websites. Some sites also contain preformatted repositories of text data, such as Project Gutenberg and various corpora. Google Trends and Yahoo also offer some analysis online.

Based on your tools and your project objective, you may use a different approach to convert the collected text into data. If you are using R, packages such as twitteR, tm and stringr are what you will likely use for most of the preprocessing. The nltk library and the Tweepy package are the equivalents in Python. Whichever language and package you use, make sure that you have enough resources and memory to handle the data. Text mining can be cumbersome simply because of the irrelevant text lying around in your data even after removing stop words. Using a good method to prepare the data will give you a lot of useful information when you apply modelling techniques to it.

You need to know your data before preprocessing it. Without knowing what your data looks like, you might carelessly remove text which might have been useful in your analysis. There are many standard methods and dictionaries for removing stopwords and assigning importance to words, but they may or may not apply to your data. For example, data about the government may include many words such as 'rule', 'govern' and 'politics' which you may deem unnecessary and want to remove, while reviews may include lots of 'hi' at the beginning which are not useful for a review dataset. It is always a good step to look at the source of the data and go through some of the text to check how the process you defined is transforming it into useful information.

Another way to explore text data is by creating a document term matrix. A document term matrix is an m*n matrix where the number of columns denotes the total number of unique words in the entire dataset and the number of rows denotes the total number of data points. Each cell thus represents the count of a particular word in that data point. This is a very large matrix and is later collapsed into a term-frequency matrix: from the document term matrix, one can count the total occurrences of each word in the dataset, and that is exactly what the term-frequency matrix stores. Other uses of the document term matrix include finding correlations between words, drawing a word cloud from term frequencies, or predicting patterns using modelling techniques. This exploration will further give you confidence on the best way to move forward with textual data analysis.
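As a toy illustration of the document term matrix described above, here is a base-R sketch on two tiny made-up documents (tm's DocumentTermMatrix does the same thing at scale):

```r
# Two tokenised toy documents
docs <- list(c("text", "mining", "is", "fun"),
             c("text", "analysis", "is", "useful"))

# Columns: unique words in the dataset; rows: documents
terms <- sort(unique(unlist(docs)))
dtm <- t(sapply(docs, function(d) table(factor(d, levels = terms))))
dtm

# Collapsing the matrix over documents gives the term-frequency counts
colSums(dtm)
```

Each cell holds the count of one word in one document, and the column sums are exactly the term-frequency summary the text describes.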

The primary objective of every machine learning and data science project is to find patterns in the data that are otherwise hard to find. You need to look for those interesting patterns, and this step should not scare you. It can be as simple as fitting a simple classifier to classify data points and checking its performance. This sets a benchmark while giving you an idea of the predictive ability of the data. At times, the data may be biased or have poor predictive power, and data quality checks can reveal this. For example, if I am collecting Twitter data on the basis of hashtags, I can divide my collected data into train and test datasets, keeping the hashtags as the dependent feature. If my prediction performance is not up to the mark, I need to go back a few steps, find the cause of the low performance, and then check how I am collecting or cleaning my data, as the case may be. Other ways of finding patterns involve associations: some data points may be related to each other, while others may show a similar or opposite pattern. If tweets are being used for text mining, there can be duplicate tweets because of retweeting, or debates going for or against a remark. Working with the data also exposes problems such as dealing with sarcasm or comments that convey mixed expressions. Without brushing through the data, it will be difficult to know how much of your data is affected by these problems and whether you should drop such data or use some technique to handle the situation.

The problem you are trying to solve may or may not be the first text mining problem in your company, but it is certainly not the first text mining problem in the world today. There are several data scientists out there who have worked on the same or a similar problem, and knowing what methods they followed and what they did differently will help you take your problem solving to the next level. Though not as frequent as in other domains, there are several analyses and projects being done in text mining, including finding trending topics, sentiment analysis on those topics, identifying remarks about your firm or product, and identifying grievances and appreciations. With the same data, more than one problem can be solved. More complex problems include NLP and topic modelling. I read about a fairly recent project in which some students predicted the next topic a group of people would discuss based on the current conversation. Many such new projects can be thought of and pursued in text mining, but since it is a new and hot field, always refer to other similar data and resources to complement your analysis and come up with strong insights.

As mentioned earlier, there are many problems which can be pursued using text mining, and more than one problem can be solved from the same data. With so much to present, it is good practice to come up with ways to present the results so that they are attractive to people. This is why most text mining results are visualized in the form of word clouds, sentiment studies and figures. There are packages and libraries for each such task, including wordcloud, ggplot2, igraph, text2vec, networkD3 and plotly in R, and NetworkX, matplotlib and plotly in Python. You can also use dedicated visualization tools such as Tableau or Power BI, which can help visualize your data in many more ways.

Visualizing results is not the end step in text mining projects. Since text is captured from online sources, it is constantly changing, and so is the data that is captured. With changing data come changing insights; hence, when the project is completed and accepted, it should be continuously updated with new data and new insights. These insights can be further enriched with the rate of change: over time, the change itself can be captured and used as a metric of progression, which becomes another longitudinal problem to solve. Beyond the variety of problems that can be pursued with text data, text mining is no easy feat. When you create a roadmap of collecting, cleaning and analyzing data, several obstacles will come your way: deciding whether to work with single-word frequencies in the document term matrix or with groups of words (known as n-grams), building your own visualization method to present your results, or managing memory. At the same time, new projects keep coming up in the area of text mining. The best way to learn is to face a problem hands-on and learn from the experience of working on it. We hope this article provides motivation to head into the world of text and start mining insightful nuggets of information.

The post Tips for Getting Started with Text Mining in R and Python appeared first on Perceptive Analytics.

]]>The post A Solution to Missing Data: Imputation Using R appeared first on Perceptive Analytics.

]]>

At times while working on data, one may come across missing values which can potentially lead a model astray. Handling missing values is one of a data analyst's worst nightmares. If the dataset is very large and the number of missing values is small (typically less than 5%, as the case may be), the missing observations can be ignored and the analysis performed on the rest of the data. Often, however, the number of missing values is too large for that. In such situations, a wise analyst 'imputes' the missing values instead of dropping them from the data.

Think of a scenario when you are collecting a survey data where volunteers fill their personal details in a form. For someone who is married, one’s marital status will be ‘married’ and one will be able to fill the name of one’s spouse and children (if any). For those who are unmarried, their marital status will be ‘unmarried’ or ‘single’.

At this point the name of their spouse and children will be missing values because they will leave those fields blank. This is just one genuine case. There can be cases as simple as someone simply forgetting to note down values in the relevant fields or as complex as wrong values filled in (such as a name in place of date of birth or negative age). There are so many types of missing values that we first need to find out which class of missing values we are dealing with.

Missing values are typically classified into three types – MCAR, MAR, and NMAR.

**MCAR** stands for Missing Completely At Random and is the rarest type of missingness, occurring when there is no systematic cause for the missing values. In other words, the missing values are unrelated to any feature, just as the name suggests.

**MAR** stands for Missing At Random and implies that the missingness can be completely explained by the data we already have. For example, males may be less likely to fill in a survey related to depression regardless of how depressed they are. Categorizing missing values as MAR comes from making an assumption about the data, and there is no way to prove whether the missing values really are MAR. Whenever the missing values are categorized as MAR or MCAR and are not too large in number, they can be safely ignored.

If the missing values are not MAR or MCAR, then they fall into the third category, known as Not Missing At Random, abbreviated as **NMAR**. The survey example discussed earlier falls into the NMAR category. The fact that a person's spouse name is missing can mean that the person is either not married or did not fill in the name willingly. Thus, the value is missing not out of randomness, and we may or may not know which case the person falls in. Who knows, the marital status of the person may also be missing!

If the analyst makes the mistake of ignoring all the records with the spouse name missing, he may end up analyzing only data from married people, leading to insights that are not completely useful because they do not represent the entire population. Hence, NMAR values necessarily need to be dealt with.

Data without missing values can be summarized by some statistical measures such as mean and variance. Hence, one of the easiest ways to fill or ‘impute’ missing values is to fill them in such a way that some of these measures do not change. For numerical data, one can impute with the mean of the data so that the overall mean does not change.

In this process, however, the variance decreases and changes. In some cases such as in time series, one takes a moving window and replaces missing values with the mean of all existing values in that window. This method is also known as method of moving averages.
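A minimal base-R sketch of mean imputation on a made-up vector shows both effects described above: the mean survives while the variance shrinks.

```r
x <- c(4, 8, NA, 6, NA, 10)

# Replace each NA with the mean of the observed values
x_imputed <- ifelse(is.na(x), mean(x, na.rm = TRUE), x)
x_imputed              # 4 8 7 6 7 10

mean(x, na.rm = TRUE)  # 7
mean(x_imputed)        # still 7: the mean is preserved
var(x, na.rm = TRUE)   # ~6.67
var(x_imputed)         # 4: the variance has decreased
```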

For non-numerical data, imputing with the mode is a common choice. If we had to predict the most likely value for non-numerical data, we would naturally pick the value that occurs most often, which is the mode, and it is equally simple to impute.
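Base R has no built-in mode function, so a small sketch of mode imputation on made-up categorical data might look like this:

```r
status <- c("single", "married", NA, "married", "single", NA, "married")

# The most frequent level is the mode
mode_val <- names(which.max(table(status)))
status[is.na(status)] <- mode_val
status  # NAs replaced by "married"
```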

In some cases, the values are imputed with zeros or very large values so that they can be differentiated from the rest of the data. Similarly, imputing a missing value with something that falls outside the range of values is also a choice.

An example of this would be imputing age with -1 so that such values can be treated separately. However, these shortcuts are used only for quick analysis. For models meant to generate business insights, missing values need to be handled in more reasonable ways, which also helps in filling in more plausible data for training the models.

In R, there are a lot of packages available for imputing missing values – the popular ones being Hmisc, missForest, Amelia and mice. The mice package which is an abbreviation for Multivariate Imputations via Chained Equations is one of the fastest and probably a gold standard for imputing values. Let us look at how it works in R.

The mice package in R is used to impute MAR values only. As the name suggests, mice uses multivariate imputations to estimate the missing values. Using multiple imputations helps in resolving the uncertainty for the missingness.

The package provides four different methods to impute values with the default model being linear regression for continuous variables and logistic regression for categorical variables. The idea is simple!

If any variable contains missing values, the package regresses it over the other variables and predicts the missing values. Some of the available models in mice package are:

- PMM (Predictive Mean Matching) – suitable for numeric variables
- logreg (Logistic Regression) – suitable for categorical variables with 2 levels
- polyreg (Bayesian polytomous regression) – suitable for unordered categorical variables with two or more levels
- Proportional odds model – suitable for ordered categorical variables with two or more levels

In R, I will use the NHANES dataset (National Health and Nutrition Examination Survey data by the US National Center for Health Statistics). We first load the required libraries for the session:

#Loading the mice package
library(mice)
#Loading the following packages for looking at the missing values
library(VIM)
library(lattice)
#Loading the data
data(nhanes)

The NHANES data is a small dataset of 25 observations, each having 4 features – age, bmi, hypertension status and cholesterol level. Let's see what the data looks like:

# First look at the data
str(nhanes)

'data.frame': 25 obs. of 4 variables:
 $ age: num 1 2 1 3 1 3 1 1 2 2 ...
 $ bmi: num NA 22.7 NA NA 20.4 NA 22.5 30.1 22 NA ...
 $ hyp: num NA 1 1 NA 1 NA 1 1 1 NA ...
 $ chl: num NA 187 187 NA 113 184 118 187 238 NA ...

The `str` function shows us that bmi, hyp and chl have NA values, which represent missing values. The age variable happens to have no missing values. The age values are only 1, 2 and 3, which indicate the age bands 20-39, 40-59 and 60+ respectively. These values are better represented as factors than as numerics, so let's convert them:

#Convert Age to factor
nhanes$age = as.factor(nhanes$age)

It’s time to get our hands dirty. Let’s observe the missing values in the data first. The mice package provides a function md.pattern() for this:

#understand the missing value pattern
md.pattern(nhanes)

   age hyp bmi chl
13   1   1   1   1  0
 1   1   1   0   1  1
 3   1   1   1   0  1
 1   1   0   0   1  2
 7   1   0   0   0  3
     0   8   9  10 27

The output can be understood as follows: the 1's and 0's under each variable represent its presence and missing state respectively. The numbers in the first column (13, 1, 3, 1, 7 here) give the number of rows with that pattern. For example, there are 3 cases where only chl is missing and all other values are present, and 7 cases where only the age variable is present and all others are missing. In all, there are 5 different missingness patterns. The VIM package is very useful for visualizing these missing values.
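What md.pattern() summarises can be reproduced by hand on a toy data frame (our own sketch; a 1 marks a present value, a 0 a missing one):

```r
df <- data.frame(a = c(1, NA, 3, NA, 5),
                 b = c(NA, NA, 6, 8, 10))

# Encode each row's pattern of present (1) / missing (0) values
patterns <- apply(!is.na(df), 1, function(r) paste(as.integer(r), collapse = ""))
table(patterns)  # counts of each distinct missingness pattern
```

Here "11" counts the complete rows, "10" the rows with only b missing, and so on, which is exactly the row-count column md.pattern() prints.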

#plot the missing values
nhanes_miss = aggr(nhanes, col=mdc(1:2), numbers=TRUE, sortVars=TRUE,
                   labels=names(nhanes), cex.axis=.7, gap=3,
                   ylab=c("Proportion of missingness","Missingness Pattern"))

We see that the variables have missing values from 30-40%. It also shows the different types of missing patterns and their ratios. The next thing is to draw a margin plot which is also part of VIM package.

#Drawing margin plot
marginplot(nhanes[, c("chl", "bmi")], col = mdc(1:2), cex.numbers = 1.2, pch = 19)

The margin plot shows two features at a time. The red box plot shows the distribution of one feature when the other is missing, while the blue box plot shows its distribution when the other feature is present. This plot is useful for assessing whether the missing values are MCAR: for MCAR values, the red and blue boxes will be identical.

Let’s try to apply mice package and impute the chl values:

#Imputing missing values using mice

mice_imputes = mice(nhanes, m=5, maxit = 40)

I have used three parameters here. The first is the dataset. The second, m, is the number of imputed datasets to create; I have used the default value of 5, which means that I now have 5 imputed datasets. Every dataset was created after a maximum of 40 iterations, which is indicated by the "maxit" parameter.

Let’s see the methods used for imputing:

#What methods were used for imputing
mice_imputes$method

  age   bmi   hyp   chl
   "" "pmm" "pmm" "pmm"

Since all the variables with missing values were numeric, the package used pmm for them (age, now a factor, has no missing values and therefore no method). Let's look at our imputed values for chl:

#Look at the imputed values for chl in each of the 5 datasets
mice_imputes$imp$chl

     1   2   3   4   5
1  187 118 238 187 187
4  186 204 218 199 204
10 187 186 199 204 284
11 131 131 199 238 199
12 131 187 229 204 186
15 131 238 187 187 187
16 118 199 118 131 187
20 206 184 184 218 229
21 131 118 113 113 131
24 199 218 206 218 206

We have 10 missing values, in the rows indicated by the first column. The next five columns show the imputed values from each of the five datasets. To fill the missing data, we have to decide which dataset to use; that choice is then passed to the complete() function. I will impute the missing values from the fifth dataset in this example.

#Imputed dataset
Imputed_data = complete(mice_imputes, 5)

The values are imputed, but how good are they? The xyplot() and densityplot() functions come into the picture and help us verify our imputations.

#Plotting and comparing values with xyplot()
xyplot(mice_imputes, bmi ~ chl | .imp, pch = 20, cex = 1.4)

Here again, the blue points are the observed data and the red points are the imputed data. The red points should ideally be similar to the blue ones, indicating that the imputed values follow the same distribution. We can also look at a density plot of the data.

#make a density plot
densityplot(mice_imputes)

Just as with the xyplot(), the red imputed densities should be similar to the blue observed densities for the imputations to be plausible under MAR.

Imputing missing values is just the starting step in data processing. Using the mice package, I created 5 imputed datasets but used only one to fill the missing values. Since all of them were imputed differently, a more robust model can be developed by using all five imputed datasets for modelling. With this in mind, I can use two functions – with() and pool().

The with() function can be used to fit a model on all the datasets at once, as in the following linear model example:

#fit a linear model on all datasets together
lm_5_model = with(mice_imputes, lm(chl ~ age + bmi + hyp))
#Use the pool() function to combine the results of all the models
combo_5_model = pool(lm_5_model)

The mice package is a very fast and useful package for imputing missing values. It can impute almost any type of data and do it multiple times to provide robustness. We can also use with() and pool() functions which are helpful in modelling over all the imputed datasets together, making this package pack a punch for dealing with MAR values.

The post A Solution to Missing Data: Imputation Using R appeared first on Perceptive Analytics.

]]>