Intuition behind Naive Bayes

Continuous Naive Bayes

Naive Bayes is a supervised machine learning algorithm. As the name implies, it is based on Bayes’ theorem. In this post, you will discover what happens behind the Naive Bayes classifier when you are dealing with continuous predictor variables.

I have used R for the code here. Let us see what goes on behind the scenes in the naiveBayes function when the features (predictor variables) are continuous.

Understanding Bayes’ theorem

A strong foundation in Bayes’ theorem and in probability functions (density function and distribution function) is essential if you really want to get the intuition behind the Naive Bayes algorithm.

(You are free to skip this section if you are comfortable with Bayes’ theorem; you may jump to the next section, “How is probability calculated in Naive Bayes?”)

Bayes’ theorem is all about finding a probability (we call it the posterior probability) based on certain other probabilities which we know in advance.

As per the theorem,

P(A|B) = P(A) P(B|A)/P(B)

  • P(A|B) and P(B|A) are called conditional probabilities, where P(A|B) means how often A happens given that B happens.
  • P(A) and P(B) are called marginal probabilities, which say how likely A or B is on its own (the probability of an event, irrespective of the outcomes of other random variables).

P(A|B) is what we are going to predict, hence it is also called the posterior probability.
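As a quick numeric illustration of the formula, here is a minimal sketch in R. The three input probabilities are made-up values chosen only to show the arithmetic; P(B) is obtained from them by the law of total probability.

```r
# Hypothetical inputs, chosen only to illustrate the arithmetic
p_A      <- 0.30   # P(A): marginal probability of A
p_B_A    <- 0.80   # P(B|A): probability of B given A
p_B_notA <- 0.20   # P(B|not A): probability of B given that A did not happen

# P(B) by the law of total probability
p_B <- p_B_A * p_A + p_B_notA * (1 - p_A)

# Bayes' theorem: P(A|B) = P(A) P(B|A) / P(B)
p_A_B <- p_A * p_B_A / p_B

print(p_B)
print(p_A_B)
```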

Now, in the real world we would have many predictor variables and several classes. For easy mapping, let us call the classes C1, C2,…, Ck and the predictor variables (the feature vector) x1,x2,…,xn.

Then, using Bayes’ theorem, we measure the conditional probability that an observation with feature vector x1,x2,…,xn belongs to a particular class Ci.

We can formulate the posterior probability P(c|x) from P(c), P(x) and P(x|c) as given below

P(c|x) = P(c) P(x|c)/P(x)

How is probability calculated in Naive Bayes?

Usually we use the e1071 package to build a Naive Bayes classifier in R. Then, using this classifier, we make some predictions on the test data.

If the features are categorical, the probabilities for these predictions can be calculated directly from the frequency of occurrences.

But what if there are features with continuous values? What is the Naive Bayes classifier actually doing behind the scenes to predict the probabilities for continuous data?

It is nothing but the use of probability density functions. Here Naive Bayes fits a Gaussian (normal) distribution to each predictor variable within each class. The distribution is characterized by two parameters, its mean and standard deviation. Then, based on the mean and standard deviation of each predictor variable, the likelihood of a value ‘x’ is calculated using the probability density function. (The probability density function gives the relative likelihood of observing a measurement with a specific value. It is a density, not a probability, so it can exceed 1.)

The normal distribution (bell curve) has density

f(x) = 1/(σ√(2π)) · exp(−(x − μ)²/(2σ²))

where μ is the mean of the distribution and σ the standard deviation.

f(x), the probability density at a value ‘x’, can be calculated by hand from this formula, or in R with the dnorm function.

So, in short, once we know the distribution’s parameters (mean and standard deviation in the case of normally distributed data) we can calculate the density at any value.

dnorm function in R

You can mirror what the naiveBayes function is doing by calling dnorm(x, mean=, sd=) for each class of outcomes (remember, the class variable is categorical and the features can be a mix of continuous and categorical). dnorm in R evaluates the normal probability density function.

The dnorm function in R is the backbone of continuous naiveBayes.
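To make the link between the density formula and dnorm concrete, here is a minimal sketch. The helper name gaussian_density and the values of x, mu and sigma are assumptions made up for illustration; only dnorm comes from R itself.

```r
# Gaussian density written out from the formula
gaussian_density <- function(x, mu, sigma) {
  1 / (sigma * sqrt(2 * pi)) * exp(-(x - mu)^2 / (2 * sigma^2))
}

# Hypothetical feature value and class parameters, for illustration only
x     <- 6.9
mu    <- 6.5
sigma <- 0.6

by_formula <- gaussian_density(x, mu, sigma)
by_dnorm   <- dnorm(x, mean = mu, sd = sigma)

print(c(formula = by_formula, dnorm = by_dnorm))
print(all.equal(by_formula, by_dnorm))

# A density is not a probability: with a small sd it can be larger than 1
narrow_peak <- dnorm(0, mean = 0, sd = 0.1)
print(narrow_peak)
```

Both routes give the same number, which is why dnorm is enough to reproduce the continuous part of the classifier by hand.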

Understanding the intuitions behind continuous Naive Bayes – with iris data in R

Let us consider the iris data in R.

The iris dataset contains three plant species (setosa, virginica, versicolor) and four features (Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) measured for each sample.

First we will build the model using the naiveBayes function in the e1071 package. Then, given a set of features, say Sepal.Length=6.9, Sepal.Width=3.1, Petal.Length=5.0, Petal.Width=2.3, we will predict the species.

Here is the complete code using the naiveBayes function for predicting the species.

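# Installing and loading the required packages (sample.split comes from caTools)
install.packages(c("e1071", "caTools"))
library(e1071)
library(caTools)

# Read the dataset
data(iris)

# Studying the structure of the data
str(iris)

# Partitioning the dataset into training set and test set
split = sample.split(iris$Species, SplitRatio = 0.7)
Trainingset = subset(iris, split == TRUE)
Testset = subset(iris, split == FALSE)

# Fitting the Naive Bayes model to the training set
classifier = naiveBayes(x = Trainingset[, -5], y = Trainingset$Species)

# Predicting on test data
Y_Pred = predict(classifier, newdata = Testset)

# Confusion matrix
table(Testset$Species, Y_Pred)

# Problem: given a set of features, find the species they belong to
# Defining a new set of data (features) to check the classification
sl = 6.9
sw = 3.1
pl = 5.0
pw = 2.3
newfeatures = data.frame(Sepal.Length = sl, Sepal.Width = sw, Petal.Length = pl, Petal.Width = pw)
Y_Pred = predict(classifier, newdata = newfeatures)
Y_Pred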

On executing the code, you can see that the species predicted by the naiveBayes function is virginica.

And now here comes the most interesting part, what’s going on behind the scenes:

We know that Naive Bayes predicts the results using probability density functions in the back end.

We are going to find the probabilities directly, using the dnorm function for each class. The result has to be the same as that predicted by the naiveBayes function.

For a given set of features,

  1. Based on the mean and standard deviation, the conditional probability is derived.
  2. Then, applying Bayes’ theorem, the probability of each species under the given set of predictor variables is derived and the values are compared against each other.
  3. The species with the highest probability is the predicted result.

Here is the complete code for the prediction by hand (with the dnorm function)



# Loading the required packages (dplyr provides %>% and summarise)
library(e1071)
library(caTools)
library(dplyr)

# Read the dataset
data(iris)

# Studying the structure of the data
str(iris)

# Partitioning the dataset into training set and test set
split = sample.split(iris$Species, SplitRatio = 0.7)
Trainingset = subset(iris, split == TRUE)
Testset = subset(iris, split == FALSE)

# Fitting the Naive Bayes model to the training set
classifier = naiveBayes(x = Trainingset[, -5], y = Trainingset$Species)

# Problem: given a set of features, find the species they belong to
# Defining a new set of data (features) to check the classification
sl = 6.9
sw = 3.1
pl = 5.0
pw = 2.3

# Finding the class prior probabilities of each species
PriorProb_Setosa = mean(Trainingset$Species == "setosa")
PriorProb_Virginica = mean(Trainingset$Species == "virginica")
PriorProb_versicolor = mean(Trainingset$Species == "versicolor")

# Species-wise mean and standard deviation of each feature,
# used for the conditional probabilities (likelihoods)
Setosa = subset(Trainingset, Trainingset$Species == "setosa")
Virginica = subset(Trainingset, Trainingset$Species == "virginica")
Versicolor = subset(Trainingset, Trainingset$Species == "versicolor")

Set = Setosa %>% summarise(mean(Sepal.Length), mean(Sepal.Width), mean(Petal.Length), mean(Petal.Width),
                           sd(Sepal.Length), sd(Sepal.Width), sd(Petal.Length), sd(Petal.Width))
Vir = Virginica %>% summarise(mean(Sepal.Length), mean(Sepal.Width), mean(Petal.Length), mean(Petal.Width),
                              sd(Sepal.Length), sd(Sepal.Width), sd(Petal.Length), sd(Petal.Width))
Ver = Versicolor %>% summarise(mean(Sepal.Length), mean(Sepal.Width), mean(Petal.Length), mean(Petal.Width),
                               sd(Sepal.Length), sd(Sepal.Width), sd(Petal.Length), sd(Petal.Width))

# Density of each new feature value under the setosa distributions
Set_sl = dnorm(sl, mean = Set$`mean(Sepal.Length)`, sd = Set$`sd(Sepal.Length)`)
Set_sw = dnorm(sw, mean = Set$`mean(Sepal.Width)`, sd = Set$`sd(Sepal.Width)`)
Set_pl = dnorm(pl, mean = Set$`mean(Petal.Length)`, sd = Set$`sd(Petal.Length)`)
Set_pw = dnorm(pw, mean = Set$`mean(Petal.Width)`, sd = Set$`sd(Petal.Width)`)

# The denominator would be the same for all three probabilities, so we can ignore it in the calculations
ProbabilitytobeSetosa = Set_sl * Set_sw * Set_pl * Set_pw * PriorProb_Setosa

# Same calculation for virginica
Vir_sl = dnorm(sl, mean = Vir$`mean(Sepal.Length)`, sd = Vir$`sd(Sepal.Length)`)
Vir_sw = dnorm(sw, mean = Vir$`mean(Sepal.Width)`, sd = Vir$`sd(Sepal.Width)`)
Vir_pl = dnorm(pl, mean = Vir$`mean(Petal.Length)`, sd = Vir$`sd(Petal.Length)`)
Vir_pw = dnorm(pw, mean = Vir$`mean(Petal.Width)`, sd = Vir$`sd(Petal.Width)`)

ProbabilitytobeVirginica = Vir_sl * Vir_sw * Vir_pl * Vir_pw * PriorProb_Virginica

# Same calculation for versicolor
Ver_sl = dnorm(sl, mean = Ver$`mean(Sepal.Length)`, sd = Ver$`sd(Sepal.Length)`)
Ver_sw = dnorm(sw, mean = Ver$`mean(Sepal.Width)`, sd = Ver$`sd(Sepal.Width)`)
Ver_pl = dnorm(pl, mean = Ver$`mean(Petal.Length)`, sd = Ver$`sd(Petal.Length)`)
Ver_pw = dnorm(pw, mean = Ver$`mean(Petal.Width)`, sd = Ver$`sd(Petal.Width)`)

ProbabilitytobeVersicolor = Ver_sl * Ver_sw * Ver_pl * Ver_pw * PriorProb_versicolor

# Comparing the three values; the highest one gives the predicted species
ProbabilitytobeSetosa
ProbabilitytobeVirginica
ProbabilitytobeVersicolor


On executing this code, you can see that the value for virginica is higher than that of the other two. This implies that the given set of features belongs to the class virginica, which is the same result that the naiveBayes function predicted.

A-priori probabilities and conditional probabilities

When you print the fitted model in R for continuous (numeric) variables, you will see tables titled A-priori probabilities and Conditional probabilities in the console.

The table titled A-priori probabilities gives the prior probability of each class, P(c), in your training set. This is the class distribution in the data (‘a priori’ is Latin for ‘from before’), which can be calculated straight away from the number of occurrences as below,

P(c) = n(c)/n(S), where P(c) is the probability of class “c”, n(c) is the number of training samples that belong to class c, and n(S) is the total number of samples in the training set.

For continuous features, the table of conditional probabilities does not show probabilities but the distribution parameters (the mean and standard deviation of the continuous data for each class). Remember, if the features were categorical, this table would show the probability values themselves.

Some Additional points to keep in mind

1. Rather than calculating the tables by hand you may just use the naiveBayes results themselves

In the script above using dnorm, I calculated the mean and standard deviation by hand. Instead, you can read them from the fitted model with a simple line of code.

For example if you want to see the mean and standard deviation of sepal length for each species, just run this
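A minimal sketch of such a call follows. It rests on two assumptions that are not from the scripts above: the model is fitted on the full iris data for brevity (the object name fit is hypothetical), and the e1071 part is skipped if the package is not installed. The base R lines compute the same per-species mean and standard deviation as a cross-check.

```r
# Per-species mean and sd of Sepal.Length, computed with base R on the full iris data
by_hand <- cbind(
  mean = tapply(iris$Sepal.Length, iris$Species, mean),
  sd   = tapply(iris$Sepal.Length, iris$Species, sd)
)
print(by_hand)

# The same numbers as stored in the fitted e1071 model
if (requireNamespace("e1071", quietly = TRUE)) {
  fit <- e1071::naiveBayes(x = iris[, -5], y = iris$Species)
  print(fit$tables$Sepal.Length)   # column 1 = mean, column 2 = sd, one row per species
  print(fit$apriori)               # class counts behind the A-priori probabilities
}
```

With the scripts above you would read the same table from classifier$tables$Sepal.Length, which is based on Trainingset only.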


2. Dropping the denominator (p(x)) in the probability calculations

Have you noticed that I dropped the denominator in the probability calculations?

That is because the denominator (p(x)) is the same for every class when we compare the probabilities under the specified features, so we can just get rid of it and compare only the top parts of the calculation. Keep in mind that this works only because we are comparing the classes against each other. If we need the actual probability values, the denominator must not be omitted.
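If the actual posterior probabilities are needed, the numerators can be divided by their sum, which is p(x) under the model. The sketch below is an illustration in base R and uses the full iris data instead of a training split (an assumption made to keep it short); the variable names are hypothetical.

```r
# New observation used in the article
new_x <- c(Sepal.Length = 6.9, Sepal.Width = 3.1, Petal.Length = 5.0, Petal.Width = 2.3)

# Numerator of Bayes' theorem per species: prior * product of Gaussian densities
numerator <- sapply(levels(iris$Species), function(sp) {
  rows  <- iris[iris$Species == sp, names(new_x)]
  prior <- mean(iris$Species == sp)
  dens  <- mapply(function(col, value) dnorm(value, mean = mean(col), sd = sd(col)),
                  rows, new_x)
  prior * prod(dens)
})

# Dividing by the sum of the numerators (the denominator p(x)) gives the posteriors
posterior <- numerator / sum(numerator)

print(round(posterior, 4))
print(names(which.max(posterior)))
```

The ranking of the classes is the same before and after the division; only the scale changes.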

3. What if the continuous data is not normally distributed?

There are of course other distributions, like Bernoulli, multinomial etc., and not just the Gaussian distribution alone. But the logic behind all of them is the same: assume the feature follows a certain distribution, estimate the parameters of that distribution, and then use its probability density (or mass) function.

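4. Kernel based densities

Kernel based densities may perform better when continuous variables are not normally distributed, and they might improve the test accuracy. The naiveBayes function in e1071 does not offer this option; other implementations do, for example NaiveBayes in the klaR package accepts ‘usekernel = TRUE’ while building the model.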
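To show what a kernel based density means for a single feature, here is a minimal sketch in base R. It uses density() with its default Gaussian kernel and bandwidth and interpolates the estimate at a new value; this is an illustration of the idea, not the internals of any particular package, and the variable names are hypothetical.

```r
# Petal length of one class, taken from the full iris data
petal_len_versicolor <- iris$Petal.Length[iris$Species == "versicolor"]

# Kernel density estimate (Gaussian kernel, default bandwidth)
kde <- density(petal_len_versicolor)

# Evaluate the estimated density at a new value by interpolating the KDE grid
x_new <- 5.0
kernel_density <- approx(kde$x, kde$y, xout = x_new)$y

# The Gaussian assumption gives a density from the mean and sd only
gaussian_density <- dnorm(x_new,
                          mean = mean(petal_len_versicolor),
                          sd   = sd(petal_len_versicolor))

print(c(kernel = kernel_density, gaussian = gaussian_density))
```

In a kernel based Naive Bayes, the value from the kernel estimate would take the place of the dnorm value in the product of likelihoods.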

5. Discretization strategy for continuous Naive Bayes

For estimating the class-conditional probabilities of continuous attributes in Naive Bayes classifiers, we can also use a discretization strategy. Discretization works by breaking the data into categorical values. This approach, which transforms the continuous attributes into ordinal attributes, is not covered in this article at present.
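As a brief illustration only, discretization can be sketched in base R with cut() followed by a frequency table. The choice of three equal-width bins and the bin labels are arbitrary assumptions.

```r
# Discretize Petal.Length into three equal-width bins (the bin count is an arbitrary choice)
bins <- cut(iris$Petal.Length, breaks = 3, labels = c("short", "medium", "long"))

# Class-conditional probabilities P(bin | species), estimated from frequencies
cond_prob <- prop.table(table(iris$Species, bins), margin = 1)

print(round(cond_prob, 2))
```

Each row sums to 1, so the table can be used the same way as the conditional probability table of a categorical feature.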

6. Why is Naive Bayes called Naive?

The Naive Bayes algorithm is called “Naive” because it makes the assumption that the occurrence of a certain feature is independent of the occurrence of other features while in reality, they may be dependent in some way!

covid analysis

More than six months have elapsed since the World Health Organization declared Covid-19 a pandemic. The daily confirmed cases are still rising but, interestingly, Google Trends shows a loss of interest in searches related to Covid-19 recently. Maybe the initial panic has come down to a great extent. But how long can the pandemic last? How many more months do we have to live with this?

And India’s case load has become the world’s second highest. Now the question is “how many more months?”

When can India get back on its feet?

With the help of Worldometer data, an analysis was done using the growth/decay factor of daily new cases. By growth/decay factor, I mean the factor of increase/decrease and not the % change.

As per this simple mathematical analysis, the progression of Covid-19 has been slowing down with time since the very beginning. That is, it is not actually a growth factor but a decay factor. (Anyway, the term growth factor itself will be used till the end of this article.)

The growth factor was calculated across months for the daily new confirmed cases since April 2020 (considerable cases have been reported in India since April 2020). Further, the growth of the growth factor was also computed.

One straightforward observation was the approximately constant ‘growth of growth factor’ over the months. That is, ‘the increase of the increase’ did not fluctuate much. Instead it showed a somewhat constant figure.

Then, using this ‘growth of growth factor’, data points were extrapolated for future months. As per the data, new confirmed cases might be highest somewhere in September-October and then start declining slowly.
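The calculation described above can be sketched as follows. The monthly figures are invented placeholders, not Worldometer data, and reading the growth factor as the ratio of one month's value to the previous month's is my assumption about the method.

```r
# Hypothetical monthly averages of daily new cases (NOT real data)
cases <- c(Apr = 1000, May = 4000, Jun = 12000, Jul = 30000, Aug = 60000)

# Growth factor: ratio of each month's value to the previous month's
growth <- cases[-1] / cases[-length(cases)]

# 'Growth of growth factor': ratio of successive growth factors
growth_of_growth <- growth[-1] / growth[-length(growth)]
g2 <- mean(growth_of_growth)   # treated as roughly constant

# Extrapolate: each future growth factor is the previous one multiplied by g2
future_growth <- unname(tail(growth, 1)) * g2^(1:4)
future_cases  <- unname(tail(cases, 1)) * cumprod(future_growth)
names(future_cases) <- c("Sep", "Oct", "Nov", "Dec")

print(round(growth, 3))
print(round(g2, 3))
print(round(future_cases))
```

Because g2 is below 1 in this made-up series, the growth factor keeps shrinking and the extrapolated cases start falling once it drops below 1.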

Figure 1 reflects a sample trend of daily new confirmed cases across months.

Figure 1: Trend of daily new cases

Correlation – daily active cases Vs new cases

Using the ggplot2 package in R, a scatter plot was generated of daily total active cases against new confirmed cases. (This plot used data from the Covid-19 package in R.)

The total active cases on a day appear to be approximately ten times the new confirmed cases on that day (especially since July 2020). This implies that recoveries are progressing at a constant rate as of now. If any dip or delay occurs in medical care services, the total active cases would increase drastically, which would lead to a severe catastrophe.

Figure 2: Scatter plot


Hopefully India can get back on its feet by, say, the third quarter of 2021, with strict adherence to social distancing measures and better medical care services. Social distancing is a must, as a single infected person can become a bigger vulnerability later. Even though social distancing won’t end the disease, it can save more lives.

And last, but not least,

Recovery is not actually the end of this crisis. We are yet to face the lingering impacts of Covid-19, so let us build up our immunity in the best way possible.


If the curve had been flattened, maybe we would have a better understanding and better predictions about the end of the pandemic. But the graphs are still rising or fluctuating.

Moreover, we cannot expect a symmetric rise and fall of an epidemic. It could be a sharp rise and a somewhat random decline after the peak. Then, probably before touching the x axis, it may surge back up and appear with another peak.

Hence I know it is not wise to attempt such forecasting, especially when there are too many other factors at play, like possible mutations of the virus, changes in testing etc.

Hence the data presented here are purely based on my intuition from the mathematical analysis done and the publicly available data at the time of publication.

The information provided here is merely for analysis purposes. I won’t be responsible for any negative occurrences pertaining to the usage of this information. These reports are not peer-reviewed and therefore should not be treated as established information.

In this article, 7 statistical tests are explained which are essential for doing statistical analysis inside a CMMI High Maturity (HM) compliant project or organization.

1       Stability test


Data stability for the selected parameter is tested using Minitab before making performance baselines.


  1. Go to Stat->Control Charts->I-MR (in Variables, enter the column containing the parameter)
  2. From the section for Estimate, choose ‘Average Moving Range’ as the method of estimating sigma and ‘2’ as the length of the moving range
  3. From the section for Tests, choose the specific tests to perform
  4. From the section for Options, enter the sigma limit positions as 1 2 3.


After eliminating all the out of turn points, the system attains stability and is ready for baseline.
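For readers without Minitab, the limits of the individuals (I) chart can be sketched in base R. The data values are hypothetical, the constants 1.128 (d2) and 3.267 (D4) are the standard ones for a moving range of length 2, and the check below covers only points beyond the 3 sigma limits, not the other stability tests.

```r
# Hypothetical measurements in chronological order (illustrative values only)
x <- c(12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 12.2, 11.7, 12.0, 12.5, 11.9, 12.1)

mr     <- abs(diff(x))      # moving ranges of length 2
mr_bar <- mean(mr)          # average moving range
sigma  <- mr_bar / 1.128    # sigma estimate, d2 constant for n = 2

center <- mean(x)
limits <- c(LCL = center - 3 * sigma, CL = center, UCL = center + 3 * sigma)
mr_ucl <- 3.267 * mr_bar    # upper limit of the moving range chart, D4 constant for n = 2

# Points beyond the 3 sigma limits of the individuals chart
out_of_control <- which(x < limits["LCL"] | x > limits["UCL"])

print(round(limits, 3))
print(round(mr_ucl, 3))
print(out_of_control)
```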

2       Capability test


Once the selected parameter is baselined, its capability to meet the specification limits is tested.


  1. Go to Stat->Quality Tools->Capability Sixpack (Normal), choose ‘Single column’ (in Variables, enter the column containing the parameter), enter ‘1’ as subgroup size, and enter the lower spec and upper spec.
  2. From the section for Estimate, choose ‘Average Moving Range’ as the method of estimating sigma and ‘2’ as the length of the moving range.
  3. From the section for Options, enter ‘6’ as sigma tolerance, choose ‘within subgroup analysis’ and ‘percents’, and opt for ‘display graph’.


If the control limits are within specification limits or the Cp and Cpk values are equal to or greater than one, the data is found to be capable.
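The Cp and Cpk values mentioned above can be sketched in base R as follows. The data and the specification limits are hypothetical, and sigma is estimated from the average moving range as in the stability test.

```r
# Hypothetical stable data and specification limits (illustrative values only)
x   <- c(12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 12.2, 11.7, 12.0, 12.5, 11.9, 12.1)
lsl <- 11.0   # lower specification limit
usl <- 13.0   # upper specification limit

# Within-subgroup sigma from the average moving range (d2 = 1.128 for length 2)
sigma_within <- mean(abs(diff(x))) / 1.128

# Cp compares the spec width with the process spread; Cpk also accounts for centering
cp  <- (usl - lsl) / (6 * sigma_within)
cpk <- min(usl - mean(x), mean(x) - lsl) / (3 * sigma_within)

print(round(c(Cp = cp, Cpk = cpk), 3))
```

Cpk can never exceed Cp; the two are equal only when the process mean sits exactly midway between the specification limits.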

3       Correlation test


Correlation test will be conducted between each independent parameter and the dependent parameter (if both are of continuous data type) in the Process Performance Model.


  1. Go to Stat->Basic Statistics->Correlation (opt to display p-values)


For each correlation test p-value has to be less than 0.05 (or the decided p value within the organization based on risk analysis)
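An equivalent check in R can be sketched with cor.test from the stats package. The built-in mtcars data is used here only as a stand-in for process data, with wt playing the independent parameter and mpg the dependent one.

```r
# Pearson correlation test between one independent and the dependent parameter
# (mtcars is a stand-in for process data)
result <- cor.test(mtcars$wt, mtcars$mpg)

r_value <- unname(result$estimate)   # correlation coefficient
p_value <- result$p.value

print(round(r_value, 3))
print(p_value)
print(p_value < 0.05)   # TRUE means the correlation is significant at the 0.05 level
```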

4       Regression test


Regression test will be conducted including all the independent parameters and the dependent parameter in the Process Performance Model.


  1. Go to Stat->Regression->Regression (in Response and Predictors, enter the columns containing the dependent and independent parameters respectively)
  2. From the section for Storage, include Residuals also


  • The p-value has to be less than 0.05 for each factor as well as for the regression equation obtained (or the decided p value within the organization based on risk analysis).
  • [R-Sq (adj)] has to be greater than 70% (or the decided value within the organization based on risk analysis) for ensuring the correlation between the independent parameters and the dependent parameter. Otherwise, the parameter cannot be taken.
  • The Variance Inflation Factor (VIF) has to be less than 10. If VIF is greater than 10, a correlation test (Stat->Basic Statistics->Correlation) will be conducted among the different parameters which are influencing the Process Performance Model. In cases where correlation is high, i.e. greater than 0.5 or less than -0.5, the factors have dependency. In such cases, if the degree of correlation is quite high, one of the factors will be dropped or replaced with a new term.
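The three checks listed above can be sketched in base R. The model mpg ~ wt + hp on the built-in mtcars data is only a stand-in for a Process Performance Model, and the VIF is computed by hand as 1/(1 - R²) from regressing each predictor on the other predictors.

```r
# Stand-in model: mpg is the dependent parameter, wt and hp the independent ones
predictors <- c("wt", "hp")
fit <- lm(mpg ~ wt + hp, data = mtcars)
s   <- summary(fit)

# p-value of each factor and of the regression equation as a whole
coef_p  <- s$coefficients[-1, "Pr(>|t|)"]
f       <- s$fstatistic
model_p <- unname(pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE))

# Adjusted R-squared
adj_r2 <- s$adj.r.squared

# VIF per predictor: 1 / (1 - R^2) of that predictor regressed on the others
vif <- sapply(predictors, function(v) {
  others <- setdiff(predictors, v)
  r2 <- summary(lm(reformulate(others, response = v), data = mtcars))$r.squared
  1 / (1 - r2)
})

# Residuals, as stored in the Minitab step above
res <- residuals(fit)

print(coef_p)
print(model_p)
print(round(adj_r2, 3))
print(round(vif, 3))
```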

5       Normality test


Normality of the data is tested using the Anderson-Darling test.


  1. Go to Stat > Basic Statistics > Normality Test> Anderson-Darling test
  2. In Variables, enter the columns containing the measurement data.


For the data to be considered normally distributed, the null hypothesis must not be rejected. For this, the p value has to be greater than 0.05 (or the decided p value within the organization based on risk analysis) and the A2 value has to be less than 0.757.
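The A2 statistic can be sketched by hand in base R. The function name anderson_darling and the two data sets are hypothetical, the p-value is not computed here, and the small-sample adjustment shown is the commonly used one for a normal distribution with estimated mean and standard deviation; whether a given tool reports the raw or the adjusted value is something to check in its documentation.

```r
# Anderson-Darling statistic for normality with estimated mean and sd
anderson_darling <- function(x) {
  x <- sort(x)
  n <- length(x)
  z <- pnorm((x - mean(x)) / sd(x))   # fitted normal CDF at each ordered value
  i <- seq_len(n)
  a2 <- -n - mean((2 * i - 1) * (log(z) + log(1 - rev(z))))
  c(A2 = a2, A2_adjusted = a2 * (1 + 0.75 / n + 2.25 / n^2))
}

# Deterministic illustrative data: one bell-shaped set, one strongly skewed set
x_normal <- qnorm(ppoints(30), mean = 50, sd = 5)
x_skewed <- qexp(ppoints(30), rate = 0.1)

ad_normal <- anderson_darling(x_normal)
ad_skewed <- anderson_darling(x_skewed)

print(round(ad_normal, 3))
print(round(ad_skewed, 3))
```

The smaller the statistic, the closer the data follow the fitted normal distribution.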

6       Test for Two Variances


Test for Two Variances is conducted to analyse whether the variances are significantly different in two sets of data.

The null hypothesis (the two samples have equal variances) is tested against the alternate hypothesis (the two samples have unequal variances).


  1. Go to Stat > Basic Statistics > 2 Variances.
  2. Opt ‘Samples in Different Columns’. In Variables, enter the columns containing the measurement data

Results: If the test’s p-value is less than the chosen significance level (normally 0.05), null hypothesis will be rejected.
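A comparable check in R can be sketched with var.test from the stats package, which performs an F test of equal variances (Minitab may report additional tests in the same dialog). The two samples are made-up values for illustration.

```r
# Hypothetical measurements from two periods (illustrative values only)
before <- c(20.1, 22.3, 19.8, 21.5, 20.9, 22.0, 21.2, 20.4, 21.8, 20.7)
after  <- c(18.9, 19.5, 18.2, 19.9, 19.1, 18.7, 19.4, 18.5, 19.8, 19.0)

# F test: null hypothesis is that the two variances are equal
f_test  <- var.test(before, after)
f_ratio <- unname(f_test$statistic)   # ratio of the two sample variances
f_p     <- f_test$p.value

print(round(f_ratio, 3))
print(round(f_p, 3))
print(f_p < 0.05)   # TRUE would mean the null hypothesis is rejected
```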

7       Two sample T- Test


Two sample T test is used to check whether means are significantly different in two periods for two groups of data.

The null hypothesis is checked against one of the alternative hypotheses, depending on the situation.


  1. Go to Stat > Basic Statistics > 2 Sample T
  2. Opt ‘Samples in Different Columns’. In Variables, enter the columns containing the measurement data. (The first should be the initial data and the second the current data.)
  3. Check or uncheck the box for “Assume equal variances” depending upon the F test results (Two variance Test results)
  4. In the Options, use the required alternative, whether ‘not equal’, ‘less than’ or ‘greater than’.
  5. Put test difference as 0 and confidence interval as 95.


If the test’s p-value is less than the chosen significance level (normally 0.05), null hypothesis has to be rejected.
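The steps above can be sketched in R with t.test from the stats package. The samples are made-up values, and the equal-variance choice is taken from an F test first, mirroring step 3.

```r
# Hypothetical initial and current data (illustrative values only)
initial <- c(20.1, 22.3, 19.8, 21.5, 20.9, 22.0, 21.2, 20.4, 21.8, 20.7)
current <- c(18.9, 19.5, 18.2, 19.9, 19.1, 18.7, 19.4, 18.5, 19.8, 19.0)

# Step 3: assume equal variances only if the F test does not reject them
equal_var <- var.test(initial, current)$p.value >= 0.05

# Steps 4 and 5: alternative 'not equal', test difference 0, 95% confidence
t_result <- t.test(initial, current,
                   alternative = "two.sided",
                   mu = 0,
                   var.equal = equal_var,
                   conf.level = 0.95)
t_p <- t_result$p.value

print(equal_var)
print(t_p)
print(t_p < 0.05)   # TRUE means the means differ significantly
```

Use alternative = "less" or alternative = "greater" for the one-sided versions of the test.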

Sub processes are components of a larger defined process. For example, a typical development process may be defined in terms of sub processes such as requirements development, design, build, review and test. The sub processes themselves may be further decomposed into other sub processes and process elements. Measurable parameters are defined for these sub processes to analyse their performance. The sub processes are further studied to identify the critical sub processes which influence the process performance objectives (PPOs). Measurable objectives are set for the critical sub process measures also. PPOs are derived from Business Objectives (BOs).

In the above paragraph, there is a linkage established starting from sub process to BOs. In fact in an organization

Baselines are derived statistically using performance data collected over a period of time. They are indicators of current performance of an organization. Hence proper attention must be paid while deriving baselines as an error can cause even a loss of a business. There are some critical, but common mistakes observed in the baselining process as explained below. Crucial steps must be taken to avoid such mistakes.

Pitfall #1: Inapt parameter for baselining.

The organization must plan and define measures that are tangible indicators of process performance. Baselining does not simply mean gathering and baselining the entire set of data available in the organization. Based on the business objectives, the critical processes of the organization whose performance needs to be analyzed are selected. Then process parameters for monitoring them are defined, data is collected and finally baselining is done. There is no harm in collecting and baselining all the parameters defined in the organization, but why should we waste our time collecting data which won’t be used?

Pitfall #2: Not chronological data.

For baselining with control charts, it is essential that the data be chronological. Hence, during data collection itself, the time stamp of the data must be noted.

Pitfall #3: Lack of enough number of data points.

In the software industry, we often hear complaints from the baselining team about the lack of data points. And when the question is put to the project team, they say things like “we just don’t have time” or “it is too difficult”. In order to derive baselines there needs to be a minimum number of data points, say 10 or so. Only then can at least all the 4 rules of stability be applied to the data points. But in the software industry people try to build baselines with 8 or fewer data points, which won’t indicate the correct performance level of the process under investigation. In such cases, where the number of data points is insufficient, baselining needs to be postponed, or the organization can plan to collect more samples by increasing the frequency of data collection.

Pitfall #4: Being inconsistent.

While collecting as well as baselining data, one must use consistent methods and processes. What is being measured in the post baseline data needs to be same as what was measured in the baseline data collection process.

Pitfall #5: Taking non homogeneous data

Data taken for baselining needs to be homogeneous. Otherwise the baselining output won’t give the correct indication of process performance. The data can be categorized based on qualitative parameters like type of project, complexity of the work, nature of development, programming languages etc., instead of clubbing it all together and thereby losing the homogeneity.

Pitfall #6: Absence of data verification.

Usually it is a common mistake to take data blindly from organizational database and start the baselining process. Essentially, data must be verified to ensure its completeness, correctness and consistency before any statistical processing.

Pitfall #7: Not representative sample.

Processes that permit self-selection by respondents do not produce random samples and often aren’t representative of the target population. For a sample to be usable for baselining, it has to be ensured that it is truly random and representative.

Pitfall #8: Basing the baseline value on assumptions, not real data.

People have a tendency to believe that the collected data follows a normal distribution. Sometimes they don’t even check the normality statistically. In other cases, even after the data is found to be non-normal statistically, people try to make it normal by removing some data points. It is reasonable to remove one or two points out of 15 to 20 if there are assignable reasons. Other than that, it is not a good practice to simply remove data points in order to make the distribution normal. It is essential to check the actual distribution of the data before going ahead with baselining. Control charts work on a normal data set only. One can check the distribution of the data visually using histograms, and confirm the distribution statistically using other tools (there are plenty of Excel add-ins to check the distribution).

Pitfall #9: Ignoring the past Data if there is no process change.

Suppose an organization does yearly baselining. At the start of the year 2013, baselines were derived using data points from the previous year, 2012. Objectives were set to ‘maintain the current process performance’, with no higher targets, and hence no improvement initiatives were triggered to raise the performance level. Next year, the data points from 2013 were collected for baselining and it was confirmed statistically that both sets of data (the data points from 2012 and those from 2013) were equal, say from the results of a 2 sample T test. Now which data set should the organization take for the 2014 baselining? It is a common mistake to ignore the 2012 data and do the baselining with the 2013 data points alone. Since both sets of data points were similar and statistically equal, both sets must be combined in chronological order while baselining.

Pitfall #10: Blindly taking p value as 0.05

The null hypothesis is rejected if the p value is less than a significance level. In the industry, the significance level is usually taken as 0.05. Actually, this threshold is an arbitrary value. The higher the significance level, the higher the risk of rejecting a null hypothesis when it was actually true. (Refer more details of p value in the blog hypothesis test ) And it is up to the organization to decide that significance level.

Pitfall #11: Removing out of turn points when there are no assignable causes

Out of turn points cannot be removed if there are no assignable reasons behind them. If there is no reason for an out of turn point, it implies that the data is not stable and one cannot go ahead with baselining.

Pitfall #12: Placing unfeasible values as control limits

Sometimes the control limits derived statistically during the baselining process may be unworkable. For example, a baseline of review effectiveness data (in %) cannot have an upper control limit (UCL) of 120% even though it is statistically correct. Similarly, a coding speed baseline cannot have a lower control limit (LCL) of -15 lines of code/hr. All such values are unusable, so an organization needs to have a policy to handle such situations. For example, an organization can use the 25th and 75th percentiles of the stable data as control limits in such a scenario. Or the organization can decide to change the LCL/UCL to the minimum/maximum permissible value of that parameter, i.e. in the above examples, change the LCL of coding speed to zero instead of a negative value and the UCL of review effectiveness to 100%.
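Both policies can be sketched in a few lines of R. The helper name clamp_limits, the limit values other than the 120% and -15 from the examples above, and the sample data are all hypothetical.

```r
# Policy 1: clamp statistically derived limits to the permissible range of the parameter
clamp_limits <- function(lcl, ucl, min_allowed = -Inf, max_allowed = Inf) {
  c(LCL = max(lcl, min_allowed), UCL = min(ucl, max_allowed))
}

# Review effectiveness (%) cannot exceed 100
review_limits <- clamp_limits(lcl = 62, ucl = 120, min_allowed = 0, max_allowed = 100)

# Coding speed (lines of code/hr) cannot be negative
speed_limits <- clamp_limits(lcl = -15, ucl = 48, min_allowed = 0)

# Policy 2: use the 25th and 75th percentiles of the stable data instead
stable_data       <- c(70, 82, 91, 88, 76, 95, 85, 79, 90, 87)   # hypothetical values
percentile_limits <- quantile(stable_data, probs = c(0.25, 0.75))

print(review_limits)
print(speed_limits)
print(percentile_limits)
```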

Pitfall #13: Stating the baseline without contextual information

Stating the context description involves a consistent understanding of the result of the measurement process. Contextual information refers to the additional data related to the environment in which a process is executed. As a part of contextual information, timestamp, context, measurement units etc. are collected.

Pitfall #14: Inapt communication mode.

Nowadays, our computer software supports a wide range of graphs. People try to use all of those graphs together and end up hiding or complicating the real message. One must select the right graph to communicate the processed data. Run charts, pie charts, control charts and bar charts are all good means of communication, but the best fit must be chosen.

Pitfall #15: Not beginning with the end in mind.

One must determine in advance how the processed data is going to be used. This helps to make good choices about what data to collect (never waste time collecting data which won’t be used) and what tool to use. One must also plan to measure everything needed to calculate the effect of a change. It is usually too late to go back and correct things if something is left out.

Further to what you might have read in

Unlike the C chart, the U chart does not require a constant number of sample items, and it does not require any limit on the potential number of defects/nonconformities. Furthermore, for a P chart or NP chart, the number of nonconformities cannot exceed the number of items in a sample, but for a U chart this is acceptable because what is being addressed is not the number of defective items, but the number of defects in the sample. In short,

Attribute Data charts
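For the U chart described above, the control limits can be sketched in base R. The defect counts and sample sizes are hypothetical; the limits vary from sample to sample because the sample size is allowed to vary.

```r
# Hypothetical inspection data: the number of units differs between samples
defects <- c(4, 7, 3, 9, 5, 6)        # defects found in each sample
units   <- c(10, 15, 8, 20, 12, 14)   # number of units inspected in each sample

u     <- defects / units              # defects per unit in each sample
u_bar <- sum(defects) / sum(units)    # center line

# 3 sigma limits, one pair per sample; the lower limit cannot go below zero
ucl <- u_bar + 3 * sqrt(u_bar / units)
lcl <- pmax(0, u_bar - 3 * sqrt(u_bar / units))

u_chart <- data.frame(units, defects,
                      u = round(u, 3), lcl = round(lcl, 3), ucl = round(ucl, 3))
print(round(u_bar, 3))
print(u_chart)
```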





















The word statistics, when referring to a scientific discipline, should not be confused with the word statistics referring to a quantity like the mean or median calculated from a set of data, as given in the example.

Statistics is all about collection, organization and analysis of data and making interpretation.

To study a population or process, we collect the required data showing the characteristics of the process. It may not be practically possible to collect the entire set of data. Then, a chosen subset of the population called a sample is studied. Sample should be a representative sample. This data can then be subjected to statistical analysis, serving two related purposes: description and inference.