An algorithm is quite simply a set of steps to be taken in order to accomplish a specific task. Algorithms can be as simple as the steps you take to prepare breakfast in the morning, but we are more familiar with them in more complex settings, for example, computer programming. More and more we rely on algorithms in our daily lives, from video transmission across the internet, made possible by audio and video compression algorithms, to Google Maps, which uses route-finding algorithms.
The word algorithm comes from the Latin Algoritmi, which in turn is a Latinisation of the Arabic name al-Khwarizmi, the man very much credited with the foundation of algebra. Mohammed ibn-Musa al-Khwarizmi was a Persian mathematician, astronomer and geographer who wrote “The Compendious Book on Calculation by Completion and Balancing” in around AD 820. It was translated into Latin in about the mid-12th century and presented the world with the first systematic approach to solving linear and quadratic equations. It includes an approach called al-jabr, from which the term algebra is derived. BTW, I have checked and unfortunately this book is currently out of stock on Amazon with no restocking date.
Good algorithms solve problems correctly and efficiently, but not always exactly. Let's take data mining, for example, where we use a set of heuristics and calculations to create a model. The algorithm analyses the input data for patterns and trends and uses the results to find the optimal parameters for the data mining model.
Further on we will look at two algorithms, Lift and chi-squared, which respectively measure and test for correlation between two events. Before we get into the examples, though, a word of warning with respect to all statistical analysis.
Correlation, not Causation: it is important to remember that correlation between two variables does not necessarily mean that one causes the other.
Intellectual laziness can lead to Post Hoc, Ergo Propter Hoc (after this, therefore because of this), the false cause fallacy. So, you are driving down the road and a black cat crosses your path. 15 minutes later, you drive over a chunk of metal that punctures your tyre. Would you really succumb to the idea that this was bad luck as a result of having encountered the black cat?
There are so many other fallacies that you need to consider when analysing data and looking for cause and effect. My favourite, which we have all encountered (and some people are definitely more prone to using it than others), is Circular Reasoning.
So, algorithms will definitely help to establish relationships between entities, such as correlation, but human intelligence, subject matter knowledge, experience and good old-fashioned common sense are still essential ingredients when using this analysis to select intelligently from a vast number of possible decisions. That said, take care not to over-complicate with hypotheticals; think of Occam's Razor.
We do not need to worry too much about Occam's Razor with our Lift and chi-squared algorithms, but we will see where both could give us very misleading results, for example, where there is a very high level of null transactions (i.e. transactions that contain neither A nor B, the variables that we are testing for correlation).
Let's take the example of a supermarket that wants to understand the likelihood of a customer buying coffee if they buy biscuits, and vice versa.
Using Lift to analyse the relationship, we get the following results.
(For the purposes of the following calculation, B represents biscuits and C represents coffee).
Using Lift, the correlation is said to be negative if the result is less than 1, independent if it is equal to 1, and positive if it is greater than 1. (Lift(B, C) is calculated as P(B and C) / (P(B) × P(C)), the ratio of the probability of the two events occurring together to the probability expected if they were independent.)
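The underlying transaction tables are not reproduced in this post, so here is a minimal Python sketch of the calculation using hypothetical counts, chosen purely so that they reproduce the Lift value of 1.05 discussed below; treat the numbers as illustrative rather than as the actual dataset.

```python
def lift(n_bc, n_b, n_c, n_total):
    """Lift(B, C) = P(B and C) / (P(B) * P(C))."""
    p_bc = n_bc / n_total
    p_b = n_b / n_total
    p_c = n_c / n_total
    return p_bc / (p_b * p_c)

# Hypothetical 2x2 contingency counts (assumed, not the original data):
n_bc = 600        # transactions containing both biscuits and coffee
n_b_only = 400    # biscuits but no coffee
n_c_only = 200    # coffee but no biscuits
n_neither = 200   # null transactions: neither biscuits nor coffee

n_b = n_bc + n_b_only                             # 1,000 contain biscuits
n_c = n_bc + n_c_only                             # 800 contain coffee
n_total = n_bc + n_b_only + n_c_only + n_neither  # 1,400 transactions

print(round(lift(n_bc, n_b, n_c, n_total), 2))    # 1.05 -> positive
```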
So, we can see that there is a positive correlation between the purchase of biscuits and the purchase of coffee. The question for the retailer is whether these products should be positioned close together to maximise the revenue from this correlation, or far apart from each other with a view to maximising spontaneous spending.
As promised, we can use chi-squared to test for correlation on a similar type of dataset.
Using chi-squared, a result of greater than zero indicates there is a correlation. You can see from the relationship between the observed (or actual) value and the expected value (the count you would see if the two events were independent) whether the correlation is positive or negative: where the observed is greater than the expected, the correlation is positive, and where the expected is greater than the observed, the correlation is negative.
In this case we get a result of 12.5, and this is a positive correlation as the observed value for biscuits and coffee (900) is greater than the expected value (800).
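The full table is again not shown, but the quoted 12.5 is exactly what the biscuits-and-coffee cell contributes on its own, (900 − 800)² / 800, so here is a minimal sketch of that cell's calculation; bear in mind that the full chi-squared statistic sums this quantity over all four cells of the 2×2 table, with each expected count taken as (row total × column total) / grand total.

```python
def chi_squared_cell(observed, expected):
    """Contribution of a single contingency-table cell: (O - E)^2 / E."""
    return (observed - expected) ** 2 / expected

# Figures quoted above: 900 observed vs 800 expected for biscuits-and-coffee.
print(chi_squared_cell(900, 800))  # 12.5

# The remaining three cells (biscuits only, coffee only, neither) are not
# given in the post, so only the quoted cell is reproduced here.
```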
Let’s now look at a situation where, although the relationship between the Biscuit and Coffee events remains the same, the Lift result might just get us a little confused.
Let's take the example from Question 1, where we saw a positive correlation for Lift(B, C) with a result of 1.05. Here, though, we change the value of ^B^C (the transactions containing neither B nor C) from 200 to 10,000. We still see a positive correlation for Lift(B, C), which you would expect, as the ratio of the number of times each item appears with the other to the number of times it appears without the other is unchanged. However, the result of 8.40 would appear to indicate a much stronger correlation, which of course is not the case.
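Reusing the lift function and the same assumed counts from the earlier sketch, only the null-transaction count changes, and the Lift value is dragged from 1.05 up to 8.40:

```python
# Same hypothetical counts as before; only ^B^C (null transactions) varies.
n_bc, n_b_only, n_c_only = 600, 400, 200
for n_neither in (200, 10_000):
    n_b = n_bc + n_b_only
    n_c = n_bc + n_c_only
    n_total = n_bc + n_b_only + n_c_only + n_neither
    print(n_neither, round(lift(n_bc, n_b, n_c, n_total), 2))
# 200    -> 1.05
# 10000  -> 8.4
```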
So, you can see why it is important to understand the circumstances in which the Lift result would provide you with a figure on which you could base business decisions, and those where you might want to do a bit more analysis. The result calculated above is inflated by the number of null transactions (i.e. the number of transactions that include neither B nor C).
For such a scenario, a null-invariant measure of interestingness is recommended. As the name suggests, these measures are not impacted by the number of null transactions.
So, let's run both of the dataset examples used for Lift above, only this time using the Kulczynski algorithm.
We will set epsilon so that we are confident that anything below this value is of no interest to us. My understanding is that if Kulczynski is near 0 or near 1, then we have an interesting association that is negative or positive respectively (with values around 0.5 suggesting neither).
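As a minimal sketch, again fed with the assumed counts from the Lift example, Kulczynski averages the two conditional probabilities P(B|C) and P(C|B), neither of which involves the total transaction count:

```python
def kulczynski(n_bc, n_b, n_c):
    """Kulc(B, C) = (P(B|C) + P(C|B)) / 2 = (n_bc/n_c + n_bc/n_b) / 2."""
    return (n_bc / n_c + n_bc / n_b) / 2

# Null transactions appear nowhere in the formula, so the 200-null and
# 10,000-null datasets give exactly the same result.
print(kulczynski(600, 1_000, 800))  # 0.675 for either dataset
```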
The null-invariant Kulczynski algorithm produced exactly the same result for both datasets, and it will continue to give the same result regardless of the value entered for ^B^C.
There are other null-invariant measures that can also be tested out, such as Cosine (the null-invariant version of Lift, as some would say), All Confidence, Max Confidence and Jaccard.
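For completeness, here are sketches of those measures using their standard formulas, again with the assumed counts from earlier; like Kulczynski, none of them references the total transaction count, which is exactly what makes them null-invariant.

```python
import math

def cosine(n_bc, n_b, n_c):
    """Null-invariant analogue of Lift: P(B and C) / sqrt(P(B) * P(C))."""
    return n_bc / math.sqrt(n_b * n_c)

def all_confidence(n_bc, n_b, n_c):
    """min(P(B|C), P(C|B)), i.e. n_bc / max(n_b, n_c)."""
    return n_bc / max(n_b, n_c)

def max_confidence(n_bc, n_b, n_c):
    """max(P(B|C), P(C|B))."""
    return max(n_bc / n_b, n_bc / n_c)

def jaccard(n_bc, n_b, n_c):
    """n(B and C) / n(B or C)."""
    return n_bc / (n_b + n_c - n_bc)

for measure in (cosine, all_confidence, max_confidence, jaccard):
    print(measure.__name__, round(measure(600, 1_000, 800), 3))
```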
Being new to data-mining-specific algorithms, I find it very interesting to look at the results of the few that I have worked with. I would like to spend much more time understanding more of these algorithms, but most importantly, getting an understanding of which algorithm is best suited to analysing various types of datasets, what the potential fallacies of each are, and which ones should be used in conjunction with each other to give me the best possible understanding of the relationships between the events of my dataset.