Alternative data, a trendy subject nowadays, is non-traditional data that can be used in the investment process. A growing share of hedge fund managers plan to use this kind of data, together with new analytics, in their investment processes. Examples of alternative data are:
- Social media
- Web crawls
- Satellite & weather
- Consumer credit
- Internet of Things
- Mobile App usage
- Store locations
The amount of such data may be huge. The logic behind adopting alternative data is:
“Large amounts of data lead to more available information for better analysis”
This statement is not necessarily true. Let’s see why. Before you start to solve a numerical problem you should check its conditioning. An ill-conditioned problem will produce very fragile and unreliable results – no matter how elegant and sophisticated the solution you come up with, it may be irrelevant or simply wrong.
Simple and basic problem
Consider a simple and basic problem, a linear system of equations: y = A x + b. If A is ill-conditioned, the solution x will be very sensitive to the entries of both b and y: relative errors therein can be amplified by a factor of up to the so-called condition number of A, i.e. k(A). That’s as far as simple linear algebra goes. However, most problems in life cannot be tackled via a linear matrix equation (or any other type of equation, for that matter). This does not mean, though, that they cannot be ill-conditioned. Quite the contrary.
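To make the amplification effect concrete, here is a minimal sketch in Python with NumPy. The matrices are our own illustrative choices and are not taken from any real data set: a Hilbert matrix stands in for an ill-conditioned A, and a matrix close to the identity stands in for a well-conditioned one. The helper names (solve_linear, hilbert, relative_error_after_perturbation) are hypothetical.

```python
import numpy as np

def solve_linear(A, y, b):
    """Solve y = A x + b for x."""
    return np.linalg.solve(A, y - b)

def hilbert(n):
    """Hilbert matrix H[i, j] = 1 / (i + j + 1), a textbook ill-conditioned matrix."""
    i = np.arange(1, n + 1)
    return 1.0 / (i[:, None] + i[None, :] - 1.0)

n = 8
A_bad = hilbert(n)                      # ill-conditioned (illustrative choice)
A_good = np.eye(n) + 0.1 * hilbert(n)   # well-conditioned (illustrative choice)
x_true = np.ones(n)
b = np.zeros(n)

# The same tiny error is added to y in both cases.
rng = np.random.default_rng(0)
noise = 1e-8 * rng.standard_normal(n)

def relative_error_after_perturbation(A):
    """Relative error in x caused by perturbing y with the noise vector."""
    y = A @ x_true + b
    x_perturbed = solve_linear(A, y + noise, b)
    return np.linalg.norm(x_perturbed - x_true) / np.linalg.norm(x_true)

cond_bad = np.linalg.cond(A_bad)
cond_good = np.linalg.cond(A_good)
err_bad = relative_error_after_perturbation(A_bad)
err_good = relative_error_after_perturbation(A_good)

print(f"k(A) ill-conditioned:  {cond_bad:.2e}, relative error in x: {err_bad:.2e}")
print(f"k(A) well-conditioned: {cond_good:.2e}, relative error in x: {err_good:.2e}")
```

The perturbation of y is identical in both runs. Only the conditioning of A differs, and so does the damage done to x.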
The numerical conditioning of a problem should always be computed before one attempts its solution. How often is this done? Very, very rarely. Once you’ve determined that a problem is well-conditioned, there is the issue of determining if it will allow a solution, multiple solutions or none. Those who practice math on a daily basis know this well. Nothing new under the Sun. However, if you collect huge amounts of data and you don’t check its numerical conditioning before you start to work on it, you may be playing an extravagant video game.
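As a rough illustration of such a pre-check on a data set, the sketch below standardises the columns of a data matrix and takes the ratio of its largest to smallest singular value. This is one conventional way to obtain a condition number for tabular data, not necessarily the check the reader will want in practice. The function name data_condition_number and the threshold value are assumptions of ours.

```python
import numpy as np

def data_condition_number(X):
    """Condition number of a column-standardised data matrix.

    Computed as the ratio of the largest to the smallest singular value.
    Assumes no column is constant (a constant column has zero spread).
    """
    X = np.asarray(X, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    s = np.linalg.svd(Z, compute_uv=False)  # singular values, largest first
    return s[0] / s[-1]

rng = np.random.default_rng(1)
n = 500
x1 = rng.standard_normal(n)
x2 = rng.standard_normal(n)

# Two genuinely different columns versus a column and its near-copy.
independent = np.column_stack([x1, x2])
near_duplicate = np.column_stack([x1, x1 + 1e-6 * x2])

THRESHOLD = 1e3  # hypothetical cut-off, to be tuned per application

k_independent = data_condition_number(independent)
k_duplicate = data_condition_number(near_duplicate)

for name, k in [("independent", k_independent), ("near-duplicate", k_duplicate)]:
    verdict = "ill-conditioned" if k > THRESHOLD else "acceptable"
    print(f"{name}: condition number {k:.2e} ({verdict})")
```

The second data set has just as many numbers as the first, yet one of its columns carries almost nothing the other does not already contain.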
There is one fundamental issue, in our view, which makes huge problems and huge data sets difficult to solve – high complexity. Operating close to critical complexity (each system has such a threshold) means that the problem (or system) is very ill-conditioned and dominated by uncertainty (i.e. is chaotic). Imagine the linear system of equations y = A x + b in which the entries of A are not crisp values but fuzzy. In other words, suppose that a particular entry a_ij assumes values from a certain range and that the “exact” value is unknown. This changes the situation dramatically as the system can lead to a huge number of solutions.
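The effect of fuzzy entries can be sketched with a simple Monte Carlo experiment. We assume, purely for illustration, that every entry of A is known only to within a fixed half-width and is uniformly distributed inside that interval. The matrices, the half-width and the function name solution_spread are our own hypothetical choices.

```python
import numpy as np

def solution_spread(A_nominal, y, b, half_width, n_samples=2000, seed=0):
    """Sample A entry-wise from [a_ij - half_width, a_ij + half_width],
    solve y = A x + b each time, and return the width (max - min)
    of every component of x over all samples."""
    rng = np.random.default_rng(seed)
    solutions = np.empty((n_samples, A_nominal.shape[1]))
    for k in range(n_samples):
        A_k = A_nominal + rng.uniform(-half_width, half_width, size=A_nominal.shape)
        solutions[k] = np.linalg.solve(A_k, y - b)
    return solutions.max(axis=0) - solutions.min(axis=0)

b = np.zeros(2)
half_width = 0.002  # assumed uncertainty on every entry of A

# Both nominal systems have the exact solution x = (1, 1).
A_good = np.array([[2.0, 1.0], [1.0, 3.0]])   # well-conditioned
y_good = np.array([3.0, 4.0])
A_bad = np.array([[1.0, 1.0], [1.0, 1.01]])   # nearly singular
y_bad = np.array([2.0, 2.01])

spread_good = solution_spread(A_good, y_good, b, half_width)
spread_bad = solution_spread(A_bad, y_bad, b, half_width)

print("spread of x, well-conditioned A:", spread_good)
print("spread of x, nearly singular A: ", spread_bad)
```

With the same uncertainty on the entries of A, the well-conditioned system still pins x down to a narrow band around (1, 1), while the nearly singular one admits solutions scattered over a far wider range.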
Chaotic soup of numbers
Suppose, now, that a huge set of alternative data has been collected. How can one determine if this data is of value, i.e. if it contains structure and useful rules or if it is just a chaotic soup of numbers? This can be done easily by measuring the data set’s complexity and corresponding critical complexity. Their ratio is a good proxy for numerical conditioning (i.e. k(A)). Very simple examples of what we are talking about are shown in the figure below.
The case on the left-hand side corresponds to a low-complexity, high-correlation situation in which one may extract a crisp and useful rule (i.e. ‘if X increases then Y increases’). At the other extreme, the data is uncorrelated and no rule may be extracted.
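The two extremes can be reproduced with synthetic data. Note that the sketch below uses the plain Pearson correlation coefficient as a stand-in. It is not the complexity measure discussed in this article, and it only detects linear structure. The data and variable names are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000
x = rng.uniform(0.0, 1.0, n)

# Structured case: Y follows X closely, so a crisp rule exists.
y_structured = 2.0 * x + 0.05 * rng.standard_normal(n)

# Chaotic case: Y is drawn independently of X, so no rule can be extracted.
y_noise = rng.uniform(0.0, 1.0, n)

r_structured = np.corrcoef(x, y_structured)[0, 1]
r_noise = np.corrcoef(x, y_noise)[0, 1]

print(f"structured data: r = {r_structured:.3f}")
print(f"chaotic soup:    r = {r_noise:.3f}")
```

Both data sets contain exactly the same number of points. Only the first one supports a rule such as ‘if X increases then Y increases’.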
So, the complexity/critical complexity ratio for a data set is a sort of data set rating – a low value points to data which can deliver useful information, while values close to 1 reveal a situation dominated by chaos and noise.
One last point. Data and the corresponding analyses must be relevant, not accurate. When high complexity kicks in there is no such thing as accuracy or precision. The Principle of Incompatibility, coined by L. Zadeh, states that ‘high complexity is incompatible with high precision’. In other words, when complexity is high, ‘precise statements lose relevance’ and ‘relevant statements lose precision’.
Precise statement that is irrelevant: the probability of default of a given corporation in the next 3 years is 0.025%.
Relevant statement that is not precise: there is a high probability that it may rain tomorrow.
Alternative data, or Big Data, can be very complex. How complex? Well, you need to measure it, but suppose that its complexity is indeed high. In such circumstances don’t delude yourself – the information you extract from it will not be precise. Adding more data is not synonymous with adding more information.