The Philosophical Data Analyst: Some Variables are from Extremistan
Most organizations understand data as an asset, providing a rich resource that can be analyzed to unearth both descriptive and predictive insights. Data-driven decision-making is promoted by many organizations. More and more, organizations are processing and analyzing data in real-time, automating operations based on results, making them reliant not only on the data but on the methods used to analyze it.
Many organizations invest heavily in their data operations to ensure the ongoing completeness, integrity, and accuracy of collected data. However, regardless of how complete, correct, and/or unbiased collected data may be, there are limits as to what insights can be gleaned from a given dataset. Acting on analytical insights outside these limits introduces risk.
This article describes risks posed by data variables that are susceptible to seemingly improbable or extreme values. It explains the characteristics of susceptible data variables and outlines a simple technique for identifying them. Once identified, Business Data Analysts can assess the impact of using these variables in the analysis, allowing any risks to be quantified and mitigated and/or alternative analytical approaches explored.
Mediocristan vs. Extremistan
It is common to describe the range and frequency of variable values using statistical distribution patterns. The ability to describe data variables using a particular distribution pattern is a prerequisite for many data modeling methods. The most common statistical distribution pattern is the normal distribution, also known as the bell curve.
Image credit: MathsIsFun.com
For a normally distributed data variable, the frequency of values is symmetrically clustered around a mean – the further away from the mean, the less likely the data variable will take on that value. However, in some cases, data variables may appear to hold a certain statistical distribution pattern when, in fact, they may legitimately take on values that (given the distribution pattern) are deemed improbable or extreme – particularly in cases where a variable is subject to complex and/or unknown external influences.
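To make "the further away from the mean, the less likely" concrete, the short sketch below computes the probability that a normally distributed variable falls more than k standard deviations from its mean. The function name `two_sided_tail_probability` is illustrative and not taken from the article; the formula itself is the standard normal tail identity.

```python
import math


def two_sided_tail_probability(k: float) -> float:
    """Probability that a normal variable lies more than k standard
    deviations away from its mean, in either direction."""
    return math.erfc(k / math.sqrt(2))


# Print how quickly the probability shrinks as we move away from the mean.
for k in (1, 2, 3, 6):
    print(f"more than {k} standard deviations from the mean: "
          f"{two_sided_tail_probability(k):.2e}")
```

Under a normal model, values six standard deviations out are treated as practically impossible. If a variable produces such values with any regularity, the normal model is a poor description of it.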
In The Black Swan, Taleb introduces the idea of Mediocristan and Extremistan. Taleb defines Mediocristan as subject to the routine, obvious and predicted, while Extremistan is subject to the singular, accidental, and unseen (Taleb, pg. 35). Applied to data analysis, data variables from Extremistan are susceptible to extreme and/or unpredictable values, while those from Mediocristan are not. The argument is that data from Extremistan cannot be accurately described using common statistical distribution patterns – and certainly cannot be described using normal distribution as value frequency is not symmetrical.
For example, take a sample of 1000 randomly selected human beings. If you were to calculate the average height of the group, what would you expect the answer to be? Now add the tallest human on earth to the sample (which will now comprise 1001 humans) and recalculate the average height of the group. How much would you expect the result to change? The answer is not a lot as there is a limit as to how tall a human can be. While adding the tallest human on earth to the sample would cause the overall average to rise, the impact would be minor. Human height is an example of a variable from Mediocristan.
Now perform a similar thought experiment, except this time use net worth instead of height. What would you expect to happen to the calculated average net worth of a randomly selected group of 1000 human beings if you were to add the richest person in the world to the group? In 2021, Forbes identified Jeff Bezos as having the highest net worth in the world, estimated to be around US$177 billion. You would expect the average from the sample that included Jeff Bezos to be dramatically higher compared to one that didn’t. Net worth is an example of a variable from Extremistan.
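The two thought experiments can be simulated in a few lines. The sketch below is illustrative only: the height and net worth samples are synthetic, the distribution parameters are assumptions chosen for demonstration rather than real population data, and the 251 cm figure for the tallest person is an approximate assumed value. Only the US$177 billion figure comes from the text above.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

# Assumption: heights in cm drawn from a normal distribution (mean 170, sd 10).
heights = [random.gauss(170, 10) for _ in range(1000)]
# Assumption: net worth in US$ drawn from a log-normal distribution.
net_worths = [random.lognormvariate(11, 1.5) for _ in range(1000)]

TALLEST_CM = 251.0    # assumed approximate height of the tallest person
RICHEST_USD = 177e9   # Forbes 2021 estimate quoted in the text


def relative_mean_shift(sample, extreme):
    """Relative change in the mean after adding one extreme value."""
    before = statistics.fmean(sample)
    after = statistics.fmean(sample + [extreme])
    return (after - before) / before


height_shift = relative_mean_shift(heights, TALLEST_CM)
wealth_shift = relative_mean_shift(net_worths, RICHEST_USD)

print(f"Mean height changes by {height_shift:.3%}")
print(f"Mean net worth changes by a factor of about {1 + wealth_shift:,.0f}")
```

With these assumed parameters the height average barely moves, while the net worth average is dominated by the single added value.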
The problem is that if you were to take a random sample of 1000 humans from earth, what is the likelihood that it would include Jeff Bezos? Or Bill Gates? Or Jay-Z? Or the Queen? Or anyone else with a much higher net worth than average? And if the sample did happen to include one of these individuals, what would you do with the offending value? Would you treat it as a true representation of the entity being described? Or would you discard it as an outlier?
Extreme values can be mistaken for outliers when they are in fact indicative of the behavior of the entity they are representing. Analysis that does not account for Extremistan variables properly may prove unreliable – particularly when accurate but extreme values enter the underlying data. Extremistan variables may also be subject to extreme changes in value as a result of seemingly unlikely or improbable events (for example, a sudden stock market shock impacting the net worth of some individuals more than others).
What to do in Extremistan?
Taleb proposes a method for modeling Extremistan variables based on the work of the mathematician Mandelbrot, the pioneer of fractal geometry. However, the mathematics is complicated and beyond the capabilities of most organizations. (As most Business Data Analysts would know, explaining analysis based on simple mathematics to stakeholders can be a struggle – let alone analysis that uses more complex mathematical modeling techniques.) Understanding Extremistan variables and how they contribute to the analysis, and mitigating any risks their use poses, is a more realistic goal for most organizations.
Start by classifying data variables into the categories ‘Mediocristan’ and ‘Extremistan’. The table below provides some guidance on the characteristics of Mediocristan and Extremistan variables.
| Mediocristan | Extremistan |
| --- | --- |
| Non-scalable | Scalable |
| The most typical member is mediocre | There is no ‘typical’ member |
| Winner gets a small segment of the pie | Winner takes all |
| Impervious to Black Swan (seemingly improbable) events | Vulnerable to Black Swan (seemingly improbable) events |
| Often corresponds to physical quantities with limits | Often corresponds to numbers with no limits |
| Physical, naturally occurring phenomena are often from Mediocristan | Variables that describe social, man-made aspects of human society are often from Extremistan |
| Examples include height, weight, age, calorie consumption, IQ, mortality rates… | Examples include income, house prices, number of social media followers, financial markets, book sales by author, damage caused by natural disasters… |
(Adapted from Taleb, pg. 35, 36)
Once classified, Business Data Analysts can identify where and when Extremistan variables are used in the analysis, and whether they pose any risk to the accuracy or reliability of analytical outputs. In many cases, this can be done simply by identifying or estimating extreme data points (such as Jeff Bezos's net worth in the example above), adding them to the underlying data, and assessing their impact on the analysis.
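One way to carry out this check is a small stress test: compute each analytical output with and without a plausible extreme data point and compare the results. The sketch below is a minimal illustration, not a prescribed method. The `stress_test` helper, the income figures, and the extreme value of 5,000,000 are all hypothetical.

```python
import statistics


def stress_test(values, extreme_value, statistic_fns):
    """Return the relative change in each statistic when one extreme
    data point is added. Assumes no statistic is zero beforehand."""
    results = {}
    for name, fn in statistic_fns.items():
        before = fn(values)
        after = fn(list(values) + [extreme_value])
        results[name] = (after - before) / abs(before)
    return results


# Hypothetical annual incomes, for illustration only.
incomes = [42_000, 48_500, 51_000, 55_250, 60_000, 63_400, 70_000, 82_000]

report = stress_test(
    incomes,
    extreme_value=5_000_000,  # hypothetical extreme but legitimate income
    statistic_fns={"mean": statistics.fmean, "median": statistics.median},
)

for name, change in report.items():
    print(f"{name}: {change:+.1%}")
```

In this toy data the mean shifts several times over while the median moves by only a few percent. An output that swings widely under such a test is relying on the extreme value not occurring, which is the risk to be quantified and mitigated.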
Note that using Extremistan variables in the analysis is not necessarily a problem – it depends on how they have been analyzed and the insights that are drawn from the analysis. Some analytical and modeling techniques will be able to deal with Extremistan variables without introducing much risk. However, be wary of analysis that assumes Extremistan variables to be normally distributed and/or simply treats legitimate extreme values as outliers.
When classifying variables, it is also important to consider the scope of the data collection, and the context in which it is being analyzed. Take for example a sample of bakers who live in a certain region. You may want to use data collected from this sample to predict the income of other bakers in the same region. Assuming there are no issues with data quality and/or data collection, baker income is likely to be a variable in Mediocristan as there is a limit to how much bread a baker can bake in a day, and price/demand variability for baked goods is usually low. On the other hand, take the same example and replace ‘baker’ with ‘social media influencer’. A social media influencer's income is subject to a more complex and ephemeral range of factors, such as number of clicks, ‘fame’, the popularity of social media platforms, etc. As such, social media influencer income is more likely to be from Extremistan.
Conclusion
Data is an asset. Data-driven decision-making can help increase efficiency, drive innovation, and reduce bias in decision-making. However, it is important to understand that there are limits to the insights that can be drawn from a given dataset. By identifying variables that may be subject to extremes, analysts can ensure these variables are appropriately accounted for in analysis by assessing any risks, and ensuring analytical insights are considered in context.
But know this – variables from Extremistan are anything but normal!
References:
- Taleb, Nassim Nicholas, The Black Swan: The Impact of the Highly Improbable, Random House, 2007.
- Guide to Business Data Analytics: Getting Better Insights. Guide Better Informed Decision Making. IIBA, 2020.
- Normal Distribution, MathsIsFun.com, 2021. (Last accessed Jan 2022).
- Dolan, Kerry A., Forbes 35th Annual World’s Billionaires List: Facts and Figures 2021, Forbes, Apr 2021. (Last accessed Jan 2022).
- Penn, Amanda, Extremistan: Why Improbable Events Have a Huge Impact, Nov 2019. (Blog last accessed Jan 2022).