The Curse of Dimensionality

Jitesh Jain
Vision and Language Group
9 min read · Jun 10, 2020


Have you ever found yourself stuck in the middle of a problem with a large number of dimensions? You keep trying harder and harder to come up with a solution that satisfies every dimension of the problem, yet every time one dimension or another makes your efforts look futile! It turns out that you can solve a problem very efficiently when presented with no more than two or three dimensions. Nevertheless, this world is not kind to us every time! You might find yourself struggling amidst a problem, thinking about hundreds or even thousands of dimensions simultaneously!

Now, you may ask: “Wait! Does that happen in real life, or are you just making stuff up to sound cool?”

Well, it does happen! Fortunately, in our day-to-day lives, we do not observe these complications unless faced with an exceptionally complex (mathematical) task. Take some time to appreciate the ability of your mind to effortlessly solve complex tasks for you!

Indeed!

Again, you may ask: “Why should I care about this?”

With the rise of Artificial Intelligence, a large number of dimensions presents a problem that, if not taken care of, might render all the effort you put into developing an algorithm futile! You might end up confused and lost, wandering around your own model. You might feel as if you have been cursed by some evil power :P

The DIMENSIONALITY has cursed us!

All right! Now you understand the importance of confronting The Curse of Dimensionality. Let’s dive in to understand the concept behind the curse and how to nullify its effect!

What is the “curse”?

The curse of dimensionality, first introduced by mathematician Richard Bellman, indicates that the number of samples needed to estimate an arbitrary function with a given level of accuracy grows exponentially with respect to the number of input variables (i.e., dimensionality) of the function.

Let us understand with an example: we have six balls, three of which belong to the category BLUE, and the other three belong to the category RED. Now, let's say our pet dog wants to play only with the bluish balls. We want to develop an ML algorithm to cluster the balls so that we do not have to separate them ourselves.

We have two options for choosing our training set. Let us see what happens when we select each of them:

Training set with two features.
Our pet would be happy!

There are six balls, each of which is either bluish or reddish in color.

Two perfect clusters are formed by the learning algorithm.

Training set with six features.
Our algorithm botched it!

There are six balls, each one having a unique color.

Six clusters are formed by the learning algorithm.

The algorithm is stuck inside the vortex of dimensionality.

So, which one should we choose? The first one, right? Yes, indeed! The second training set suffers from The Curse of Dimensionality.
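To make this concrete, here is a minimal sketch in Python. The "blueness" scores and the one-hot color encoding are made-up stand-ins for the two training sets, not the exact features from the figures:

```python
# A minimal sketch of the six-ball example. The "blueness" scores and the
# one-hot encoding below are made-up stand-ins for the two training sets.
import numpy as np
from sklearn.cluster import KMeans

# Training set 1: a single informative feature (how blue each ball looks).
blueness = np.array([[0.90], [0.85], [0.80], [0.10], [0.15], [0.20]])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(blueness)
print("Set 1 clusters:", labels)        # two clean clusters: bluish vs. reddish

# Training set 2: six features, one unique color per ball (one-hot encoding).
unique_colors = np.eye(6)
dists = np.linalg.norm(unique_colors[:, None] - unique_colors[None, :], axis=-1)
print("Pairwise distances:\n", dists)   # every pair is exactly sqrt(2) apart,
                                        # so no meaningful clusters can form
```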

Now, we understand what the curse is. Let’s see what might be leading to it and its implications.

Why might this be happening?

Reaction after seeing results cursed by DIMENSIONALITY.

(Number of Features) > (Number of Observations)

  • The Curse of Dimensionality refers to the situation when our data has too many features; instead of learning the required target function, the algorithm gets caught in the vortex of dimensions.
  • It is similar to when someone tries to explain something to you in a complicated way, talking about things that have nothing to do with the topic at hand. The person is completely off the subject, buried in unimportant details.
  • In our example, training set 2 used the exact color of each ball as a feature, which made the number of dimensions (or features) six against only two meaningful categories (bluish or reddish).
  • The added features act as noise in the data, which decreases the accuracy of our model.

What are the implications?

In today’s big data world, when your data has a massive number of dimensions, the curse of dimensionality can create issues in your learning algorithm:

1. Overfitting

  • Overfitting occurs both when estimating relatively few parameters in a high-dimensional space and when estimating many parameters in a lower-dimensional space.
  • The variance of a parameter estimate increases as the number of parameters to be estimated increases, while the bias of the estimate and the size of the training set stay constant. This suggests that the quality of the parameter estimates decreases as the dimensionality rises, due to the increasing variance.
  • If we have more features than observations, then we run the risk of massively overfitting our model, which generally results in terrible out-of-sample performance, as the sketch below illustrates.
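Here is a rough illustration of that last point, using scikit-learn on purely random data. The sample and feature counts are arbitrary choices for the demonstration:

```python
# More features than observations on purely random data: the model memorizes
# noise. All the sizes below are arbitrary choices for the demonstration.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_train, n_test, n_features = 20, 200, 500   # far more features than training rows

X_train = rng.normal(size=(n_train, n_features))
y_train = rng.integers(0, 2, size=n_train)   # labels are pure noise
X_test = rng.normal(size=(n_test, n_features))
y_test = rng.integers(0, 2, size=n_test)

clf = LogisticRegression(max_iter=5000).fit(X_train, y_train)
print("Train accuracy:", clf.score(X_train, y_train))  # typically 1.0: memorized noise
print("Test accuracy:", clf.score(X_test, y_test))     # hovers around 0.5: no real signal
```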

2. Sparsity

  • The exponential growth in the volume of the space causes high sparsity in the data set, i.e., the volume of the space grows so quickly that the data cannot keep up and thus becomes sparse.

Let’s play a game: we have lost 5 coins and need to find them in three different situations:

First: Just one feature: on a 10 m line (1 dimension) — Easy to find the coins!

Second: Add a feature: on a 100 m² square — Need to put more effort now.

Third: Add another feature: inside a 1000 m³ cube — No more interested in playing!

With only three dimensions, the game becomes a nightmare!
  • The difficulty of finding the coins in a high-dimensional space implies a vital fact: points tend to be distributed more sparsely (i.e., they are more dissimilar to each other) in a high-dimensional space, because the Euclidean distance between two points is large as long as they are far apart along even a single dimension, even if they are quite close along all the other dimensions. The snippet below puts rough numbers on how quickly the search space blows up.
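A quick back-of-the-envelope calculation shows how fast the search space grows if we insist on the same sampling density in each situation (the "ten points per metre" density is an arbitrary assumption):

```python
# Back-of-the-envelope: how many sample points keep the same density as we
# add dimensions? Assume (arbitrarily) that we want 10 points per metre.
points_per_axis = 10 * 10                  # a 10 m side sampled at 10 points per metre

for n_dims in (1, 2, 3):
    needed = points_per_axis ** n_dims
    print(f"{n_dims}D search space: {needed:,} points for the same density")
# 1D: 100 points, 2D: 10,000 points, 3D: 1,000,000 points -- exponential blow-up
```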

3. Euclidean Distance

  • When we have too many features, observations become harder to cluster.
  • Too many dimensions cause every observation in your dataset to appear roughly equidistant from all the others. Because clustering uses a distance measure (e.g., Euclidean distance) to quantify the similarity between observations, this is a big problem.
  • If the distances are all approximately equal, then all the observations appear equally alike (as well as equally different), and no meaningful clusters can form, just as in our six-ball example with training set 2. The experiment below shows this directly.
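Here is a small experiment you can run to see this "distance concentration" for yourself. The sample sizes and dimensions are arbitrary choices:

```python
# As dimensionality grows, pairwise distances become nearly indistinguishable.
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
for n_dims in (2, 10, 100, 1000):
    points = rng.uniform(size=(500, n_dims))          # 500 random points in the unit hypercube
    dists = pdist(points)                             # all pairwise Euclidean distances
    contrast = (dists.max() - dists.min()) / dists.min()
    print(f"{n_dims:>4} dims: relative distance contrast = {contrast:.2f}")
# The contrast shrinks steadily: in high dimensions every point looks roughly
# equidistant from every other point.
```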

Let’s understand intuitively! Read the formula below:

The percentage of overall space filled by a hypersphere inside a hypercube.

The above formula gives the percentage of space filled by a hypersphere inscribed in a hypercube, for any number of dimensions n, where Γ(n/2 + 1) is the Gamma function. It is easy to see that this quantity approaches zero for higher values of n, which means that more and more of the volume (and hence of the data) concentrates in the corners of the hypercube, which not only increases sparsity but also renders the distance function obsolete.

The generalized formula for the volume of a hyper-sphere with radius r in n dimensions.
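For reference, the standard formulas the figures presumably show are V_n(r) = π^(n/2) r^n / Γ(n/2 + 1) for the volume of the hypersphere, and π^(n/2) / (2^n Γ(n/2 + 1)) for the fraction of the enclosing hypercube (of side 2r) that the inscribed hypersphere fills. A few lines of Python make the trend towards zero obvious:

```python
# Fraction of a hypercube (side 2r) filled by its inscribed hypersphere:
#     pi^(n/2) / (2^n * Gamma(n/2 + 1))
# which, notably, does not depend on r.
from math import pi, gamma

def hypersphere_fraction(n: int) -> float:
    """Fraction of the hypercube's volume occupied by the inscribed hypersphere."""
    return pi ** (n / 2) / (2 ** n * gamma(n / 2 + 1))

for n in (1, 2, 3, 5, 10, 20):
    print(f"n = {n:>2}: {hypersphere_fraction(n):.8f}")
# n = 2 gives pi/4 (about 0.785), n = 3 gives pi/6 (about 0.524), and by n = 20
# the fraction is already around 2.5e-8: almost all the volume sits in the corners.
```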

4. Computational Power

  • Working with data becomes more demanding as the number of dimensions increases. Like many challenges in data science, this boils down to combinatorics.
  • With high-dimensional data, we simply cannot comprehensively sample all the possible combinations, leaving vast regions of feature space in the dark.

5. Dimensional Redundancy

  • We may not even need to subject our machines to such demanding work. Having many dimensions is no guarantee that every dimension is especially useful. A lot of the time, we may be measuring the same underlying pattern in several different ways.
  • Generally, much of the value added by the extra dimensions in high-dimensional data is redundant for our algorithms.

Now, one question that might be going through your mind is:

“How to decide the best number of features for our training?”

Unfortunately, no fixed rule defines how many features we should use in a regression/classification problem. The magic number depends on the amount of training data available, the complexity of the decision boundaries, and the type of classifier used.

For example, if an unlimited number of training examples were available, we would not have to deal with this curse at all. We could simply use an infinite number of features to obtain a perfect classification. However, an infinite number of features requires an infinite number of training examples, eliminating the real-world usefulness of our network!

Hughes Phenomenon

Hughes phenomenon shows that as the number of features increases, the classifier's performance improves as well, until we reach the optimal number of features. Adding more features while keeping the training set the same size will then degrade the classifier's performance.

Most disconcertingly, the amount of training data needed increases exponentially with each added feature.

As the dimensionality increases, the classifier’s performance increases until we reach the optimal number of features. We notice that further increasing the dimensionality without increasing the number of training samples results in a decrease in classifier performance.

Breaking free of the CURSE

All right! So far, we have seen what the curse is and how it affects our models, and you might be waiting for the solution to the curse of dimensionality. Fortunately, some techniques help us combat this problem. Let's learn about them.

The saviors are here!

As you might have guessed, the best way to train our model on the limited training data is to reduce the number of features (or dimensions). The family of techniques capable of doing this is known as Dimensionality Reduction.

  • The algorithm starts by locating the underlying trends in our features, i.e., extracting the features out of the data.
  • In data science, these underlying trends hidden in our features are often called latent features.
  • Next, the algorithm estimates every original feature's exposure to these latent features and rewrites the data in terms of them.
  • We can then classify data points using distance measures.

Some of the techniques for Dimensionality Reduction are:

1. Principal Component Analysis (PCA)

  • PCA seeks to explain the correlation structure of a set of predictor variables, using a smaller set of linear combinations of these variables. These linear combinations are called components.
  • It means that the total variability of a data set produced by the complete set of m variables can be accounted for by a smaller set of k linear combinations. There is almost as much information in the k components as there is in the original m variables.
  • Check out this link for more information on PCA.
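As a quick illustration, here is a minimal PCA sketch with scikit-learn. The digits dataset and the choice of 10 components are arbitrary for this example:

```python
# A minimal PCA sketch with scikit-learn. The digits dataset and the choice
# of 10 components are arbitrary for this illustration.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, y = load_digits(return_X_y=True)          # 1797 images, 64 pixel features each
pca = PCA(n_components=10).fit(X)
X_reduced = pca.transform(X)

print("Original shape:", X.shape)            # (1797, 64)
print("Reduced shape:", X_reduced.shape)     # (1797, 10)
print("Variance explained:", pca.explained_variance_ratio_.sum())
# A handful of components already accounts for a large share of the total
# variability in the original 64 features.
```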

2. Linear Discriminant Analysis (LDA)

PCA seeks axes that best describe the variation within the data. Linear Discriminant Analysis (LDA) finds axes that best discriminate between two or more classes within the data.

We can achieve the task by calculating two measures:

  • within-class variance
  • between-class variance.

The objective is to maximize the ratio of between-class variance to within-class variance, so that there is minimal variance within each class and maximal variance between the classes, as the short example below shows.
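And a minimal LDA sketch, again with scikit-learn. The iris dataset is an arbitrary choice; note that LDA can project onto at most one fewer axes than there are classes:

```python
# A minimal LDA sketch with scikit-learn. The iris dataset is an arbitrary
# choice; with 3 classes, LDA can project onto at most 2 discriminant axes.
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)            # 150 samples, 4 features, 3 classes
lda = LinearDiscriminantAnalysis(n_components=2).fit(X, y)
X_reduced = lda.transform(X)

print("Reduced shape:", X_reduced.shape)     # (150, 2)
print("Training accuracy:", lda.score(X, y)) # LDA also doubles as a classifier
```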

For more methods on how to combat the curse of dimensionality, check out this link.

Conclusion

Today, we confronted the Curse of Dimensionality. We saw the reasons behind the curse, how it affects our algorithms, and, at the end, ways to break free of it.

A careful choice of the number of dimensions (features) to be used is the prerogative of the data scientist training the network. In general, the smaller the size of the training set, the fewer features one should use. One must keep in mind that each added feature increases the data set requirement exponentially.

Now, we understand the importance of selecting the right number of dimensions for our training data!

Good Job :))

Congratulations on making it through the article!

NOTE: The animations were created using the manim engine. The code for the animations can be found on GitHub.
