Interpretable Machine Learning in Finance: A Case Study Using Symbolic Regression (Part 1)

One of the most well-known anomalies in finance is the January Effect, which suggests that small-cap stocks, such as those in the Russell 2000, tend to outperform large-cap stocks, like those in the S&P 500, during the first month of the year. Several explanations have been proposed, including tax-loss harvesting, window-dressing by institutions, and behavioral influences. However, this outperformance is not static: empirical studies across different time frames show that the effect has diminished. Beyond January, the relative performance of the two indices does not follow a consistent pattern either. For example, historical data indicate that small-cap stocks can sometimes sustain momentum, particularly in early bull markets, while large-cap stocks tend to provide more stability over time. Macroeconomic conditions, interest rates, and investor sentiment also influence these trends, making them less predictable in modern markets.

Given the numerous variables influencing the relative performance of the Russell 2000 compared to the S&P 500—and the inconsistent, often nonlinear patterns observed over time—traditional modeling techniques often fall short of capturing the full complexity of the relationship. Machine learning offers greater flexibility and adaptability in addressing such dynamics. Among these techniques, symbolic regression stands out for its ability to uncover not only accurate predictive models but also interpretable mathematical expressions that reveal the underlying structure of the data. In this article, we apply symbolic regression to investigate the relative performance of the Russell 2000 versus the S&P 500.

Symbolic regression aims to identify the optimal mathematical equation—such as y = a · log(x) + b—that accurately captures the underlying patterns in a given dataset.

Mathematically, it involves searching through combinations of:

  • Operators: +, −, ×, ÷
  • Functions: log, exp, sin, sqrt, etc.
  • Constants
  • Variables: x₁, x₂, …, xₙ
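To make this search space concrete, here is a minimal Python sketch of how a candidate equation such as y = a · log(x) + b can be encoded and evaluated as an expression tree. The tuple-based encoding and the `evaluate` helper are illustrative assumptions, not the API of any particular symbolic-regression library:

```python
import math

# A tiny expression-tree sketch: nodes are tuples such as
#   ("+", left, right)  for binary operators,
#   ("log", child)      for unary functions,
#   ("x", i)            for variable x_i,
#   ("const", c)        for constants.
def evaluate(node, x):
    """Recursively evaluate an expression tree on input vector x."""
    kind = node[0]
    if kind == "const":
        return node[1]
    if kind == "x":
        return x[node[1]]
    if kind == "log":
        return math.log(evaluate(node[1], x))
    if kind == "+":
        return evaluate(node[1], x) + evaluate(node[2], x)
    if kind == "*":
        return evaluate(node[1], x) * evaluate(node[2], x)
    raise ValueError(f"unknown node type: {kind}")

# y = 2.0 * log(x_0) + 0.5, encoded as a tree
tree = ("+", ("*", ("const", 2.0), ("log", ("x", 0))), ("const", 0.5))
print(evaluate(tree, [math.e]))  # 2.0 * log(e) + 0.5 = 2.5
```

A symbolic-regression search explores many such trees, scoring each against the data and keeping the best-performing ones.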

Symbolic regression, particularly when implemented through genetic programming (GP), supports a range of transformation operations—some standard, others optional—each playing a distinct role in shaping and refining the population of expression trees. The two core operations are crossover and mutation, described below.

  Transformation   What It Does
  --------------   ------------
  Crossover        Combines two parent trees by swapping subtrees
  Mutation         Modifies one parent by replacing a random subtree


Crossover is a process where two “parent” mathematical expressions exchange parts of their structure (subtrees) to produce a new “child” expression. Consider Parent A = sin(x + 1) and Parent B, whose tree contains the subtree y * 3:

In the above example, a crossover operation was performed by selecting the + subtree (i.e., x + 1) from Parent A and the * subtree (i.e., y * 3) from Parent B. The + subtree in Parent A was replaced with the * subtree from Parent B, resulting in a new expression:
Child = sin(y * 3).

This child inherits:

  • The outer function (sin) from Parent A
  • The inner multiplication logic (y * 3) from Parent B
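The crossover step above can be sketched in a few lines of Python. The nested-list tree encoding, the `crossover` helper, and the fixed replacement path are illustrative assumptions; a real GP system would choose crossover points at random:

```python
import copy

# Trees as nested lists: [op, child, ...]; leaves are plain strings.
parent_a = ["sin", ["+", "x", "1"]]  # sin(x + 1)
parent_b = ["*", "y", "3"]           # the y * 3 subtree from Parent B

def crossover(parent_a, parent_b, path_a):
    """Replace the subtree of parent_a at path_a with a copy of parent_b.
    path_a is a list of child indices; here it is fixed for illustration,
    whereas genetic programming would sample it at random."""
    child = copy.deepcopy(parent_a)
    node = child
    for idx in path_a[:-1]:
        node = node[idx]
    node[path_a[-1]] = copy.deepcopy(parent_b)
    return child

# Swap the "+" subtree (x + 1) in Parent A for the "*" subtree of Parent B
child = crossover(parent_a, parent_b, path_a=[1])
print(child)  # ['sin', ['*', 'y', '3']], i.e. sin(y * 3)
```

Because the helper deep-copies both parents, the originals survive unchanged and can be reused in later generations.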

In parallel, mutation introduces diversity into the population by randomly modifying parts of an expression tree—such as replacing a node, altering a function, or adjusting a constant. This stochastic alteration helps the algorithm escape local optima and explore a broader search space, thereby increasing the likelihood of discovering high-performing symbolic expressions.

Here is an example:

The mutation illustrated above selects a random node in the tree (in this case, the constant 2) and replaces it with a randomly generated subtree, for example sin(y).
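Mutation can be sketched in the same nested-list encoding. The `mutate` helper and its tiny menu of replacement subtrees are illustrative assumptions; a real GP system would generate replacement subtrees from its full operator and function set:

```python
import copy
import random

def mutate(tree, rng):
    """Pick a random node position in the tree and replace it with a
    freshly generated subtree (drawn here from a small fixed menu)."""
    child = copy.deepcopy(tree)
    # Collect (parent_node, index) pairs for every replaceable position;
    # index 0 is the operator label, so children start at index 1.
    positions = []
    stack = [child]
    while stack:
        node = stack.pop()
        for i in range(1, len(node)):
            positions.append((node, i))
            if isinstance(node[i], list):
                stack.append(node[i])
    parent, i = rng.choice(positions)
    parent[i] = rng.choice([["sin", "y"], ["cos", "x"], "0"])
    return child

rng = random.Random(0)
# Starting from x + 2, one randomly chosen node is replaced,
# e.g. the constant 2 might become sin(y), yielding x + sin(y).
print(mutate(["+", "x", "2"], rng))
```

Repeated over many generations, this random perturbation keeps the population diverse and lets the search escape local optima, as described above.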

To be Continued in Part 2
