Introducing an AI Technique: Inverse Reinforcement Learning with AAII Investor Survey Data

Inverse Reinforcement Learning (IRL) is a machine learning method in which an agent learns a reward function, a representation of what an expert (e.g., a skilled human or high-performing agent) is trying to achieve, by observing their behavior rather than being directly told what to aim for through explicit rewards or instructions. In everyday life, we do this naturally: you might watch an experienced chef and, without a step-by-step explanation, infer the principles behind their cooking. The inferred reward function explains why the expert behaves as they do, under the assumption that their actions are approximately optimal. This reward function can then be used to train a policy (often through standard reinforcement learning) to replicate the expert’s behavior.

This works because an expert’s actions are rarely random—they reflect underlying preferences or priorities. For example, a skilled driver might slow down near crosswalks, avoid certain lanes, and accelerate smoothly. They don’t explicitly say, “I do this because I value safety, comfort, and efficiency,” but IRL seeks to infer such a reward function from their behavior. Once learned, this reward function can be used to train other agents to act similarly, even in novel situations not present in the original demonstrations.

A Simplified Behavioral Finance Example

Suppose we’re watching an experienced investor make decisions in a simplified market.
Market “states” are described by two factors:

Recent Return (RR) – the % change over the past 3 days
Volatility (Vol) – 0 = Low, 1 = High

The investor can take two actions:

Buy (Risk-On)
Sell (Risk-Off)

Step 1 – Observations

Day | Recent Return (RR) | Volatility (Vol) | Action
1   | +0.02              | 0                | Buy
2   | +0.01              | 1                | Buy
3   | -0.03              | 0                | Sell
4   | -0.02              | 1                | Sell
5   | +0.04              | 0                | Buy
6   | +0.03              | 1                | Buy
7   | -0.01              | 0                | Sell
8   | -0.04              | 1                | Sell

Step 2 – Hypothesize reward structure

The “reward” is the score an expert would give themselves for making a particular choice in a particular situation.
It’s not money or points they literally collect — it’s an abstract number that says, “This decision feels right because it moves me toward my goal.”

The reward function is the full set of rules for how those scores are assigned.
Think of it as a hidden formula in the expert’s head: it takes in the details of the current situation (market trend, volatility, etc.) and outputs a score for each possible action (buy or sell in this case). The expert then picks the action with the highest score.

Let’s assume the reward function is linear in the features:

R(s, a) = w_RR(a) ⋅ RR + w_Vol(a) ⋅ Vol + b(a)

Where:

w_RR(a) = weight for recent returns (momentum sensitivity)
w_Vol(a) = weight for volatility (risk attitude)
b(a) = an action-specific bias (e.g., a baseline preference for buying vs. selling)
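This scoring rule can be written as a small function. The weight values below are illustrative placeholders, not quantities inferred from the data:

```python
# Sketch of the assumed linear reward R(s, a); the weights are
# hypothetical placeholders, not values fitted from the observations.
def reward(rr, vol, action, weights):
    """Score one action in state (RR, Vol) under action-specific weights."""
    w_rr, w_vol, b = weights[action]
    return w_rr * rr + w_vol * vol + b

# Hypothetical weight table: {action: (w_RR, w_Vol, bias)}
weights = {"Buy": (10.0, 0.0, 0.0), "Sell": (-10.0, 0.0, 0.0)}

# The expert is assumed to pick the action with the highest score.
state = (0.02, 0)  # RR = +0.02, Vol = 0 (Day 1 in the table)
best = max(weights, key=lambda a: reward(*state, a, weights))
print(best)  # Buy
```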

Step 3 – Reverse engineering (IRL logic)

From the data:

On all positive RR days, they always Buy.
On all negative RR days, they always Sell.
Volatility changes (0 or 1) do not affect the decision in this sample.

A simple fitted “reward” (the IRL counterpart of a regression model, estimating how much each factor contributes to the decision) could be:

Action | Reward formula
Buy    | 10 ⋅ RR
Sell   | −10 ⋅ RR
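With the Step 1 observations hard-coded, a few lines verify that this fitted reward reproduces every logged decision:

```python
# Check that the fitted reward (Buy: 10*RR, Sell: -10*RR) reproduces
# every observed decision from the Step 1 table.
observations = [  # (RR, Vol, action)
    (0.02, 0, "Buy"), (0.01, 1, "Buy"), (-0.03, 0, "Sell"), (-0.02, 1, "Sell"),
    (0.04, 0, "Buy"), (0.03, 1, "Buy"), (-0.01, 0, "Sell"), (-0.04, 1, "Sell"),
]

def predict(rr):
    # Choose the action whose fitted reward is higher; Vol carries zero weight.
    return "Buy" if 10 * rr > -10 * rr else "Sell"

matches = sum(predict(rr) == action for rr, vol, action in observations)
print(f"{matches}/8 decisions reproduced")  # 8/8 decisions reproduced
```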

Step 4 – Testing on new states

If tomorrow:

• RR = +0.015, Vol = 0 → Buy reward = 10 × 0.015 = 0.15, Sell reward = −10 × 0.015 = −0.15 → Predict Buy
• RR = −0.025, Vol = 1 → Buy reward = 10 × (−0.025) = −0.25, Sell reward = −10 × (−0.025) = 0.25 → Predict Sell

Even without seeing every possible combination in the data, the IRL-derived reward function can generalize to new situations.

Behavioral Finance Insight

This pattern reflects a pure momentum bias—the investor’s decisions are entirely driven by whether recent returns are positive or negative, with no adjustment for volatility risk. In behavioral finance terms, this is consistent with return-chasing behavior, where past winners are bought and past losers are sold regardless of market conditions. Such a strategy can perform well during sustained trends but may underperform in choppy markets where momentum signals reverse quickly. The absence of a volatility penalty also suggests risk neglect, a bias where the investor focuses on recent performance while overlooking risk factors that could affect outcomes.

In the above momentum example, the “expert” is an investor whose buy/sell choices you’ve logged for several days. You notice a pattern: they buy when recent returns are positive, and sell when recent returns are negative. But instead of just eyeballing the pattern, IRL formalizes it:

• Record behavior – We feed the AI the full history of market conditions and the corresponding actions the investor took.
• Propose possible explanations – The AI considers many possible “hidden scoring systems” that might be driving those choices. For example:
  • Maybe the investor rewards themselves mentally for buying after gains.
  • Maybe they reward avoiding high volatility.
  • Maybe they value both in combination.
• Test and refine – For each guess at the scoring system, the AI simulates what the investor would have done if they were truly using it. It keeps adjusting the weights on different factors until the simulated behavior matches the actual behavior as closely as possible.
• Pick the best match – The final “reward function” is the one that, when used to make decisions, produces patterns most like the expert’s real decisions.
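The test-and-refine loop above can be sketched in code. This is a minimal sketch assuming a softmax (maximum-entropy) choice model over linear rewards, fitted to the momentum example’s eight observations; for one-shot decisions like these the loop reduces to multinomial logistic regression, and the learning rate and iteration count are arbitrary choices:

```python
import numpy as np

# "Propose, test, refine": fit action-specific weights so that a softmax
# over candidate rewards matches the logged decisions as closely as possible.
states = np.array([[0.02, 0], [0.01, 1], [-0.03, 0], [-0.02, 1],
                   [0.04, 0], [0.03, 1], [-0.01, 0], [-0.04, 1]])
actions = np.array([0, 0, 1, 1, 0, 0, 1, 1])  # 0 = Buy, 1 = Sell

W = np.zeros((2, 2))  # one (RR, Vol) weight vector per action
for _ in range(2000):  # gradient ascent on the log-likelihood
    logits = states @ W.T                       # reward of each action per week
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)   # softmax policy pi(a|s)
    onehot = np.eye(2)[actions]
    W += 0.5 * (onehot - probs).T @ states      # observed minus expected features

pred = (states @ W.T).argmax(axis=1)
print((pred == actions).mean())  # fraction of decisions reproduced
```

The fitted weights recover the pattern eyeballed above: a large positive RR weight for Buy, its mirror image for Sell, and a near-zero weight on volatility.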



Retail Investor Sentiment Analysis with IRL and AAII Survey Data

The AAII Investor Sentiment Survey is a long-running weekly poll of individual investors conducted by the American Association of Individual Investors. It measures the percentage of respondents who are bullish, neutral, or bearish on the stock market over the next six months. In this project, we aligned the AAII survey results with key market data such as the S&P 500 Index (SPX) returns, volatility (VIX) levels, and drawdowns (the percentage decline from a recent peak in the market). Using IRL, we treated the reported sentiment as observed “actions” taken under specific market “states.” The IRL model then inferred the underlying “reward” structure (the implicit incentives or conditions that may drive shifts between bullish, neutral, and bearish sentiment), and visualized these patterns with policy maps to show how sentiment changes across different combinations of market conditions.

The action space and state space used in our analysis are as follows:

Action Space (how labels are assigned):

Each week, the AAII survey reports the percentage of respondents who are Bullish, Neutral, or Bearish about the stock market’s outlook.
In our model, we assign the action for that week to the sentiment category with the highest percentage:

• +1 = Bull — Bullish percentage is the highest among the three.
• 0 = Neutral — Neutral percentage is the highest.
• –1 = Bear — Bearish percentage is the highest.

This way, the “action” essentially reflects the dominant sentiment of the surveyed investors in that particular week.
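A minimal sketch of this labeling rule (the sample percentages below are illustrative, not actual AAII readings):

```python
# Assign each week's action to the sentiment category with the
# highest reported percentage: +1 = Bull, 0 = Neutral, -1 = Bear.
def dominant_sentiment(bull_pct, neutral_pct, bear_pct):
    """Return the action label for one week of AAII survey percentages."""
    shares = {1: bull_pct, 0: neutral_pct, -1: bear_pct}
    return max(shares, key=shares.get)

print(dominant_sentiment(45.0, 30.0, 25.0))  # 1  (bullish dominates)
print(dominant_sentiment(28.0, 30.0, 42.0))  # -1 (bearish dominates)
```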

State Space (the market conditions investors see before choosing):
Each week’s state is described by four market variables:

• 1-week SPX return – how much the S&P 500 changed over the past week in percentage terms.
• 3-week backward-looking SPX return – the market’s performance over the prior three weeks, excluding the current week.
• VIX level – a measure of expected market volatility (the “fear index”).
• Drawdown – the percentage drop from the market’s highest level over the past 52 weeks.
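The four state variables can be sketched in pandas as follows. The price series here is synthetic, and the column names are our own labels; the project’s actual data pipeline may differ:

```python
import numpy as np
import pandas as pd

# Build the four weekly state variables from an illustrative (synthetic)
# weekly SPX close series and a stand-in for the VIX.
rng = np.random.default_rng(0)
spx = pd.Series(4000 * np.cumprod(1 + rng.normal(0, 0.01, 120)))

state = pd.DataFrame({
    # 1-week return
    "spx_ret_1w": spx.pct_change(fill_method=None),
    # prior 3 weeks, excluding the current week
    "spx_ret_3w_back": spx.shift(1).pct_change(3, fill_method=None),
    # stand-in for the VIX level
    "vix_level": pd.Series(rng.uniform(12, 35, 120)),
    # percentage drop from the 52-week high (<= 0 by construction)
    "drawdown": spx / spx.rolling(52, min_periods=1).max() - 1,
}).dropna()

print(state.head())
```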

Our analysis was implemented in Python using several libraries:

• NumPy – numerical calculations, vectorized operations, and linear algebra at scale.
• Pandas – loading, cleaning, and transforming the AAII survey and market data into a structured dataset suitable for modeling.
• Matplotlib – visualizing learned reward weights, time-series signals, and reward/policy surfaces.
• Scikit-learn – splitting data into training and testing periods, normalizing state variables for fair weight comparison, and computing performance metrics to evaluate how well the inferred policy reproduces observed investor actions.

The main outputs of our model are presented below. We begin by showing the weights associated with different market conditions that drive each type of investor action.

From the chart of reward weights for the bullish (risk-on) action:

spx_ret_1w = +0.383, spx_ret_3w_back = +0.311 → momentum helps; fresh and multi-week strength both tilt bullish.
drawdown = +0.765 → the strongest driver: near highs / shallow drawdowns favor risk-on.
vix_level = −0.093 → higher volatility slightly suppresses risk-on (a small effect compared with drawdown and returns).

From the chart of reward weights for the bearish (risk-off) action:

spx_ret_1w = −0.201, spx_ret_3w_back = −0.338 → recent weakness favors risk-off.
vix_level = +0.602 → high VIX strongly pushes risk-off.
drawdown = −0.588 → deeper drawdowns (more negative values) raise the risk-off reward (negative × negative = positive).
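To see how these two sets of weights interact, we can score a hypothetical stressed market. This sketch assumes the state variables were standardized (z-scored), so a value of −1 means one standard deviation below average:

```python
# Score the bullish and bearish actions for one hypothetical (standardized)
# state, using the weight values reported above.
w_bull = {"spx_ret_1w": 0.383, "spx_ret_3w_back": 0.311,
          "vix_level": -0.093, "drawdown": 0.765}
w_bear = {"spx_ret_1w": -0.201, "spx_ret_3w_back": -0.338,
          "vix_level": 0.602, "drawdown": -0.588}

# A stressed market: falling returns, elevated VIX, deep drawdown.
state = {"spx_ret_1w": -1.0, "spx_ret_3w_back": -1.0,
         "vix_level": 1.5, "drawdown": -1.5}

r_bull = sum(w_bull[k] * state[k] for k in state)
r_bear = sum(w_bear[k] * state[k] for k in state)
print(f"bull: {r_bull:.3f}, bear: {r_bear:.3f}")  # bull: -1.981, bear: 2.324
```

As expected, the bearish action earns the higher reward in this state, so the inferred policy would predict risk-off sentiment.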

While the “bull” and “bear” results are based on investors taking clear directional stances, the “neutral” category is trickier to interpret. Neutral sentiment can reflect a wide range of underlying intentions — from genuine indecision, to short-term caution before a move, to simply staying out of the market. Because these motivations are diverse and not directly observable in the AAII survey, the model’s weights for neutral should be taken with a grain of salt. They are less likely to capture a single, consistent reward pattern compared to the more clearly defined bull or bear actions.

Next, we show the policy maps. In IRL, a policy map is the learned function that connects states (market conditions) to actions (investor stances). Once the model infers the underlying reward structure, it can derive a policy π(a∣s), which tells us how likely an investor is to take a bullish, bearish, or neutral stance in a given situation.

Each point on the map represents a combination of two state variables (for example, recent SPX returns and VIX levels). The model then applies its learned reward function to that state and selects the action with the highest preference. The color on the map shows which action the model would favor in that region of the state space—bullish, bearish, or neutral. Importantly, the map is not just a coloring of actual historical outcomes; rather, it is a theoretical surface generated from the estimated reward weights, applied across the full range of observed state variables. This allows us to see how the model generalizes its decision-making beyond the exact points in the training sample.
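A map of this kind can be sketched by sweeping two standardized state variables over a grid while holding the others at their mean. The weights are the bull/bear values reported above; giving the neutral action a reward of zero is a simplifying assumption of this sketch:

```python
import numpy as np

# Sweep standardized 1-week return and VIX level over a grid, holding the
# other two features at their mean (0), and record the argmax action.
ret_grid = np.linspace(-2, 2, 50)   # standardized 1-week SPX return
vix_grid = np.linspace(-2, 2, 50)   # standardized VIX level
R, V = np.meshgrid(ret_grid, vix_grid)

r_bull = 0.383 * R - 0.093 * V      # bull weights from the chart above
r_bear = -0.201 * R + 0.602 * V     # bear weights from the chart above
r_neut = np.zeros_like(R)           # simplifying assumption: neutral reward = 0

# 0 = neutral, 1 = bull, 2 = bear; coloring this array draws the policy map
policy = np.argmax(np.stack([r_neut, r_bull, r_bear]), axis=0)
print(policy[0, -1], policy[-1, 0])  # 1 2: bull in a calm rally, bear in a volatile selloff
```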


Below, we present additional outputs from the analysis, such as accuracy and action biases, to give an idea of the model’s overall performance:

While our analysis here uses the linear Maximum Entropy IRL framework, a natural next step would be to extend this approach to a nonlinear setting—for example, by using a small neural network to approximate the reward function. This would allow the model to capture more complex, potentially nonlinear relationships between market conditions and investor behavior that a linear model may miss. We leave this nonlinear MaxEnt IRL approach as an avenue for future exploration.
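As a purely schematic illustration of that extension, the linear reward could be replaced with a small multilayer perceptron. The network below is randomly initialized and untrained; in a real extension its weights would be fitted with the same MaxEnt IRL objective:

```python
import numpy as np

# Hypothetical nonlinear reward: a one-hidden-layer MLP mapping a 4-dim
# state (the four market variables) to one reward per action.
rng = np.random.default_rng(42)
W1, b1 = rng.normal(0, 0.1, (8, 4)), np.zeros(8)   # hidden layer
W2, b2 = rng.normal(0, 0.1, (3, 8)), np.zeros(3)   # one output per action

def reward_net(state):
    """Return rewards for (bear, neutral, bull) given a standardized state."""
    h = np.tanh(W1 @ state + b1)   # nonlinear feature transform
    return W2 @ h + b2

r = reward_net(np.array([0.5, -0.2, 1.0, -0.3]))
print(r.shape)  # (3,)
```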
