Minkey Chang

Data Scientist / AI Engineer

Can we make a more risk-aware portfolio agent from utility theory?

Portfolio RL usually means a Markov setup, a scalar reward each step (return, or something like mean–variance), and PPO or A2C with a discounted value. That is easy to implement, but it folds risk attitude and patience into a single discount factor and one reward stream. Asset-pricing models often separate risk aversion from intertemporal substitution; recursive utility (Epstein–Zin) is the textbook way to do that.

This post looks at using that objective inside actor–critic training when the Bellman equation is no longer a straight discounted sum. Below: what changes in the math, how I set up the environment and backup, what happened on South Korean ETF data, and where the approach remains limited.


What changes if you leave plain discounting

With discounted return, the value backup is linear in expectations in the usual way. Recursive utility instead mixes current utility with a certainty equivalent (CE) of continuation value. γ (relative risk aversion) controls how harshly bad tails in the distribution of future value are penalized; ψ (the elasticity of intertemporal substitution) controls patience versus willingness to move utility across time. You are no longer forced to express both attitudes with a single discount factor.

The headache is practical: the CE is a non-linear function of the distribution of V(s′), so under real returns you rarely get a closed form. The workaround I used is Monte Carlo: sample K next states, evaluate the critic on each, and build an empirical power-mean approximation to the CE. That feeds both the critic target and a Bellman-residual-style advantage for PPO / A2C. It is noisy and depends on the sample count K and on how next-period returns are modeled; more of an engineering knob than a guarantee of stable training.

The pieces in math (minimal)

I’ll keep notation light and focus on the parts that differ from the usual discounted backup.

  • State: s_t includes log-wealth w_t and last weights alpha_{t-1}.
  • Action: choose new weights alpha_t on the simplex.
  • Wealth update:
w_{t+1} = w_t + log(1 + alpha_t^T R_{t+1})
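As a sanity check, the wealth transition above is a one-liner (NumPy; the function and variable names here are illustrative, not from the implementation):

```python
import numpy as np

def wealth_update(w_t: float, alpha_t: np.ndarray, R_next: np.ndarray) -> float:
    """Log-wealth transition: w_{t+1} = w_t + log(1 + alpha_t^T R_{t+1})."""
    return w_t + np.log1p(alpha_t @ R_next)

# Example: 60/40 weights, next-period simple returns of +1% and -0.5%
w_next = wealth_update(0.0, np.array([0.6, 0.4]), np.array([0.01, -0.005]))
```

Using `log1p` keeps the update numerically stable when the portfolio return is close to zero.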

Recursive utility uses a certainty equivalent (CE) of continuation value. A convenient Monte Carlo approximation is:

CE_hat_t = [ (1/K) * sum_{k=1..K} V_phi(s'_{t+1}^{(k)})^(1-gamma) ]^(1/(1-gamma))
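A minimal sketch of that empirical power mean, with the critic evaluations passed in as a plain array (names are mine; values must be positive for the power mean to be defined):

```python
import numpy as np

def ce_hat(values: np.ndarray, gamma: float) -> float:
    """Empirical certainty equivalent: power mean with exponent 1 - gamma.

    `values` are critic evaluations V_phi(s') over K sampled next states.
    """
    if abs(1.0 - gamma) < 1e-8:          # gamma -> 1 limit: geometric mean
        return float(np.exp(np.mean(np.log(values))))
    p = 1.0 - gamma
    return float(np.mean(values ** p) ** (1.0 / p))

vals = np.array([1.0, 2.0, 4.0])
# gamma = 0 reduces to the plain mean; larger gamma penalizes low draws harder
assert np.isclose(ce_hat(vals, 0.0), vals.mean())
assert ce_hat(vals, 5.0) < ce_hat(vals, 2.0) < ce_hat(vals, 0.0)
```

The monotone ordering in the assertions is exactly the pessimism knob: raising γ drags the CE toward the worst sampled continuation value.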

Then you plug it into a value target. In words: combine a “current-period” term (through κ and wealth) with β times the CE, then map back through the Epstein–Zin aggregator. The implementation I used trains:

  • Critic target:
V_phi(s_t) <- T_hat_t^EZ(kappa_t, w_t, CE_hat_t)
  • Advantage:
A_t = T_hat_t^EZ - V_phi(s_t)

This is why the objective is tied to critic-based methods here: V_phi appears inside CE_hat_t and in the residual.
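The post does not spell out the aggregator, so the sketch below uses the standard Epstein–Zin CES form with rho = 1 - 1/psi; the exact weighting of κ is an assumption, and all names are illustrative:

```python
import numpy as np

def ez_target(kappa_t: float, ce_t: float, beta: float, psi: float) -> float:
    """Epstein-Zin aggregator (standard CES form, assumed here):
    T = [(1 - beta) * kappa^rho + beta * CE^rho]^(1/rho), rho = 1 - 1/psi.
    kappa_t is the current-period consumption-like scalar; ce_t the MC CE.
    """
    rho = 1.0 - 1.0 / psi
    return ((1.0 - beta) * kappa_t ** rho + beta * ce_t ** rho) ** (1.0 / rho)

def advantage(target: float, v_s: float) -> float:
    """Bellman-residual-style advantage: A_t = T_hat - V_phi(s_t)."""
    return target - v_s

# With kappa = CE = 1 the aggregator returns 1 regardless of beta, psi
t = ez_target(kappa_t=1.0, ce_t=1.0, beta=0.95, psi=1.5)
```

Both the critic target and the advantage come out of the same `ez_target` evaluation, which is what ties the whole scheme to critic-based methods.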


Environment and backup in plain terms

State: log wealth w_t and last period’s weights on the simplex (long-only, weights summing to one). Actions: adjust weights via increments from the previous allocation, then project back onto the simplex; this is easier to learn than predicting raw weights from scratch.
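The increment-then-project action step can be sketched with the standard sorting-based Euclidean projection onto the simplex (Duchi et al., 2008); the helper names are mine, not from the implementation:

```python
import numpy as np

def project_to_simplex(v: np.ndarray) -> np.ndarray:
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1}
    via the sorting-based algorithm of Duchi et al. (2008)."""
    u = np.sort(v)[::-1]                 # sort descending
    css = np.cumsum(u)
    k = np.arange(1, len(v) + 1)
    cond = u + (1.0 - css) / k > 0       # find the active support size
    rho = k[cond][-1]
    theta = (1.0 - css[rho - 1]) / rho
    return np.maximum(v + theta, 0.0)

def step_weights(prev: np.ndarray, delta: np.ndarray) -> np.ndarray:
    """Incremental action: nudge previous weights, then project back."""
    return project_to_simplex(prev + delta)

w = step_weights(np.array([0.5, 0.3, 0.2]), np.array([0.2, -0.1, 0.0]))
```

The projection guarantees the new weights are long-only and sum to one regardless of what increment the policy emits.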

Value recursion: a current-period piece plus β times the CE of V(s_{t+1}). I used a consumption-like scalar κ, scaled off wealth, as an accounting handle so the recursion stays well defined for an agent that does not literally consume. With γ > 1, the CE is more pessimistic about left-tail future value than a plain expectation, provided the Monte Carlo piece is doing its job.

Training: critic MSE to the recursive target; advantage from the target minus V_phi(s_t) (with an optional multi-step variant). Only critic-based methods work in this wiring: the value network has to appear inside the CE and in the residual. I initialized the policy from Campbell–Viceira-style rules so training did not start from a random corner of the simplex.
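The critic update itself is ordinary MSE regression toward the recursive target. A toy version with a linear critic and a hand-written gradient step (the actual implementation presumably uses a neural network and an autodiff optimizer):

```python
import numpy as np

def critic_mse_step(phi: np.ndarray, features: np.ndarray,
                    targets: np.ndarray, lr: float = 1e-2) -> np.ndarray:
    """One gradient step on MSE between a linear critic V_phi(s) = phi @ x(s)
    and the recursive targets T_hat (stand-ins for the neural critic)."""
    preds = features @ phi
    grad = 2.0 * features.T @ (preds - targets) / len(targets)
    return phi - lr * grad

# Tiny example: two states, identity features, fixed recursive targets
phi0 = np.zeros(2)
X = np.eye(2)
y = np.array([1.0, -1.0])
phi1 = critic_mse_step(phi0, X, y)
```

One step should lower the squared error toward the targets; everything recursive lives in how `y` was built, not in the regression itself.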


What I ran on Korean ETFs

Data: daily Korean ETF prices, 110 names after basic liquidity and missing-data screens. Ten chronological train/test splits (train share from 50% to 90%); I report mean ± std over splits, as in FinRL-style setups.
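The expanding chronological splits are straightforward to generate; this sketch assumes a 1,000-day sample purely for illustration:

```python
import numpy as np

def chrono_splits(n_days: int, shares=np.linspace(0.5, 0.9, 10)):
    """Ten chronological train/test splits: train share from 50% to 90%,
    test is everything after the cut (no shuffling, no leakage)."""
    for s in shares:
        cut = int(n_days * s)
        yield np.arange(cut), np.arange(cut, n_days)

splits = list(chrono_splits(1000))
```

Each test window starts where its train window ends, so the mean ± std across splits averages over different market regimes rather than over resampled data.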

Objectives compared under the same environment: naive discounted return, Markowitz-style per-step mean–variance reward, and recursive utility as above. Algorithms included Random, REINFORCE, A2C, and PPO; recursive targets were only hooked up for A2C and PPO.

PPO snapshot (test): in this setup, recursive scored higher than naive on the main table: Sharpe roughly 2.07 ± 1.04 vs 1.22 ± 1.07, max drawdown roughly 10.4% vs 12.3%, and cumulative return positive on average for recursive vs negative for naive, with volatility in a similar band. Markowitz under PPO sat between the two on Sharpe. The spread across splits and seeds is still wide, so the ranking should be treated as unstable.

Baselines: equal-weight (1/N) still beat these PPO runs on Sharpe in the same experiment log. So the comparison that holds is recursive vs naive under identical RL code, not “beats everything simple.” One trial per split and sensitivity to K and window length are real limitations.
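For reference, the 1/N baseline is one line of portfolio math. This sketch computes its annualized Sharpe on synthetic daily returns; the 252-day annualization and the simulated data are assumptions for illustration, not the experiment's data:

```python
import numpy as np

def equal_weight_sharpe(returns: np.ndarray, ann: int = 252) -> float:
    """Annualized Sharpe of a daily-rebalanced 1/N portfolio.
    `returns`: (T, N) matrix of simple daily asset returns."""
    port = returns.mean(axis=1)          # equal weight across N assets each day
    return float(np.sqrt(ann) * port.mean() / port.std(ddof=1))

rng = np.random.default_rng(0)
sr = equal_weight_sharpe(rng.normal(0.0005, 0.01, size=(750, 110)))
```

That this trivial rule out-Sharpes the PPO runs is the honest yardstick for the whole experiment.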


Caveats and takeaway

Recursive utility is one way to encode finance preferences in RL, but it adds sampling overhead, a custom value backup, and critic-only constraints. On my ETF splits it sometimes looked better than a naive discounted objective under PPO, but that does not establish robustness outside this experimental setup.

Bottom line: performance is conditional. If you only need a portfolio policy, start with simple rewards and strong simple baselines. If you want to separate risk aversion from discounting in the objective itself, recursive utility is a useful framework to test; treat these ETF results as one empirical reference point, not a final conclusion.


References

  1. Recursive utility + RL — Chang, M. Portfolio Optimization under Recursive Utility via Reinforcement Learning.

  2. Recursive utility — Epstein, L. G., & Zin, S. E. (1989). Substitution, Risk Aversion, and the Temporal Behavior of Consumption and Asset Returns: A Theoretical Framework. Econometrica, 57(4), 937–969.

  3. Strategic asset allocation — Campbell, J. Y., & Viceira, L. M. (2002). Strategic Asset Allocation: Portfolio Choice for Long-Term Investors. Oxford University Press.

  4. RL fundamentals — Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press.

  5. Actor–critic — Mnih, V., et al. (2016). Asynchronous Methods for Deep Reinforcement Learning. ICML.

  6. PPO — Schulman, J., et al. (2017). Proximal Policy Optimization Algorithms. Preprint.

  7. Deep RL for portfolio management — Liu, X.-Y., et al. (2020). FinRL: A Deep Reinforcement Learning Library for Automated Stock Trading. NeurIPS Workshop on Machine Learning for Engineering Modeling, Simulation, and Design.
