In this analytical study we derive the optimal unbiased value estimator, the
minimum variance unbiased estimator (MVU), and compare its statistical risk to
three well-known value estimators: Temporal
Difference learning (TD), Monte Carlo estimation (MC) and Least-Squares
Temporal Difference learning (LSTD). We demonstrate that LSTD is equivalent to
the MVU if the Markov Reward Process (MRP) is acyclic, and show that the two
differ for most cyclic MRPs, since LSTD is then typically biased. More
generally, we show that estimators that fulfill the Bellman equation can be
unbiased only for special cyclic MRPs.