Regret: at time T, the difference between our algorithm's total loss and the best single expert's loss. It expresses how much we regret not following the single best advice.
$$ R^{T} = L_{H}^{T} - \min_{i} L_{i}^{T} $$
An online algorithm H learns without regret if, as T goes to infinity, its average regret goes to zero in the worst case, i.e. no single expert is better than H in the limit.
Average regret is the cumulative regret divided by the number of time steps T.
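In symbols, using the notation from the regret definition above:

$$ \bar{R}^{T} = \frac{R^{T}}{T} = \frac{1}{T}\left(L_{H}^{T} - \min_{i} L_{i}^{T}\right), \qquad \lim_{T \to \infty} \bar{R}^{T} = 0 \ \text{(no regret)}. $$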
Information sets: a player's perspective of the game state (i.e. my cards + community cards; the opponent's cards are hidden, so game states that differ only in the opponent's hand fall into the same information set).
Repeatedly making decisions in an uncertain environment, we receive advice from N experts about a single phenomenon. Our online algorithm H's goal is to distribute trust among the experts.
After the outcome is revealed, a loss vector evaluates each of the N experts.
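A minimal sketch of this protocol in Python. The random losses, the multiplicative-weights trust update, and the learning rate 0.1 are illustrative assumptions, not something the notes prescribe:

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 5, 1000                      # N experts, T rounds (assumed values)

cum_expert_loss = np.zeros(N)       # L_i^T for each expert i
cum_algo_loss = 0.0                 # L_H^T for the algorithm H
weights = np.ones(N)                # H's current trust in each expert

for t in range(T):
    trust = weights / weights.sum()      # distribute trust among experts
    losses = rng.uniform(0, 1, size=N)   # loss vector revealed after the outcome
    cum_algo_loss += trust @ losses      # H pays the trust-weighted loss
    cum_expert_loss += losses
    weights *= np.exp(-0.1 * losses)     # one possible update: multiplicative weights

regret = cum_algo_loss - cum_expert_loss.min()   # R^T = L_H^T - min_i L_i^T
print(regret / T)                                # average regret R^T / T
```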
CFR optimises a quantity called immediate counterfactual regret, which is an upper bound on the average overall regret.
Immediate counterfactual regret is guaranteed to go to zero in the limit if regret matching is applied at each information set.
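Regret matching itself is simple: play each action with probability proportional to its positive cumulative regret, falling back to uniform when no action has positive regret. A sketch (the function name is mine):

```python
import numpy as np

def regret_matching(cum_regret: np.ndarray) -> np.ndarray:
    """Strategy proportional to positive cumulative regrets;
    uniform when no action has positive regret."""
    pos = np.maximum(cum_regret, 0.0)
    total = pos.sum()
    if total > 0:
        return pos / total
    return np.full(len(cum_regret), 1.0 / len(cum_regret))

# e.g. cumulative regrets [2, -1, 1] give the strategy [2/3, 0, 1/3]
print(regret_matching(np.array([2.0, -1.0, 1.0])))
```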
The expected utility of a history h under strategy profile $\sigma$ is

$$ \sum_{h' \in Z} \pi^{\sigma}(h, h')\, u(h') $$

where Z is the set of all terminal nodes and $\pi^{\sigma}(h, h')$ is the probability of reaching $h'$ from $h$ when all players follow $\sigma$.
If this is computed for every possible game state in the information set (i.e. every possible opponent hand), we end up with a vector of utilities, one per hand.
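A toy illustration of the expected-utility recursion above; the tiny game tree, payoffs, and action probabilities below are invented for illustration:

```python
# Toy game tree: each non-terminal history maps to {action: child};
# terminal histories map to None and have a utility in UTIL.
TREE = {
    "": {"bet": "b", "check": "c"},
    "b": {"call": "bc", "fold": "bf"},
    "c": None, "bc": None, "bf": None,
}
UTIL = {"c": 0.0, "bc": 2.0, "bf": 1.0}

# Strategy profile: action probabilities at each non-terminal history
# (whoever acts there); a single flat table for simplicity.
SIGMA = {
    "": {"bet": 0.5, "check": 0.5},
    "b": {"call": 0.7, "fold": 0.3},
}

def expected_utility(h: str) -> float:
    """Sum over terminal h' reachable from h of pi^sigma(h, h') * u(h')."""
    children = TREE[h]
    if children is None:        # h is terminal: pi^sigma(h, h) = 1
        return UTIL[h]
    return sum(SIGMA[h][a] * expected_utility(child)
               for a, child in children.items())

print(expected_utility(""))     # 0.5*0 + 0.5*(0.7*2 + 0.3*1) = 0.85
```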
For counterfactual utility, another weighting scheme is used: instead of the ordinary probability of reaching h, we weight by the probability of reaching h assuming the player was trying to reach it, i.e. counting only the opponent's (and chance's) action probabilities, written $\pi^{\sigma}_{-i}(h)$.
Similarly, instead of using the regular strategy from the strategy profile, each action a is evaluated under the modified profile $\sigma|_{I \to a}$, in which a is always played at the current information set. This distinguishes the reward for playing action a at time t: every action is rewarded via its unnormalized counterfactual utility, computed assuming that action was played.
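Putting the pieces together, a standard way to write this (the symbols $v_i$, $\pi^{\sigma}_{-i}$ and $\sigma|_{I \to a}$ follow common CFR notation rather than these notes):

$$ v_i(\sigma, I) = \sum_{h \in I} \sum_{h' \in Z} \pi^{\sigma}_{-i}(h)\, \pi^{\sigma}(h, h')\, u_i(h'), \qquad r^{t}(I, a) = v_i(\sigma^{t}|_{I \to a}, I) - v_i(\sigma^{t}, I). $$

Accumulating $r^{t}(I, a)$ over time and applying regret matching to it at each information set is what drives the bound above to zero.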