Let $\sigma_0, \sigma_1, \sigma_2, \dots$ be a sequence in $\{-1,+1\}$ and $T \in \mathbb{N}$ a time horizon.
Consider the following game. At each time step $t$, we are asked whether we want to give an answer $X_t \in \{-1,+1\}$ or to abstain and skip to the next turn. Let $k(t)$ denote the number of answers we have given strictly before time $t$. If at time $t$ we decide to give an answer $X_t \in \{-1,+1\}$, we observe whether $X_t = \sigma_{k(t)}$ or not: if they are equal, the game proceeds; if not, the game stops. To help us determine the signs $\sigma_k$, at each time step $t$ we also observe a random variable $Z_t$ which, conditionally on the history observed so far, is $1$-subgaussian with mean $\sigma_{k(t)} \cdot 2^{-k(t)}$. Intuitively, the absolute values of the means of these random variables shrink geometrically as we give more and more correct answers, but their signs depend on the unknown quantities $\sigma_0, \sigma_1, \sigma_2, \dots$.
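For concreteness, here is a minimal simulation sketch of this protocol (illustrative code of my own, not part of the problem; Gaussian noise plays the role of a generic $1$-subgaussian distribution, and all names are my own choices):

```python
import numpy as np

class SignGuessingGame:
    """Minimal simulator of the protocol: at each turn the player sees a noisy
    observation Z_t with mean sigma_{k(t)} * 2^{-k(t)}, where k(t) is the
    number of correct answers given so far, and may optionally answer."""

    def __init__(self, sigma, T, rng=None):
        self.sigma = sigma    # hidden signs sigma_0, sigma_1, ...
        self.T = T            # time horizon
        self.k = 0            # number of answers given so far (all correct)
        self.t = 0            # current time step
        self.lost = False     # becomes True after a wrong answer
        self.rng = rng if rng is not None else np.random.default_rng()

    def observe(self):
        """Advance one turn and return Z_t."""
        self.t += 1
        mean = self.sigma[self.k] * 2.0 ** (-self.k)
        return mean + self.rng.standard_normal()

    def answer(self, x):
        """Submit x in {-1, +1}. A wrong answer ends the game (loss T)."""
        if x == self.sigma[self.k]:
            self.k += 1
            return True
        self.lost = True
        return False
```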
Our losses are measured as follows. If at some point we guess wrongly, i.e., $X_t \neq \sigma_{k(t)}$, we pay $T$ outright. Otherwise, if $t_1 < \dots < t_K \le T$ are the turns at which we answered before the time horizon $T$, we pay $$ t_1 + \sum_{k=2}^K (t_k-t_{k-1}) \cdot 2^{1-k} + (T-t_K) \cdot 2^{-K} \;.$$ In words, when we do not lose, we pay the cumulative sum of the absolute values of the means of the $1$-subgaussian random variables we have observed.
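For instance (with numbers of my own choosing): with $T = 10$ and correct answers at $t_1 = 2$ and $t_2 = 5$, the loss would be $$ 2 + (5-2)\cdot 2^{-1} + (10-5)\cdot 2^{-2} = 2 + 1.5 + 1.25 = 4.75 \;.$$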
I'm wondering if we can devise a sequential strategy that, on the basis of the observed $Z_1, Z_2, \dots$, pays in expectation only $O(\sqrt{T})$, regardless of how the $\sigma_0, \sigma_1, \sigma_2, \dots$ were chosen.
I tried to devise a strategy based on confidence intervals. Specifically, fix $\delta > 0$. We give answers at fixed times, defined inductively by $$t_k := t_{k-1} + \bigg\lceil c\cdot 4^{k-1} \log\Big(\frac{2}{\delta}\Big) \bigg\rceil,$$ where $t_0 := 0$ and $c$ is a universal constant whose role is explained below. Notice that if the game has gone on up to time $t_k$, then by that time we have seen $n_k := \Big\lceil c \cdot 4^{k-1} \log\Big(\frac{2}{\delta}\Big) \Big\rceil$ realizations $Z_{t_{k-1}+1}, \dots, Z_{t_k}$ of $1$-subgaussian random variables with mean $\sigma_{k-1} \cdot 2^{1-k}$. At time $t_k$ we answer $$X_{t_k} :=\operatorname{sgn} \bigg( \sum_{t=t_{k-1}+1}^{t_k} Z_t \bigg).$$ The reason is that, for a suitable universal choice of $c$, the open $\delta$-confidence interval around the corresponding empirical mean contains $\sigma_{k-1}\cdot 2^{1-k}$ with probability at least $1-\delta$, and its radius is upper bounded by $2^{1-k}$: by the standard subgaussian tail bound, the empirical mean deviates from its expectation by less than $\sqrt{2\log(2/\delta)/n_k} \le \sqrt{2/c}\cdot 2^{1-k}$ with probability at least $1-\delta$, and this is at most $2^{1-k}$ as soon as $c \ge 2$. It follows that, with this strategy, when $\sigma_{k-1} = 1$ the empirical mean is positive with probability at least $1-\delta$, and when $\sigma_{k-1} = -1$ it is negative with probability at least $1-\delta$ (I hope I haven't made any mistakes in the calculations, but the procedure should be sound). By a union bound, we make a mistake with probability at most $\delta \cdot K$, where $K \approx \log_4\left(\frac{T}{c \cdot \log\left(\frac{2}{\delta}\right)}\right)$ is the number of answers given up to time $T$. On the other hand, when we make no mistake, using this expression for $K$, we pay $$ t_1 + \sum_{k=2}^K (t_k-t_{k-1}) \cdot 2^{1-k} + (T-t_K) \cdot 2^{-K} = O \bigg( \sqrt{T \log\Big(\frac{2}{\delta}\Big)}\bigg) \;.$$
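A minimal sketch of this strategy on top of the simulator above (again, illustrative code of my own; I take $c = 2$, the value suggested by the subgaussian bound):

```python
import math

def confidence_interval_strategy(game, delta, c=2.0):
    """Epoch k collects n_k = ceil(c * 4^{k-1} * log(2/delta)) observations,
    then answers with the sign of their sum."""
    k = 0
    while not game.lost and game.t < game.T:
        k += 1
        n_k = math.ceil(c * 4 ** (k - 1) * math.log(2.0 / delta))
        if game.t + n_k > game.T:
            # The next answer time t_k would exceed the horizon:
            # keep observing until T without answering again.
            while game.t < game.T:
                game.observe()
            break
        total = sum(game.observe() for _ in range(n_k))
        game.answer(1 if total >= 0 else -1)
```

A hypothetical run: `game = SignGuessingGame(sigma=np.random.default_rng(0).choice([-1, 1], size=64), T=100_000)` followed by `confidence_interval_strategy(game, delta=1 / math.sqrt(100_000))` (the length of `sigma` just needs to exceed $K \approx \log_4 T$).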
Overall, in expectation, we pay at most $O \bigg( \sqrt{T \log\Big(\frac{2}{\delta}\Big)} + \delta \cdot \log_4\left(\frac{T}{c \cdot \log\left(\frac{2}{\delta}\right)}\right) \cdot T\bigg)$, which, picking $\delta = \frac{1}{\sqrt{T}}$, leads to an expected loss of at most $O(\sqrt{T})$ up to logarithmic factors in $T$.
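(Spelling out the last step: with $\delta = \frac{1}{\sqrt{T}}$ we have $\log\frac{2}{\delta} = O(\log T)$, so the first term is $O\big(\sqrt{T \log T}\big)$, while the second is $\frac{1}{\sqrt{T}} \cdot O(\log T) \cdot T = O\big(\sqrt{T} \log T\big)$; both are $O(\sqrt{T})$ up to logarithmic factors.)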
And here is the question: can we do better and remove these extra logarithmic factors, perhaps with a less naive strategy, to achieve the $O(\sqrt{T})$ rate?