$\newcommand{\ket}[1]{|#1\rangle}\newcommand{\bra}[1]{\langle#1|}$
We can construct a block-encoding with the optimal scale factor $\alpha = 2^n-1$, using only a single ancilla qubit, as follows.
Let $CR_y(\theta,k)$ denote a $Y$-rotation of the ancilla, controlled on the main $n$ qubits being in the computational basis state $\ket{k}$. In other words,
\begin{align*}
CR_y(\theta,k) \ket{k}\ket0 &= \cos(\theta) \ket{k}\ket{0} - \sin(\theta)\ket{k}\ket{1}\\
CR_y(\theta,k) \ket{k}\ket1 &= \sin(\theta) \ket{k}\ket{0} + \cos(\theta) \ket{k}\ket{1},
\end{align*}
and $CR_y(\theta,k)$ acts as the identity on all other computational basis states.
The desired block-encoding is now obtained via the gate sequence $$W_A = \prod_{k=0}^{2^n-1} CR_y\left(\theta_k,k\right),$$ where we define $$\theta_k = \arccos\left(\frac{k}{2^n-1}\right)$$ for each integer $0\le k \le 2^n-1$.
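Since every factor with $k \neq s$ acts as the identity on $\ket s \ket 0$, and $\cos(\theta_s) = s/(2^n-1)$ by definition, a generic block-encoded matrix element is given by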
\begin{align*}
\bra r\bra 0 W_A \ket s \ket 0 &= \bra r\bra 0 \left(\prod_{k=0}^{2^n-1} CR_y\left(\theta_k,k\right) \right)\ket s \ket 0\\
& = \bra r\bra 0 CR_y(\theta_s,s) \ket s \ket 0\\
& = \left(\frac s{2^n-1}\right) \delta_{rs},
\end{align*}
as desired.
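As a small worked illustration, take $n=2$ and assume (as the matrix elements above indicate) that $A = \mathrm{diag}(0,1,2,3)$. Then $\alpha = 3$, $\theta_k = \arccos(k/3)$ and $\sin(\theta_k) = \sqrt{1-k^2/9}$, so the definition of $CR_y$ gives
\begin{align*}
W_A \ket{0}\ket{0} &= -\ket{0}\ket{1},\\
W_A \ket{1}\ket{0} &= \tfrac{1}{3}\ket{1}\ket{0} - \tfrac{2\sqrt{2}}{3}\ket{1}\ket{1},\\
W_A \ket{2}\ket{0} &= \tfrac{2}{3}\ket{2}\ket{0} - \tfrac{\sqrt{5}}{3}\ket{2}\ket{1},\\
W_A \ket{3}\ket{0} &= \ket{3}\ket{0}.
\end{align*}
Projecting the ancilla onto $\ket 0$ leaves $\mathrm{diag}(0,\tfrac13,\tfrac23,1) = A/3$.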
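The operator $W_A$ is in fact a qubitized block-encoding of $A$ with the optimal scale factor: with respect to the ancilla basis $\{\ket 0, \ket 1\}$, and with the sign convention for $CR_y$ used above, it takes the block form $$W_A = \begin{pmatrix} A/\alpha & \sqrt{1 - (A/\alpha)^2} \\ -\sqrt{1-(A/\alpha)^2} & A/\alpha \end{pmatrix}.$$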
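The construction can be sanity-checked numerically. The following is a minimal NumPy sketch, not a circuit implementation: it assumes $A = \mathrm{diag}(0,1,\dots,2^n-1)$, orders the basis as $\ket{k}\ket{a}$ (index $2k+a$), and uses the hypothetical helper names `controlled_ry` and `block_encoding`, which are introduced here only for illustration.

```python
import numpy as np

def controlled_ry(theta, k, n):
    """Dense matrix of CR_y(theta, k) on n main qubits plus one ancilla.

    Basis ordering (an assumption of this sketch): |k>|a> has index 2*k + a.
    """
    dim = 2 ** n
    U = np.eye(2 * dim)
    c, s = np.cos(theta), np.sin(theta)
    # Columns are the images of |k>|0> and |k>|1>:
    #   |k>|0> -> c|k>|0> - s|k>|1>,   |k>|1> -> s|k>|0> + c|k>|1>
    U[2 * k : 2 * k + 2, 2 * k : 2 * k + 2] = np.array([[c, s], [-s, c]])
    return U

def block_encoding(n):
    """Multiply all 2^n controlled rotations; returns (W_A, alpha)."""
    dim = 2 ** n
    alpha = dim - 1
    W = np.eye(2 * dim)
    for k in range(dim):
        theta_k = np.arccos(k / alpha)
        W = controlled_ry(theta_k, k, n) @ W
    return W, alpha

n = 3
W, alpha = block_encoding(n)

# Rows and columns with even index correspond to the ancilla being in |0>.
top_left = W[0::2, 0::2]
A = np.diag(np.arange(2 ** n))

assert np.allclose(top_left, A / alpha)             # encodes A / alpha
assert np.allclose(W.T @ W, np.eye(2 ** (n + 1)))   # W_A is unitary (real orthogonal)
```

The loop over `k` makes the exponential cost explicit: one controlled rotation is applied per computational basis state of the main register.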
The downside of this approach is that it requires $\mathcal O(2^n)$ controlled $Y$-rotations...