Given that the bottleneck on the embedded device is local non-interactive public-key signature verification, the best industry standard for that is RSA (with a standard signature padding, such as PKCS#1 RSASSA-PSS or PKCS#1 RSASSA-PKCS1-v1_5). RSA is usually significantly faster than ECDSA for signature verification, even with the common $e=65537$; and, for good implementations, it is always faster when using $e=3$, which allows a further speedup by a factor of about $8$. Rabin signature verification is nearly twice as fast as RSA with $e=3$, and is also standard if not common, e.g. it was in ANSI X9.31:1988 and is in ISO/IEC 9796-2:2010.
Note: the fastest of all seems to be Daniel J. Bernstein's "A secure public-key signature system with extremely fast verification" (2000); this is essentially Rabin with an expanded signature allowing extremely fast verification, using an idea he first outlined there.
Both RSA and Rabin are based on arithmetic modulo a public integer $N$ whose factorization is secret. The time for signature verification is dominated by $17$ (RSA, $e=65537$), $2$ (RSA, $e=3$), or just $1$ (Rabin, $e=2$) multiplication(s) modulo $N$, where $N$ has $n$ bits. $n=2048$ is acceptably secure until 2030 according to NIST and French ANSSI.
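The counts $17$, $2$ and $1$ can be checked with a minimal sketch of left-to-right square-and-multiply that tallies modular multiplications. The function name `modexp_count` and the toy modulus are illustrative assumptions, not part of any library, and a real verifier would also check the padding.

```python
def modexp_count(base, e, N):
    """Left-to-right square-and-multiply.

    Returns (base**e mod N, number of modular multiplications used),
    counting each squaring as one multiplication.
    """
    result = base % N
    count = 0
    # bin(e) is '0b1...'; skip the prefix and the leading 1 bit,
    # which is accounted for by initialising result to base.
    for bit in bin(e)[3:]:
        result = (result * result) % N      # squaring
        count += 1
        if bit == "1":
            result = (result * base) % N    # extra multiply for a 1 bit
            count += 1
    return result, count


# Toy modulus for illustration only (far too small to be secure).
N_TOY = 61 * 53
for e in (65537, 3, 2):
    value, mults = modexp_count(5, e, N_TOY)
    print(e, mults)
```

Since $65537 = 2^{16}+1$, the loop performs 16 squarings and one multiplication; $e=3$ needs one squaring and one multiplication; $e=2$ needs a single squaring.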
When appropriately implemented using standard (quadratic) algorithms working on $w$-bit words, the computation time for one multiplication modulo $N$ is dominated by $\approx(n/w)^2$ executions of an elementary operation consisting of
- two multiplications of two $w$-bit words, each giving a $2w$-bit result
- addition with carry of the corresponding two results into temporary values
- three reads of a $w$-bit word
- one write of a $w$-bit word
- on register-starved CPUs only, some read-writes for temporaries
Notoriously, careful optimization of the core loop is essential (assembly language shines!); and using the wrong algorithm will impact speed (in particular: separating modular multiplication from modular reduction increases the memory accesses; Montgomery arithmetic at best does not help).
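To make the $\approx(n/w)^2$ count concrete, here is a sketch of a modular multiplication with reduction interleaved into the multiplication, so that each inner-loop step has the shape of the elementary operation listed above. For simplicity it picks the quotient word the Montgomery way (a FIOS-style loop); that is an assumption made for illustration, not the method used in the implementation discussed here, and as noted Montgomery form is not expected to be any faster. The function name `mont_mul_words` is hypothetical, and `pow(N, -1, m)` requires Python 3.8 or later.

```python
def mont_mul_words(a, b, N, w=32):
    """Interleaved word-level Montgomery multiplication (illustrative).

    Returns (a*b*R^-1 mod N, inner-loop steps), with R = 2**(w*s) and
    s the number of w-bit words in N. Requires N odd and 0 <= a, b < N.
    """
    mask = (1 << w) - 1
    s = (N.bit_length() + w - 1) // w

    def words(x):
        return [(x >> (w * i)) & mask for i in range(s)]

    A, B, M = words(a), words(b), words(N)
    n0_inv = (-pow(N, -1, 1 << w)) & mask   # -N^-1 mod 2^w
    t = [0] * (s + 1)                       # accumulator, always < 2N
    steps = 0
    for i in range(s):
        # Quotient word: makes the low word of t + A[i]*B + q*N zero.
        q = ((t[0] + A[i] * B[0]) * n0_inv) & mask
        carry = 0
        for j in range(s):
            # Elementary operation: two w x w multiplications, reads of
            # B[j], M[j], t[j], and one write (shifted down by one word).
            acc = t[j] + A[i] * B[j] + q * M[j] + carry
            if j > 0:
                t[j - 1] = acc & mask
            carry = acc >> w
            steps += 1
        acc = t[s] + carry
        t[s - 1] = acc & mask
        t[s] = acc >> w
    result = sum(t[i] << (w * i) for i in range(s + 1))
    if result >= N:
        result -= N                         # single final subtraction
    return result, steps
```

For $n=2048$ this performs $64^2=4096$ inner steps with $w=32$ and $256^2=65536$ with $w=8$, which is the quadratic dependence on word size. Python's big integers hide the carry handling that a C or assembly loop must do explicitly.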
Actual execution time can be in seconds on a mere 8-bit CPU (for 2048-bit RSA, $e=3$, an implementation I wrote verifies a signature in $1.25$ s on an 8051 core with 5M cycles/s and 4-cycle multiplication of bytes giving a 16-bit result, and no hardware 16-bit addition).
Execution time decreases about quadratically with the word size, allowing time in milliseconds for a modern 32-bit CPU (the question does not specify which core is used; ARM CPUs tend to be good at this, especially those with UMLAL and UMAAL).
Per eBACS benchmarks, on an ARM Cortex-A8, RSA-2048 ($e=3$) is timed at a median of 555418 cycles (28 ms scaled to 20 MHz), versus 2594303 cycles for one of the fastest elliptic-curve signature systems, Ed25519.
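The scaling to 20 MHz is plain arithmetic on the cycle counts quoted above; a small sketch (the helper name `cycles_to_ms` is an assumption for illustration, and the conversion ignores memory wait states and other effects of running a different core):

```python
def cycles_to_ms(cycles, clock_hz):
    """Convert a cycle count to milliseconds at the given clock rate."""
    return cycles / clock_hz * 1e3


CLOCK_HZ = 20e6  # 20 MHz, the scaling used in the text

rsa_ms = cycles_to_ms(555418, CLOCK_HZ)    # RSA-2048, e=3, verification
ed_ms = cycles_to_ms(2594303, CLOCK_HZ)    # Ed25519 verification
ratio = 2594303 / 555418

print(f"RSA-2048 (e=3): {rsa_ms:.1f} ms")
print(f"Ed25519:        {ed_ms:.1f} ms")
print(f"ratio:          {ratio:.2f}")
```

On these figures RSA-2048 verification with $e=3$ takes a bit under 28 ms, against roughly 130 ms for Ed25519, a factor of about 4.7.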