Most Efficient way to set Register to 1 or (-1) on original 8086

Question

I am taking an assembly course now, and the guy who checks our home assignments is a very pedantic old-school optimization freak. For example he deducts 10% if he sees:

mov ax, 0

instead of:

xor ax,ax

even if it's only used once.

I am not a complete beginner in assembly programming, but I'm not an optimization expert, so I need your help with something (it might be a very stupid question, but I'll ask anyway): if I need to set a register value to 1 or (-1), is it better to use:

mov ax, 1

or do something like:

xor ax,ax
inc ax

I really need a good grade, so I'm trying to get it as optimized as possible. (I need to optimize both time and code size.)

Using the same instruction sequence in **every** context won't give you optimal speed **or** size. For (a crappy) example, what if `cx` is guaranteed to be `1` at the point where you need to set `ax` to `1`? You could just `mov ax, cx`. — L̲̳o̲̳̳n̲̳̳g̲̳̳p̲̳o̲̳̳k̲̳̳e̲̳̳, Jun 29 '10 at 20:59
Related, for modern x86-64: [Set all bits in CPU register to 1 efficiently](https://stackoverflow.com/q/45105164). But 8086 / 8088 was almost entirely about minimizing memory access, including code-fetch, since that was slower than processing time for most instructions. — Peter Cordes, Dec 24 '22 at 02:33

paxdiablo · Accepted Answer · 2022-03-04T20:22:36.340

15

A quick Google search for "8086 instruction timings size" turned up a listing of instruction timings that appears to cover the timings and sizes for the 8086/8088 through the Pentium.

Note, however, that this listing probably doesn't account for code-fetch memory bottlenecks, which can be very significant, especially on an 8088. This usually makes optimizing for code size the better choice. See here for some details on this.

No doubt you could find official Intel documentation on the web with similar information, such as the "8086/8088 User's Manual: Programmer's and Hardware Reference".

For your specific question, the table below gives a comparison that indicates the plain mov ax, 1 is better (fewer cycles, and the same space):

Instructions	Clock cycles	Bytes
xor ax, ax	3	2
inc ax	3	1
(xor + inc total)	6	3
mov ax, 1	4	3
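To make the byte counts concrete, here is a minimal sketch of both sequences with their machine-code encodings, plus the matching pair for -1. It assumes NASM syntax and a .COM-style layout; the labels and the file name demo.asm are hypothetical. The cycle counts in the comments are simply the ones quoted in the table, not new measurements.

```asm
; Assemble with (file name is hypothetical): nasm -f bin demo.asm -o demo.com
        cpu 8086
        bits 16
        org 0x100               ; .COM layout, assumed for illustration only

set_one_mov:
        mov ax, 1               ; B8 01 00   3 bytes, 4 cycles, flags untouched

set_one_xor_inc:
        xor ax, ax              ; 31 C0      2 bytes, 3 cycles, rewrites flags
        inc ax                  ; 40         1 byte,  3 cycles (3 bytes, 6 cycles in total)

set_minus_one_mov:
        mov ax, -1              ; B8 FF FF   3 bytes, same form as mov ax, 1

set_minus_one_xor_dec:
        xor ax, ax              ; 31 C0      2 bytes
        dec ax                  ; 48         1 byte, mirrors the xor/inc pair

        ret                     ; AX holds 0FFFFh (-1) here
```

Both variants come to 3 bytes, so on these figures the only difference is the extra instruction and the flag changes that the xor-based pair brings with it.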

But you might want to talk to your educational institute about this guy. A 10% penalty for a simple thing like that seems quite harsh. You should ask what should be done in the case where you have two possibilities, one faster and one shorter.

Then, once they've admitted that there are different ways to optimise code depending on what you're trying to achieve, tell them that what you're trying to do is optimise for readability and maintainability, and seriously couldn't give a damn about a wasted cycle or byte here or there⁽¹⁾.

Optimisation is something you generally do if and when you have a performance problem, after a piece of code is in a near-complete state - it's almost always wasted effort when the code is still subject to a not-insignificant likelihood of change.

For what it's worth, sub ax,ax appears to be on par with xor ax,ax in terms of clock cycles and size, so maybe you could throw that into the mix next time to cause him some more work.

⁽¹⁾ No, don't really do that, but it's fun to vent occasionally :-)

edited Mar 04 '22 at 20:22

answered May 13 '10 at 13:14

paxdiablo


@Bob, sorry mate, I made a mistake in leaving out the cost on the `inc ax` - it turns out the `mov ax,1` is actually short and faster (and more readable). – paxdiablo May 13 '10 at 13:39
Our professor said something like: "I know that in most cases these optimizations are irrelevant and insignificant but you guys should know about them because someday you just might need to do one." and also something like "In my time you could really see the difference in performance" – Bob May 13 '10 at 13:43
@Bob: That would make sense if you were developing your own compiler; I don't believe you would think about it when solving other tasks. Compilers often do this kind of optimization automatically. – YasirA May 13 '10 at 13:49
`sub ax,ax` and `xor ax,ax` might seem similar, but modern processors know about `xor` not having a real dependency on `ax` value; it is not so certain with `sub`. – liori May 13 '10 at 14:26
@liori, that was specifically for the 8086, I don't know if it had all that you-beaut stuff. But it seems to me that the dependencies and effects for xor ax,ax and sub ax,ax are exactly the same, as would be xor ax,N and sub ax,N where N is any type of object. – paxdiablo May 13 '10 at 21:56
For instruction timings on modern CPUs, see http://agner.org/optimize/. `xor` for zeroing a register instead of `mov reg, 0` has NO downsides (other than clearing the flags), so I think it's perfectly reasonable to know that idiom. You'll need to know it to read compiler output, or anyone else's code. `xor/inc` is slower than `mov reg, 1`, though, even though the `mov` takes more code size (for 32/64bit code). Both ways begin a new dependency chain, so can happen in parallel with other instructions, but xor/inc takes 2 Intel uops or 2 AMD macro-ops, while move-immediate only takes 1. – Peter Cordes Sep 19 '15 at 19:49
Note that the 8086 instruction timing tables published by Intel(?) don't include fetch bandwidth, i.e. they assume the machine code bytes are already in the prefetch buffer. See [Why is LOOP faster than DEC,JNZ on 8086?](https://stackoverflow.com/q/71117163) and [Increasing Efficiency of binary -> gray code for 8086](https://stackoverflow.com/q/67400133) for some examples of using the normal rule of thumb that performance = count memory accesses (including code fetch), multiply by 4. In this case they're the same size, so the slower xor/inc might not be a real bottleneck on 8086. – Peter Cordes Mar 04 '22 at 10:47
Also http://8086.tk/ is down; other sources for the info include https://www2.math.uni-wuppertal.de/~fpf/Uebungen/GdR-SS02/opcode_i.html and the "8086/8088 User's Manual: Programmer's and Hardware Reference" (according to njuffa's answer; I didn't look for a PDF copy.) – Peter Cordes Mar 04 '22 at 10:49

score 3 · Answer 2 · answered Jun 29 '10 at 20:41

You're better off with

mov AX,1

on the 8086. If you're tracking register contents, you can possibly do better if you know that, for example, BX already has a 1 in it:

mov AX,BX

or if you know that AH is 0:

mov AL,1

etc.
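As a rough illustration of why tracking register contents can pay off, the sketch below lists the encodings of the three forms mentioned above (NASM syntax assumed). The "known" register values are assumptions about the surrounding code, not something these instructions check.

```asm
        cpu 8086
        bits 16

; Case 1: BX is already known to hold 1 (assumed).
        mov ax, bx              ; 89 D8      2 bytes, register-to-register copy

; Case 2: AH is already known to be 0 (assumed), so only AL needs writing.
        mov al, 1               ; B0 01      2 bytes, AH is left as it was

; Nothing known about other registers: the general form.
        mov ax, 1               ; B8 01 00   3 bytes
```

The two special cases each save one byte of code fetch over the general form, which is the kind of saving that tends to matter on the 8086/8088.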

answered Jun 29 '10 at 20:41

Walter Bright


score 2 · Answer 3 · answered May 13 '10 at 13:18

Depending upon your circumstances, you may be able to get away with ...

 sbb ax, ax

The result will either be 0 if the carry flag is not set or -1 if the carry flag is set.
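A small sketch of how that might look in practice, in NASM syntax. The first pair forces the carry flag itself; the second assumes an earlier unsigned compare of BX and CX, which is purely a hypothetical context.

```asm
        cpu 8086
        bits 16

; Force -1 regardless of earlier flags: set CF, then subtract with borrow.
        stc                     ; F9      CF = 1
        sbb ax, ax              ; 19 C0   AX = AX - AX - CF = 0FFFFh (-1)

; Reuse a carry that earlier code already produced (hypothetical context).
        cmp bx, cx              ; CF = 1 when BX < CX (unsigned)
        sbb ax, ax              ; AX = -1 if BX < CX, otherwise 0
        neg ax                  ; optional: turns -1 into 1, leaves 0 as 0
```

With the explicit stc the pair is 3 bytes, the same as mov ax, -1, so the trick only saves anything when the carry flag is already in the state you need.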

However, if the above example is not applicable to your situation, I would recommend the

xor  ax, ax
inc  ax

method. It should satisfy your professor for size. However, if your processor employs any pipelining, I would expect there to be some coupling-like delay between the two instructions (I could very well be wrong on that). If such a coupling exists, the speed could be improved slightly by reordering your instructions to place another instruction between them (one that does not use ax).

Hope this helps.

answered May 13 '10 at 13:18

Sparky


`sbb` is a nice feature, possibly with a preceding `stc` (set carry). – Joop Eggen Mar 04 '22 at 11:21
`sbb` is useful if the previous instructions left CF set. But `mov ax, 1` is strictly better than `xor ax,ax`/`inc ax`; same code size, fewer instructions/uops, fewer cycles on 8086 and all later CPUs. (And not better for partial-register reasons on P6 family; both end by writing a 16-bit register.) The situation is different in 32-bit mode, where the lack of a `mov r/m32, sign_extended_imm8` is a problem, unlike in 16-bit mode where needing a full imm16 is balanced by avoiding a ModRM, so xor/inc saves code size in 32-bit mode (at the cost of instructions). – Peter Cordes Dec 24 '22 at 02:46

score 2 · Answer 4 · answered May 13 '10 at 19:31

I would use mov [e]ax, 1 under any circumstances. Its encoding is no longer than the hackier xor sequence, and I'm pretty sure it's faster just about anywhere. 8086 is just weird enough to be the exception, and as that thing is so slow, a micro-optimization like this would make the most difference there. But anywhere else, executing 2 "easy" instructions will always be slower than executing 1, especially if you consider data hazards and long pipelines. You're trying to read a register in the very next instruction after you modify it, so unless your CPU can bypass the result from stage N of the pipeline (where the xor is executing) to stage N-1 (where the inc is trying to load the register, never mind adding 1 to its value), you're going to have stalls.

Other things to consider: instruction fetch bandwidth (moot for 16-bit code, both are 3 bytes); mov avoids changing flags (more likely to be useful than forcing them all to zero); depending on what values other registers might hold, you could perhaps do lea ax,[bx+1] (also 3 bytes, even in 32-bit code, no effect on flags); as others have said, sbb ax,ax could work too in some circumstances - it's also shorter at 2 bytes.
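For reference, the 16-bit encodings of those two alternatives can be sketched as follows (NASM syntax assumed). The register and flag states named in the comments are assumptions about the surrounding code, chosen only for illustration.

```asm
        cpu 8086
        bits 16

; Assumption: BX is known to be 0 at this point.
        lea ax, [bx+1]          ; 8D 47 01   3 bytes, AX = BX + 1 = 1, flags unchanged

; Assumption: CF was left set by an earlier instruction.
        sbb ax, ax              ; 19 C0      2 bytes, AX = -1, flags are rewritten
```

The lea form is no smaller than mov ax, 1 in 16-bit code, so its appeal here is mainly that it computes from a live register without touching flags.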

When faced with these sorts of micro-optimizations you really should measure the alternatives instead of blindly relying even on processor manuals.

P.S. New homework: is xor bx,bx any faster than xor bx,cx (on any processor)?

To your PS question: Yes it is. On modern processors, a xor instruction on two identical registers gets special treatment by the CPU, so it has no false dependency on the previous value of the register. That is faster and reduces the number of internal registers the CPU needs to use. Some processors do not have this check for the sub instruction, so xor is preferable here. — fuz, Jul 31 '13 at 13:12
@Berd: `xor bx,bx` is 16-bit operand size, leaving the upper bytes of EBX unmodified. It's dependency-breaking on Intel P6-family, though, which renames partial-registers aggressively. But on Sandybridge-family, 16-bit `xor`-zeroing isn't special. But `xor ebx,ebx` [has many advantages](https://stackoverflow.com/questions/33666617/what-is-the-best-way-to-set-a-register-to-zero-in-x86-assembly-xor-mov-or-and), e.g. not even needing an execution unit on Sandybridge, so it has higher throughput (4 per clock) than `xor ebx,ecx` (3 per clock not counting the dependency chain through EBX). — Peter Cordes, Mar 29 '18 at 14:46
