The first 32 bytes of the XSalsa20 output are used as the key for the one-time MAC Poly1305. Poly1305 needs a fresh 32-byte key for each message, and taking it from the start of the keystream is a natural way to obtain one.
Requiring those zero bytes makes the API easier to implement. The implementer only needs to call XSalsa20 once on the zero-padded input buffer, which both produces the Poly1305 key and encrypts the message.
Without the padded input, the single XSalsa20 call has to be split into several calls: one to HSalsa20, two to Salsa20, plus a bit of copying. This is a bit annoying and incurs a minor performance hit.
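To make the trade-off concrete, here is a rough Python sketch of both approaches. It is only an illustration: the keystream comes from SHAKE-256 as a stand-in, not from XSalsa20, and the function names are made up for this example. It also does not model Salsa20's 64-byte blocks, which is where the extra Salsa20 call comes from in a real implementation.

```python
import hashlib

KEY_LEN = 32  # keystream bytes reserved for the one-time MAC key


def keystream(key: bytes, nonce: bytes, n: int) -> bytes:
    # Stand-in keystream generator (SHAKE-256). NOT XSalsa20; illustration only.
    return hashlib.shake_256(key + nonce).digest(n)


def stream_xor(key: bytes, nonce: bytes, data: bytes) -> bytes:
    # XOR the data with the keystream, as any stream cipher does.
    ks = keystream(key, nonce, len(data))
    return bytes(a ^ b for a, b in zip(data, ks))


def encrypt_padded(key: bytes, nonce: bytes, padded: bytes):
    # Caller supplies 32 zero bytes followed by the message.
    if padded[:KEY_LEN] != bytes(KEY_LEN):
        raise ValueError("input must start with 32 zero bytes")
    out = stream_xor(key, nonce, padded)
    # Zero XOR keystream = keystream, so the first 32 output bytes are the MAC key.
    return out[:KEY_LEN], out[KEY_LEN:]


def encrypt_unpadded(key: bytes, nonce: bytes, msg: bytes):
    # Same result without padding: generate the keystream and split it by hand.
    ks = keystream(key, nonce, KEY_LEN + len(msg))
    mac_key = ks[:KEY_LEN]
    ciphertext = bytes(a ^ b for a, b in zip(msg, ks[KEY_LEN:]))
    return mac_key, ciphertext
```

Both functions return the same `(mac_key, ciphertext)` pair for the same key, nonce and message. The padded variant gets there with a single call to the stream cipher, while the unpadded one has to handle the first 32 keystream bytes separately.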
IMO requiring zero padding is bad API design. This is an implementation detail that shouldn't be exposed to the user. APIs should be designed with the consumer in mind, not the implementer.
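Hiding the padding behind a thin wrapper is cheap. The sketch below assumes a padded primitive with NaCl's `crypto_secretbox` layout (32 zero bytes in front of the plaintext, 16 zero bytes in front of the authenticated output). The names `seal` and `unseal` and the callable parameters are hypothetical, chosen only for this example.

```python
from typing import Callable

ZEROBYTES = 32     # zero bytes required in front of the plaintext
BOXZEROBYTES = 16  # zero bytes at the front of the padded output

# Hypothetical signature of a padded primitive: (data, nonce, key) -> bytes
PaddedFn = Callable[[bytes, bytes, bytes], bytes]


def seal(padded_box: PaddedFn, msg: bytes, nonce: bytes, key: bytes) -> bytes:
    # Add the padding the primitive expects, then strip the output padding.
    out = padded_box(bytes(ZEROBYTES) + msg, nonce, key)
    return out[BOXZEROBYTES:]  # 16-byte tag followed by the ciphertext


def unseal(padded_open: PaddedFn, boxed: bytes, nonce: bytes, key: bytes) -> bytes:
    # Restore the output padding, then strip the zero bytes from the plaintext.
    out = padded_open(bytes(BOXZEROBYTES) + boxed, nonce, key)
    return out[ZEROBYTES:]
```

With a wrapper like this the caller only ever sees the message and the tag plus ciphertext. The cost is one extra buffer copy per call.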