
I have a "fantasy language" (a conlang) with a very simple pronunciation system. Every letter represents exactly one sound, unlike English, where the same sound can have different spellings ("here" and "hear", for example). In addition, two-syllable words are always stressed on the second syllable, so stress is consistent when pronouncing.

My thinking is that I could, in theory, write out a sentence such as:

suq stretxi, reC hagi slur yas xica

And then an AI bot could "read" those letters and pronounce the sounds. The question is, how would I wire this up?

I know I can't just use one sound clip per letter, because neighboring letters blend into each other in subtle ways.

But is it possible to create a dataset of snippets of sounds somehow, and plug them into an existing AI voice generator or something like that, and it would just work? Ideally this would be possible in pure JavaScript so I could run it dynamically in the browser.

If it's not that simple, what would I need to do exactly to make this a reality?

I saw the Pink Trombone voice synthesizer (and its source code), but I think that is probably too low-level to be practical. Instead, I am imagining doing it the way Siri seems to have been created: splicing actual speech (probably my own) into labelled sound clips, and keeping a database of clips to weave together dynamically. I'm just not sure what exactly I would need to create to make that happen, or how much work it would be.

I don't need to know how to do it down to the code. I'm just looking for a high-level overview and some details, such as how many recordings I might need or which technique would be best.

I plugged the question into ChatGPT to get some of the basics answered, so I'm mainly looking for details on:

  • What should I record of myself speaking, and how much?
  • How should I label the recordings?
  • How can machine learning be used to automate the splicing of the recordings?
  • What would I then need to do (at a high level) to plug the labelled recordings into a TTS engine, or what would it take to write my own?
Lance Pollard

1 Answer


Given the regularity of the language, I suggest you define a mapping from its letters to the International Phonetic Alphabet (IPA). Then you can just convert your text to IPA and give that as input to a text-to-speech engine that accepts IPA.
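As a rough illustration of that conversion, here is a minimal Python sketch. The question does not say which sound each letter stands for, so the `LETTER_TO_IPA` table below is entirely made up, and the names `word_to_ipa` and `text_to_ipa` are hypothetical. The sketch also assumes that every vowel letter forms its own syllable, and it uses a crude rule for where the second syllable starts. The same table-lookup approach should port directly to JavaScript for use in the browser.

```python
# Hypothetical letter-to-IPA table. The conlang's real sound values are not
# given in the question, so every entry here is a placeholder assumption.
LETTER_TO_IPA = {
    "a": "a", "e": "e", "i": "i", "o": "o", "u": "u",
    "s": "s", "q": "q", "t": "t", "r": "r", "h": "h",
    "g": "g", "l": "l", "y": "j", "c": "k",
    "x": "ʃ", "C": "tʃ",
}
VOWELS = set("aeiou")  # assumption: one vowel letter = one syllable
STRESS = "ˈ"           # IPA primary-stress mark, written before the stressed syllable


def word_to_ipa(word):
    """Map each letter to IPA; stress the second syllable of two-syllable words."""
    # Unknown letters pass through unchanged rather than raising an error.
    symbols = [LETTER_TO_IPA.get(ch, ch) for ch in word]
    vowel_positions = [i for i, ch in enumerate(word) if ch in VOWELS]
    if len(vowel_positions) == 2:
        stress_at = vowel_positions[1]
        # Crude syllabification: if a consonant sits directly before the
        # second vowel, treat it as the onset of the stressed syllable.
        if stress_at - 1 > vowel_positions[0]:
            stress_at -= 1
        symbols.insert(stress_at, STRESS)
    return "".join(symbols)


def text_to_ipa(text):
    """Convert a whole sentence, dropping simple punctuation."""
    words = [w.strip(",.!?") for w in text.split()]
    return " ".join(word_to_ipa(w) for w in words if w)


print(text_to_ipa("suq stretxi, reC hagi slur yas xica"))
```

With the real letter values filled in, the resulting string is what you would hand to the speech engine. Because the language is regular, this table plus the stress rule is the whole "pronunciation dictionary".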

There are open-source text-to-speech engines that work with IPA, like eSpeak. You can try it online here.
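One practical caveat, as far as I know: eSpeak NG's command line reads phonemes in its own ASCII mnemonic notation, enclosed in double square brackets, rather than as raw Unicode IPA characters, so the IPA string would need one more translation step into those mnemonics. The sketch below only builds the command and does not run it. The helper name `espeak_command` is hypothetical, and the phoneme string `h@l'oU` is just an illustrative English example, not a word from the conlang.

```python
import shlex


def espeak_command(phonemes, voice="en", wav_path="out.wav"):
    """Build (but do not run) an espeak-ng command that speaks a phoneme string.

    -v selects the voice, -w writes the audio to a WAV file, and the
    [[...]] wrapper marks the argument as phoneme mnemonics, not plain text.
    """
    return ["espeak-ng", "-v", voice, "-w", wav_path, f"[[{phonemes}]]"]


cmd = espeak_command("h@l'oU")
# Show a shell-safe version of the command for copy and paste.
print(shlex.join(cmd))
```

To actually produce audio, the list can be passed to `subprocess.run(cmd, check=True)` on a machine where `espeak-ng` is installed. The choice of voice matters, since it determines which phoneme set is available.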

noe