What known state-of-the-art techniques might ChatGPT-4o, Claude 3, or similar systems be using to understand both text and image data? I noticed that ChatGPT-4o recognizes text in images well. Might it be using an external OCR tool, or has it learned to recognize characters neurally? What training sets and methods make sense for training such text+image systems?
For some image-related tasks, such as being asked to color a given image, ChatGPT-4o seems to use a less interesting approach: it appears to generate a program and run it on the image, so the output image is not produced by a neural network.
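
To make clear what I mean by "generates a program and runs it", here is a minimal sketch of the kind of script I have in mind. It is my own illustration, not the code ChatGPT-4o actually produced: the function name `tint_grayscale`, the nested-list image representation, and the sepia-like tint are all hypothetical. The point is only that the coloring is a fixed per-pixel rule with no neural network involved.

```python
def tint_grayscale(pixels, tint):
    """Map each grayscale value (0-255) to an RGB triple scaled toward `tint`.

    `pixels` is a list of rows of grayscale ints; `tint` is the RGB colour
    that a pure-white pixel should become. (Hypothetical helper.)
    """
    colored = []
    for row in pixels:
        # Scale every channel of the tint by the pixel's brightness.
        colored.append(
            [tuple(channel * gray // 255 for channel in tint) for gray in row]
        )
    return colored


# A tiny 2x3 "image" standing in for the uploaded picture (assumed data).
gray_image = [[0, 128, 255], [64, 192, 32]]
sepia = (255, 240, 192)

colored = tint_grayscale(gray_image, sepia)
for row in colored:
    print(row)
```

A real generated script would presumably load the uploaded file with an imaging library instead of using a hard-coded list, but the structure would be the same: deterministic pixel arithmetic rather than a learned image generator.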