Not on the VC path. Not even on the max-profit path. Just on the "Have fun doing cool research" path.
I was a mod on MJ for its first few years and got to know MJ's founder through discussions there. He already had "enough" money for himself from his prior sale of Leap Motion to do whatever he wanted. And, he decided what he wanted was to do cool research with fun people. So, he started MJ. Now he has far more money than before and what he wants to do with it is to have more fun doing more cool research.
Great for him, but since you mention research and fun: I have to say I'm not aware that MJ has published any research whatsoever.
And on the topic of fun, while it's certainly highly subjective, I remember that moderation in the MJ tool was at one point so strict that you could not generate an image containing a "treasure chest", since they censored the word "chest".
I'm happy that state of the art models are now developed by actors who publish comprehensive technical reports and open-weights.
1. Real-time world models for the "holodeck". They have to be fast, high quality, and inexpensive to serve to lots of users. They started on this two years ago, before "world model" hype was even a thing.
2. Some kind of hardware to support this.
David Holz talks about this on Twitter occasionally.
Midjourney still has incredible revenue. It's still the best looking image model, even if it's hard to prompt, can't edit, and has artifacting. Every generation looks like it came out of a magazine, which is something the other leading commercial models lack.
They have image and video models that are nowhere near SOTA on prompt adherence or image editing but pretty good on the artistic side. They lean into features like reference images (so objects or characters keep a consistent look), biasing the model toward your style preferences, and moodboards for generating a consistent style.
A lot of people started realizing that it didn’t really matter how pretty the resulting image was if it completely failed to adhere to the prompt.
Even something like Flux.1 Dev which can be run entirely locally and was released back in August of 2024 has significantly better prompt understanding.
Yeah, though there is the same issue the other way round: great prompt understanding doesn't matter much when the result has an awfully ugly AI-fake look to it.
That's definitely true, and the medium also really makes a big difference as well (photorealism, digital painting, watercolor, etc.).
Though in some cases, it is a bit easier to fix visual artifacts (using second-pass refiners, Img2Img, ultimate upscale, stylistic LoRAs, etc.) than a fundamental coherency problem.
I was disappointed when Imagen 4 (and therefore also Nano Banana Pro, which clearly uses Imagen 4 internally to some degree) had a significantly stronger tendency to drift from photorealism to AI fake aesthetics than Imagen 3. This suggests there is a tradeoff between prompt following and avoiding slop style. Perhaps this is also part of the reason why Midjourney isn't good at prompt following.
How is it a problem? There simply doesn't seem to be a moat or secret sauce. Who cares which of these models is SOTA? In two months there will be a new model.
Right, but that's a short term moat. If they pause on their incredible levels of spending for even 6 months, someone else will take over having spent only a tiny fraction of what they did. They might get taken over anyway.
By reverse engineering, sheer stupidity from the competition, corporate espionage, "stealing" engineers, and sometimes a stroke of genius; the same as it's always been.
They still have a niche. Their style references feature is their key differentiator now, but I find I can usually just drop some images of an MJ style into Gemini and get it to give me a text prompt that works just as well as MJ srefs.
The pace of commoditization in image generation is wild. Every 3-4 months the SOTA shifts, and last quarter's breakthrough becomes a commodity API.
What's interesting is that the bottleneck is no longer the model — it's the person directing it. Knowing what to ask for and recognizing when the output is good enough matters more than which model you use. Same pattern we're seeing in code generation.
There is a decent chance there will be no clear consensus... Maybe people doing custom LoRAs etc. should publish for the three most common models. Or maybe the tooling will make switching models in a workflow pain-free, as has kind of happened with LLMs.
I'm happy the models are becoming commodity, but we still have a long way to go.
I want the ability to lean into any image and tweak it like clay.
I've been building open source software to orchestrate the frontier editing models (skip to halfway down), but it would be nice if the models were built around the software manipulation workflows: