Couple of thoughts:<p>1. I’d wager that given their previous release history, this will be open‑weight within 3-4 weeks.<p>2. It looks like they’re following suit with other models like Z-Image Turbo (6B parameters) and Flux.2 Klein (9B parameters), aiming to release models that can run on much more modest GPUs. For reference, the original Qwen-Image is a 20B-parameter model.<p>3. This is a unified model (both image generation and editing), so there’s no need to keep separate Qwen-Image and Qwen-Edit models around.<p>4. The original Qwen-Image scored the highest among local models for image editing in my GenAI Showdown (6 out of 12 points), and it also ranked very highly for image generation (4 out of 12 points).<p><i>Generative Comparisons of Local Models:</i><p><a href="https://genai-showdown.specr.net/?models=fd,hd,kd,qi,f2d,zt" rel="nofollow">https://genai-showdown.specr.net/?models=fd,hd,kd,qi,f2d,zt</a><p><i>Editing Comparison of Local Models:</i><p><a href="https://genai-showdown.specr.net/image-editing?models=kxd,og2,qe,f2d" rel="nofollow">https://genai-showdown.specr.net/image-editing?models=kxd,og...</a><p>I'll probably be waiting until the local version drops before adding Qwen-Image-2 to the site.
I've seen many comments describing the "horse riding man" example as extremely bizarre (which it is), so I'd like to provide some background context. The "horse riding man" is a Chinese internet meme originating from an entertainment awards ceremony, where the renowned host Tsai Kang-yong wore an elaborate outfit featuring a horse riding on his back[1]. At the time, he was embroiled in a rumor about an unpublicized homosexual partner whose name sounded like "Ma Qi Ren", which coincidentally translates to "horse riding man" in Mandarin. The incident spread widely across the Chinese internet and turned into a meme. So using "horse riding man" as an example isn't entirely nonsensical, though the image itself is undeniably bizarre and carries an unsettling vibe.<p>[1] A photo of the outfit: <a href="https://share.google/mHJbchlsTNJ771yBa" rel="nofollow">https://share.google/mHJbchlsTNJ771yBa</a>
Interesting background! Prompts like this also test the latent space of the image generator - it’s usually the other way round, so if you see a man on top of a horse, you’ve got a less sophisticated embedding feeding the model. In this case, though, that’s quite an image to put out to the interwebs. I looked to see what gender the horse was.<p>EDIT: After reading the prompt translation, this was more just like a “year of the horse is going to nail white engineers in glorious rendered detail” sort of prompt. I don’t know how SD1.5 would have rendered it, and I think I’ll skip finding out
On the topic of modern Chinese culture, is there the same hostility towards AI-generated imagery in China as there seems to be in America?<p>For example, I think there would be a lot of businesses in the US that would be too afraid of backlash to use AI-generated imagery for an itinerary like the one at <a href="https://qianwen-res.oss-accelerate-overseas.aliyuncs.com/Qwen-Image/image2/3.png" rel="nofollow">https://qianwen-res.oss-accelerate-overseas.aliyuncs.com/Qwe...</a>
Since China has a population of 1.4 billion people with vastly differing perspectives, I find it difficult to claim I can summarize "modern Chinese culture". But within my range of observation, no. Chinese people not only have no hostility toward AI but actively pursue and revere it with fervor. They widely perceive AI as an advanced force, a new opportunity for everyone, a new avenue for making money, and a new chance to surpass others. At most, some consumers might associate businesses using AI-generated content with a budget-conscious brand image, but that isn't hostility.
There's also the "horse riding astronaut" challenge in image generation: <a href="https://garymarcus.substack.com/p/horse-rides-astronaut-redux" rel="nofollow">https://garymarcus.substack.com/p/horse-rides-astronaut-redu...</a>
Why not ask simply for a man, or even a Han man, given the ethnicity of Tsai Kang-yong? Why a white man, and why a man wearing medieval clothing? Give your head a wobble.
The "horse riding man" prompt is wild:<p>"""A desolate grassland stretches into the distance, its ground dry and cracked. Fine dust is kicked up by vigorous activity, forming a faint grayish-brown mist in the low sky. Mid-ground, eye-level composition: A muscular, robust adult brown horse stands proudly, its forelegs heavily pressing between the shoulder blades and spine of a reclining man. Its hind legs are taut, its neck held high, its mane flying against the wind, its nostrils flared, and its eyes sharp and focused, exuding a primal sense of power. The subdued man is a white male, 30-40 years old, his face covered in dust and sweat, his short, messy dark brown hair plastered to his forehead, his thick beard slightly damp; he wears a badly worn, grey-green medieval-style robe, the fabric torn and stained with mud in several places, a thick hemp rope tied around his waist, and scratched ankle-high leather boots; his body is in a push-up position—his palms are pressed hard against the cracked, dry earth, his knuckles white, the veins in his arms bulging, his legs stretched straight back and taut, his toes digging into the ground, his entire torso trembling slightly from the weight. The background is a range of undulating grey-blue mountains, their outlines stark, their peaks hidden beneath a low-hanging, leaden-grey, cloudy sky. The thick clouds diffuse a soft, diffused light, which pours down naturally from the left front at a 45-degree angle, casting clear and voluminous shadows on the horse's belly, the back of the man's hands, and the cracked ground. The overall color scheme is strictly controlled within the earth tones: the horsehair is warm brown, the robe is a gradient of gray-green-brown, the soil is a mixture of ochre, dry yellow earth, and charcoal gray, the dust is light brownish-gray, and the sky is a transition from matte lead gray to cool gray with a faint glow at the bottom of the clouds. The image has a realistic, high-definition photographic quality, with extremely fine textures—you can see the sweat on the horse's neck, the wear and tear on the robe's warp and weft threads, the skin pores and stubble, the edges of the cracked soil, and the dust particles. The atmosphere is tense, primitive, and full of suffocating tension from a struggle of biological forces."""
It's crazy to think there was a fleeting sliver of time during which Midjourney felt like the pinnacle of image generation.
The pace of commoditization in image generation is wild. Every 3-4 months the SOTA shifts, and last quarter's breakthrough becomes a commodity API.<p>What's interesting is that the bottleneck is no longer the model — it's the person directing it. Knowing what to ask for and recognizing when the output is good enough matters more than which model you use. Same pattern we're seeing in code generation.
Whatever happened to Midjourney?
They still have a niche. Their style references feature is their key differentiator now, but I find I can usually just drop some images of an MJ style into Gemini and get it to give me a text prompt that works just as well as MJ srefs.
No external funding raised. They're not on the VC path, so no need to chase insane growth. They still have around 500M USD in ARR.<p>In my (very personal) opinion, they're part of a very small group of organizations that sell inference under a sane and successful business model.
A lot of people started realizing that it didn’t really matter how pretty the resulting image was if it completely failed to adhere to the prompt.<p>Even something like Flux.1 Dev which can be run entirely locally and was released back in August of 2024 has significantly better prompt understanding.
They have image and video models that are nowhere near SOTA on prompt adherence or image editing but pretty good on the artistic side. They lean into features like reference images (so objects or characters keep a consistent look), biasing the model towards your style preferences, or moodboards for generating a consistent style.
Not much, while everything happened at OpenAI/Google/Chinese companies. And that's the problem.
My response to the horse image:
<a href="https://i.postimg.cc/hG8nJ4cv/IMG-5289-copy.jpg" rel="nofollow">https://i.postimg.cc/hG8nJ4cv/IMG-5289-copy.jpg</a>
I recently tried out LMStudio on Linux for local models. So easy to use!<p>What Linux tools are you guys using for image generation models like Qwen's diffusion models, since LMStudio only supports text gen?
Practically anybody actually creating with this class of models (diffusion based mostly) is using ComfyUI.
The community takes care of quantization, repackaging into GGUF (most popular), and even speed optimization (lightning LoRAs, layer skipping). It's quite extensive.
I encourage everyone to at least try ComfyUI. It's come a long way in terms of user-friendliness particularly with all of the built-in Templates you can use.
Everything keeps changing so quickly that I basically have my own Python HTTP server with a unified JSON interface, which routes to any of the impls/*.py files for the actual generation; I have one of those per implementation/architecture. Mostly using `diffusers` for the inference, which isn't the fastest, but it tends to get new model architectures much sooner than everything else.
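Roughly the shape of what I mean, as a minimal sketch (model IDs and names below are placeholders I picked for illustration, not my actual impls):

    # Minimal sketch: one POST endpoint with a unified JSON body, routed to a
    # lazily loaded `diffusers` pipeline per architecture. Model IDs are
    # placeholders, not the actual impls/*.py described above.
    import base64, io, json
    from http.server import BaseHTTPRequestHandler, HTTPServer

    import torch
    from diffusers import AutoPipelineForText2Image

    PIPELINES = {
        "sdxl": "stabilityai/stable-diffusion-xl-base-1.0",
        "flux-dev": "black-forest-labs/FLUX.1-dev",
    }
    _cache = {}

    def get_pipeline(name):
        # Load each architecture once and keep it on the GPU.
        if name not in _cache:
            pipe = AutoPipelineForText2Image.from_pretrained(
                PIPELINES[name], torch_dtype=torch.bfloat16
            )
            _cache[name] = pipe.to("cuda")
        return _cache[name]

    class Handler(BaseHTTPRequestHandler):
        def do_POST(self):
            body = json.loads(self.rfile.read(int(self.headers["Content-Length"])))
            image = get_pipeline(body["model"])(
                prompt=body["prompt"],
                width=body.get("width", 1024),
                height=body.get("height", 1024),
                num_inference_steps=body.get("steps", 30),
            ).images[0]
            buf = io.BytesIO()
            image.save(buf, format="PNG")
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(json.dumps(
                {"image_png_base64": base64.b64encode(buf.getvalue()).decode()}
            ).encode())

    if __name__ == "__main__":
        HTTPServer(("0.0.0.0", 8000), Handler).serve_forever()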
If you're on an AMD platform Lemonade (<a href="https://lemonade-server.ai/" rel="nofollow">https://lemonade-server.ai/</a>) added image generation in version 9.2 (<a href="https://github.com/lemonade-sdk/lemonade/releases/tag/v9.2.0" rel="nofollow">https://github.com/lemonade-sdk/lemonade/releases/tag/v9.2.0</a>).
ComfyUI is the best for stable diffusion
FWIW you can use non-sd models in ComfyUI too, the ecosystem is pretty huge and supports most of the "mainstream" models, not only the stable diffusion ones, even video models and more too.
I have my own MIT licensed framework/UI: <a href="https://github.com/runvnc/mindroot" rel="nofollow">https://github.com/runvnc/mindroot</a>. With Nano Banana via runvnc/googleimageedit
Ollama is working on adding image generation but it's not here yet. We really do need something that can run a variety of models for images.
Yeah, I'm guessing they were bound to leave behind the whole "Get up and running with large language models" mission sooner or later; that was their initial focus, but after 2-3 years investors start making you think about expansion and earning back the money.<p>Sad state of affairs, and it seems they're enshittifying quicker than expected, but it was always a question of when, not if.
Koboldcpp has built-in support for image models. Model search and download, one executable to run, UI, OpenAI API endpoint, llama.cpp endpoint, highly configurable. If you want to get up and running instantly, just pick a kcppt file, open it, and it will download everything you need and load it for you.<p>Engine:<p>* <a href="https://github.com/LostRuins/koboldcpp/releases/latest/" rel="nofollow">https://github.com/LostRuins/koboldcpp/releases/latest/</a><p>Kcppt files:<p>* <a href="https://huggingface.co/koboldcpp/kcppt/tree/main" rel="nofollow">https://huggingface.co/koboldcpp/kcppt/tree/main</a>
I found the horse revenge-porn image at the end quite disturbing.
It's the year of the horse in their zodiac. The (translated) prompt is wild:<p>"""
A desolate grassland stretches into the distance, its ground dry and cracked. Fine dust is kicked up by vigorous activity, forming a faint grayish-brown mist in the low sky. Mid-ground, eye-level composition: A muscular, robust adult brown horse stands proudly, its forelegs heavily pressing between the shoulder blades and spine of a reclining man. Its hind legs are taut, its neck held high, its mane flying against the wind, its nostrils flared, and its eyes sharp and focused, exuding a primal sense of power. The subdued man is a white male, 30-40 years old, his face covered in dust and sweat, his short, messy dark brown hair plastered to his forehead, his thick beard slightly damp; he wears a badly worn, grey-green medieval-style robe, the fabric torn and stained with mud in several places, a thick hemp rope tied around his waist, and scratched ankle-high leather boots; his body is in a push-up position—his palms are pressed hard against the cracked, dry earth, his knuckles white, the veins in his arms bulging, his legs stretched straight back and taut, his toes digging into the ground, his entire torso trembling slightly from the weight. The background is a range of undulating grey-blue mountains, their outlines stark, their peaks hidden beneath a low-hanging, leaden-grey, cloudy sky. The thick clouds diffuse a soft, diffused light, which pours down naturally from the left front at a 45-degree angle, casting clear and voluminous shadows on the horse's belly, the back of the man's hands, and the cracked ground. The overall color scheme is strictly controlled within the earth tones: the horsehair is warm brown, the robe is a gradient of gray-green-brown, the soil is a mixture of ochre, dry yellow earth, and charcoal gray, the dust is light brownish-gray, and the sky is a transition from matte lead gray to cool gray with a faint glow at the bottom of the clouds. The image has a realistic, high-definition photographic quality, with extremely fine textures—you can see the sweat on the horse's neck, the wear and tear on the robe's warp and weft threads, the skin pores and stubble, the edges of the cracked soil, and the dust particles. The atmosphere is tense, primitive, and full of suffocating tension from a struggle of biological forces.
"""
I think they call it "horse riding a human", which could have taken two very different directions, and the direction the model seems to have taken was the less bad of the two.
At first I thought it was a clever prompt because you see which direction the model takes it, and whether it "corrects" it to the more common "human riding a horse", similar to the full wine glass test.<p>But if you translate the actual prompt, the term "riding" doesn't even appear. The prompt describes the exact thing you see in excruciating detail.<p>"... A muscular, robust adult brown horse standing proudly, its forelegs heavily pressing between the shoulder blades and spine of a reclining man ... and its eyes sharp and focused, exuding a primal sense of power. The subdued man is a white male, 30-40 years old, his face covered in dust and sweat ... his body is in a push-up position—his palms are pressed hard against the cracked, dry earth, his knuckles white, the veins in his arms bulging, his legs stretched straight back and taut, his toes digging into the ground, his entire torso trembling slightly from the weight ..."
> But if you translate the actual prompt the term riding doesn't even appear. The prompt describes the exact thing you see in excruciating detail.<p>Yeah, judging by the workflow they walk through earlier in the blog post, the prompt they share seems to have been generated from a different input, and that expanded prompt is what gets passed to the actual model. So the workflow is something like "user prompt input -> expand input with an LLM -> send expanded prompt to image model".<p>So I think "horse riding a human" is the user prompt, which gets expanded to what they share in the post, which is what the model actually uses. This is also how they've presented all their previous image models: user input is passed through an LLM for "expansion" first.<p>It seems poorly thought out not to make it 100% clear what the actual human-written prompt is, though; not sure why they wouldn't share that upfront.
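As a rough sketch of that two-stage pattern (the expansion model and instructions here are my own guesses, not Qwen's actual pipeline):

    # A short human prompt is expanded by an LLM into a detailed scene
    # description, and that expanded text is what the image model receives.
    # Model name and system instructions below are placeholders.
    from openai import OpenAI

    client = OpenAI()  # any OpenAI-compatible chat endpoint

    EXPANSION_INSTRUCTIONS = (
        "Rewrite the user's short image request as one highly detailed scene "
        "description: composition, subjects, lighting, palette, textures."
    )

    def expand_prompt(user_prompt):
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder chat model
            messages=[
                {"role": "system", "content": EXPANSION_INSTRUCTIONS},
                {"role": "user", "content": user_prompt},
            ],
        )
        return resp.choices[0].message.content

    detailed = expand_prompt("a horse riding a man")  # short human-written input
    # `detailed` ends up looking like the long paragraph quoted elsewhere in this
    # thread, and that is what gets sent to the text-to-image model.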
Is it related to "Mr Hands" ?
Won't someone think of the horses?
I use gen-AI to produce images daily, but honestly the infographics are 99% terrible.<p>LinkedIn is filled with them now.
To be fair it hasn't made LinkedIn any worse than it already was.
Infographics are only as bad as the author allows, though. Few people can make, or even describe, a good infographic, so that's what we see in the results too.
Infographics and full presentations are a NanoBananaPro exclusive so far.
Correct.<p>Much like the pointless ASCII diagrams in GitHub readmes (big rectangle with bullet points flows to another...), the diagrams are cognitive slurry.<p>See Gas Town for non-Qwen examples of how bad it can get:<p><a href="https://news.ycombinator.com/item?id=46746045">https://news.ycombinator.com/item?id=46746045</a><p>(Not commenting on the other results of this model outside of diagramming.)
The Chinese vertical typography is sadly a bit off. If punctuation marks are used at all, they should be the characters specifically designed for vertical text, like ︒(U+FE12 PRESENTATION FORM FOR VERTICAL IDEOGRAPHIC FULL STOP).
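The remapping itself is mechanical; a tiny sketch (covering only a few common marks, real layout engines handle far more):

    # Map standard CJK punctuation to the vertical presentation forms before
    # rendering vertical text. Only a handful of common marks are shown here.
    VERTICAL_FORMS = {
        "\u3002": "\uFE12",  # 。 -> ︒ VERTICAL IDEOGRAPHIC FULL STOP
        "\u3001": "\uFE11",  # 、 -> ︑ VERTICAL IDEOGRAPHIC COMMA
        "\uFF0C": "\uFE10",  # ， -> ︐ VERTICAL COMMA
        "\uFF1A": "\uFE13",  # ： -> ︓ VERTICAL COLON
    }

    def to_vertical_punctuation(text):
        return "".join(VERTICAL_FORMS.get(ch, ch) for ch in text)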
Cool, an alternate facts model.
Unfortunately no open weights, it seems.
So, I just gave it this prompt:<p><i>"Analyze this webpage: <a href="https://en.wikipedia.org/wiki/1989_Tiananmen_Square_protests_and_massacre" rel="nofollow">https://en.wikipedia.org/wiki/1989_Tiananmen_Square_protests...</a><p>Generate an infographic with all the data about the main event timeline and estimated number of victims.<p>The background image should be this one: <a href="https://en.wikipedia.org/wiki/Tank_Man#/media/File" rel="nofollow">https://en.wikipedia.org/wiki/Tank_Man#/media/File</a> :Tank_Man_(Tiananmen_Square_protester).jpg<p>Improve the background image clarity and resolution."</i><p>I received an error:<p><i>"Oops! There was an issue connecting to Qwen3-Max.
Content Security Warning: The input file data may contain inappropriate content."</i><p>I wonder whether locally running the model they published in December has the same censorship in place (i.e. whether it's already trained like this), or whether the censorship imposed by the Chinese regime is implemented only for the web service.
Interesting riding application picture.
Why is the only image featuring non-Asian men the one under the horse?
> Qwen-Image-2.0 not only accurately models the “riding” action but also meticulously renders the horse’s musculature and hair
> <a href="https://qianwen-res.oss-accelerate-overseas.aliyuncs.com/Qwen-Image/image2/12.png#center" rel="nofollow">https://qianwen-res.oss-accelerate-overseas.aliyuncs.com/Qwe...</a><p>What the actual fuck
For reference, below is the prompt translated (with my highlighting of the part that matters). They did very much ask for this version of "horse riding a man", not the "horse sitting upright on a crawling human" version<p>---<p>A desolate grassland stretches into the distance, its ground dry and cracked. Fine dust is kicked up by vigorous activity, forming a faint grayish-brown mist in the low sky.<p>Mid-ground, eye-level composition: <i>A muscular, robust adult brown horse stands proudly, its forelegs heavily pressing between the shoulder blades and spine of a reclining man</i>. Its hind legs are taut, its neck held high, its mane flying against the wind, its nostrils flared, and its eyes sharp and focused, exuding a primal sense of power. <i>The subdued man is a white male, 30-40 years old</i>, his face covered in dust and sweat, his short, messy dark brown hair plastered to his forehead, his thick beard slightly damp; he wears a badly worn, grey-green medieval-style robe, the fabric torn and stained with mud in several places, a thick hemp rope tied around his waist, and scratched ankle-high leather boots; <i>his body is in a push-up position</i>—his palms are pressed hard against the cracked, dry earth, his knuckles white, the veins in his arms bulging, his legs stretched straight back and taut, his toes digging into the ground, his entire torso trembling slightly from the weight.<p>The background is a range of undulating grey-blue mountains, their outlines stark, their peaks hidden beneath a low-hanging, leaden-grey, cloudy sky. The thick clouds diffuse a soft, diffused light, which pours down naturally from the left front at a 45-degree angle, casting clear and voluminous shadows on the horse's belly, the back of the man's hands, and the cracked ground.<p>The overall color scheme is strictly controlled within the earth tones: the horsehair is warm brown, the robe is a gradient of gray-green-brown, the soil is a mixture of ochre, dry yellow earth, and charcoal gray, the dust is light brownish-gray, and the sky is a transition from matte lead gray to cool gray with a faint glow at the bottom of the clouds.<p>The image has a realistic, high-definition photographic quality, with extremely fine textures—you can see the sweat on the horse's neck, the wear and tear on the robe's warp and weft threads, the skin pores and stubble, the edges of the cracked soil, and the dust particles. The atmosphere is tense, primitive, and full of suffocating tension from a struggle of biological forces.
I like how sometimes I get angry at a LLM for not understanding what I meant, but then I realize that I just forgot to mention it in the context. It's fun to see the same thing happen in humans reading websites too, where they don't understand the context yet react with strong feelings anyways.
The text rendering is quite impressive, but is it just me, or do all these generated 'realistic' images have a distinctly uncanny feel to them? I can't quite put my finger on what it is, but they just feel off to me.
I agree. They make me nauseous, the same kind of light nausea as car sickness.<p>I assume our brains are used to stuff we don't notice consciously, and reject very mild errors. I've stared at the picture a bit now and the finger holding the balloon is weird. The out-of-place snowman feels weird. If you follow the background blur around, it isn't at the same depth everywhere. Everything that reflects has reflections I can't see in the scene.<p>I don't feel good staring at it now, so I had to stop.
The lighting is wrong, that's what's telling to me. They look too crisp. No proper shadows, everything looks crystal clear.
Everything is weightless. When real people stand and gesture there's natural muscle use, hair and clothing drape, and papers lie flat on surfaces.
At least for the real life pictures, there’s no depth of field. Everything is crystal clear like it’s composited.
Which is pretty amusing - because it's the exact opposite problem that BFL had with the original Flux model - every single image looked like it was taken with a 200mm f/4.
> like it's composited<p>Like focus stacking, specifically.<p>I’m always surprised when people bother to point out more-subtle flaws in AI images as “tells”, when the “depth-of-field problem” is so easily spotted, and has been there in every AI image ever since the earliest models.
I had no problems getting images with a blurry background with the appropriate prompts. Something like "shallow depth of fields, bokeh, DSLR" can lead to good results. <a href="https://cdn.discordapp.com/attachments/1180506623475720222/1456789321091649769/image.png?ex=698bbd47&is=698a6bc7&hm=d949f70e4ac8b0173f8addc6947a1bade5482b54e592712da1c1950166f30f42&" rel="nofollow">https://cdn.discordapp.com/attachments/1180506623475720222/1...</a> [0]<p>But I found that that results in more professional-looking images, not more realistic photos.<p>Adding something like "selfy, Instagram, low resolution, flash" can lead to a worse image that looks more realistic.<p>[0] I think I did this one with Z-Image Turbo on my 4060 Ti
The blur isn't correct, though. The amount of blur is wrong for the distance, zoom amount, etc., so the depth of field is really wrong even if it conforms to "subject crisp, background blurred".
Every photoreal image on the demo page has depth of field, it’s just subtle.
Qwen has always suffered from its subpar RoPE implementation, and Qwen 2 seems to suffer from it as well. The uncanny feel comes down to the sparsity of text-to-image tokens, and the higher you go in resolution, the worse it gets. That's why you can't take the higher end of the MP numbers seriously, no matter the model. At the moment there is no model that can go to 4K without problems; you will always get high-frequency artifacts.
Agree, it looks like the same effect they're applying on YouTube Shorts...
For me, the only model that can really generate realistic images is nano banana pro (also known as gemini-3-pro-image). Other models are closing the gap, but this one is pretty meh at realistic images, in my opinion.
The complex prompt-following ability and editing here are seriously impressive. They don't seem to be much behind OpenAI and Google, which is backed up by the AI Arena ranking.
when the horsey tranq hits
Another closed model dressed up as "coming soon" open source. The pattern is obvious: generate hype with a polished demo, lock the weights, then quietly move on. Real open source doesn't need a press release countdown.
That's not what they did with Qwen-Image v1 - they announced it and made it available via API, but then released the weights a few weeks later under an Apache 2.0 license. Let's at least give them the benefit of the doubt here.
Good that we have the arbiter of what "real open source" is and isn't over here.
“Open source” is indeed an objective standard with actual criteria, and not just vibes.<p>Luckily, it seems previous Qwen models did get open-sourced in the actual sense, so this one probably will be, too.
Where do you see a press release countdown? Alibaba consistently doesn't release weights for their biggest models, but they also don't pretend that they do.