Every blog post on this site needs a hero image, and for a while every hero image was a small argument with a diffusion model. The brief sounds trivial: one recognizable person, the site’s author, in a clean editorial scene that matches the article’s mood, 16:9, hundreds of times, in a recognizably consistent look. It took three failed approaches before the brief was actually met, and each failure taught something specific about where image generation breaks.
This is a build log, not a tutorial. The point is not “use a LoRA”; it is why the obvious cheaper options collapse exactly at the requirement that matters for an editorial blog: the same face, post after post.
Consistent AI hero images: TL;DR in 4 points
- A base text-to-image model gives you a competent stranger who changes face every render. Fine for one image, useless for a consistent author across a series.
- A single-photo face reference (image-to-image) keeps the likeness but drifts under new lighting and angles, because it is interpolating from one example.
- A Flux LoRA trained on six varied photos generalizes the identity and renders it into any scene from a trigger word. That is what finally held.
- The bigger lesson is not about identity at all: heroes must convey mood, not literally illustrate the article. The moment a prompt asks for a laptop screen, the model produces uncanny pseudo-text. Stop asking.
Glossary: diffusion, LoRA, trigger word, image-to-image
A few terms carry the whole story.
- Diffusion model - the class of image generator (Flux, Imagen, Stable Diffusion) that starts from noise and denoises toward an image matching the prompt.
- Text-to-image - generation from a prompt alone, no input image. Maximum freedom, zero identity control.
- Image-to-image - generation conditioned on an input image, used here to carry a face from a reference photo into a new scene.
- LoRA (low-rank adaptation) - a small trained add-on to a base model that teaches it one concept (here, a specific face) without retraining the whole model. Invoked with a trigger word.
- Trigger word - a rare token (ours is MRZSZ) placed at the start of the prompt to activate the LoRA’s learned identity.
- Aspect ratio - the hero slot is 16:9, so every image is generated at that ratio rather than cropped from a square.
Approach one that failed: text-to-image gives you a stranger
The first instinct is the cheapest: describe the scene, let a text-to-image model render it. Google Imagen and Flux base both do this well at the level of a single picture. A man at a desk in warm light, shallow depth of field, looks professional and clean.
It fails the instant you generate the second one. The face is different. Not stylistically, structurally: a different person. Across a blog where the same author should anchor the visual identity from post to post, a gallery of competent strangers is worse than no people at all, because the inconsistency reads as carelessness. Text-to-image has no mechanism to hold an identity it was never given. This approach is still useful, but only for heroes that need no person at all: an abstract still life, a technical macro shot. For those, a text-to-image call is the right tool and nothing more is needed.
The requirement that killed it was never “a good image.” It was “the same person, two hundred times.”
Approach two that failed: a face reference drifts
The obvious next step is image-to-image with a reference photo. Modern multimodal image models (Gemini’s image mode among them) take a photo of the subject and a scene prompt, and generate the new scene while trying to preserve the face. This is a real improvement: the likeness is broadly there.
It drifts. With one reference frame, the model is interpolating from a single example, so as the prompt pushes the lighting, angle, or distance away from that frame, the face quietly slides. Warm side-lighting subtly reshapes the jaw; a three-quarter angle softens features the reference never showed. Each individual image looks fine. Side by side across a series, the person is not quite the same person, and the uncanny near-miss is more distracting than an honest difference would be. You end up fighting the reference image on every generation, tuning strength values to trade likeness against scene freedom, and never fully winning either.
The lesson: one example preserves a likeness; it does not generalize an identity.
Approach three that failed: a LoRA that renders screens
Training a dedicated LoRA fixed the identity problem cleanly. The model, mariusz-face-lora on Replicate, was trained on 2026-05-24 on six real photos with clean backgrounds, chosen for variety in angle, light, and expression. It is invoked with the trigger word MRZSZ at the start of every prompt. Six varied photos generalize the face far better than a larger, monotonous set, because variety is what teaches the model the identity rather than the look of one room.
With identity solved, the third failure appeared, and it had nothing to do with faces. The early prompts tried to illustrate each article literally: the author at a laptop showing a security dashboard, a screen full of code, a chart on a monitor. Flux rendered the person perfectly and the screen as a hallucination. Diffusion models cannot produce coherent screen content; what comes out is glyph-shaped pseudo-text and charts with impossible geometry, and the eye catches it instantly. No prompt engineering fixes this, because the model has no concept of legible UI; it only knows what screens look like as texture.
So the literal-illustration instinct was the third thing to abandon.
What actually worked: identity from a LoRA, scenes built on mood
The working formula has two halves. Identity comes from the LoRA: trigger word first, 16:9, one output per call, no reference image to manage. Scenes are built on mood, not literal keywords. A security article does not get a security dashboard; it gets a calm, analytical desk portrait in warm focused light. A performance article gets a different atmosphere, not a Lighthouse score on a screen. The props are chosen for what the model can render reliably: a closed laptop, a notebook, a coffee mug, a pen. Open screens, phones displaying apps, anything with text on a surface are kept out of frame.
This also made the pipeline programmatic. Articles are bucketed into clusters (ai, security, performance, headless, plugins, seo, tutorial, strategy), each cluster mapped to a mood scene template, and a backfill script can generate a consistent hero for any post from its cluster and the trigger word. Identity is constant by construction; mood varies by topic; nothing in the frame asks the model to do something it cannot. More build notes from this site live on the wppoland blog.
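The cluster-to-mood mapping can be sketched in a few lines of Python. This is a minimal illustration, not the site's actual backfill script: the cluster names, the trigger word, the 16:9 ratio, the safe props, and the security scene wording come from the text above, while every other scene string, the banned-word list, the function names, and the request field names (`prompt`, `aspect_ratio`, `num_outputs`) are assumptions made for the example.

```python
import re

TRIGGER_WORD = "MRZSZ"  # trigger word named in the post
ASPECT_RATIO = "16:9"   # hero slot ratio named in the post

# Cluster names come from the post. Only the "security" wording is taken from
# it; every other scene string is a hypothetical placeholder.
MOOD_SCENES = {
    "ai": "thoughtful portrait by a window, soft diffuse daylight",
    "security": "calm, analytical desk portrait in warm focused light",
    "performance": "energetic standing portrait, crisp morning light",
    "headless": "minimal studio portrait, cool neutral light",
    "plugins": "workshop table portrait, practical warm light",
    "seo": "relaxed cafe portrait, bright natural light",
    "tutorial": "friendly desk portrait, even soft light",
    "strategy": "wide office portrait at dusk, low golden light",
}

# Props the post lists as reliable to render.
SAFE_PROPS = ("closed laptop", "notebook", "coffee mug", "pen")

# Assumed blocklist for the things the post keeps out of frame.
BANNED_WORDS = {"screen", "dashboard", "monitor", "chart", "phone", "code", "text"}


def banned_words_in(prompt: str) -> set[str]:
    """Return the banned words that appear as whole words in the prompt."""
    return set(re.findall(r"[a-z]+", prompt.lower())) & BANNED_WORDS


def build_hero_request(cluster: str, prop: str = "notebook") -> dict:
    """Build a generation request: trigger word first, mood scene, one safe prop."""
    if cluster not in MOOD_SCENES:
        raise KeyError(f"unknown cluster: {cluster}")
    if prop not in SAFE_PROPS:
        raise ValueError(f"prop not on the safe list: {prop}")
    prompt = f"{TRIGGER_WORD}, {MOOD_SCENES[cluster]}, {prop} in frame, editorial photograph"
    found = banned_words_in(prompt)
    if found:
        raise ValueError(f"prompt asks for unrenderable content: {sorted(found)}")
    # Field names are illustrative; adapt them to the input schema of the model you call.
    return {"prompt": prompt, "aspect_ratio": ASPECT_RATIO, "num_outputs": 1}
```

Calling `build_hero_request("security")` yields a prompt that starts with the trigger word and describes a mood rather than a dashboard. The guard on banned words is the programmatic form of the rule that nothing in the frame should ask for legible screens or text.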
How six photos become a stable identity
The surprising part of the training run was how few photos it took, and how much more the selection mattered than the count. The set was six images, chosen so that no two shared the same angle, lighting, or expression, all with uncluttered backgrounds. The variety is the actual teaching signal: it tells the model which features are the person and which are incidental to one photo. A set of twelve near-identical headshots would have taught the model less, because it would have had no way to separate identity from the lighting of that single setup, and the face would bind to one room.
Two smaller choices carried weight. The trigger word MRZSZ is deliberately not a real word in any of the blog’s six languages; a rare token avoids colliding with vocabulary the base model already associates with other concepts, so activating the identity does not drag in unrelated associations. And clean backgrounds in the training photos keep the LoRA from learning a setting along with the face, which is what frees the prompt to place the same person in any scene afterward. None of this is exotic. It is the difference between a LoRA that generalizes and one that memorizes.
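The selection rules above are easy to check before paying for a training run. The sketch below is a hypothetical pre-flight check, not part of the described pipeline: the `TrainingPhoto` fields and the hand-written labels are assumptions, and it reads "no two shared the same angle, lighting, or expression" strictly, flagging any pair that matches on even one of the three.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class TrainingPhoto:
    # Labels are assigned by hand when curating the set (hypothetical schema).
    filename: str
    angle: str
    lighting: str
    expression: str
    clean_background: bool


def variety_problems(photos: list[TrainingPhoto]) -> list[str]:
    """List every way the set breaks the variety and clean-background rules."""
    problems = []
    for photo in photos:
        if not photo.clean_background:
            problems.append(f"{photo.filename}: cluttered background")
    # Compare every pair once; a shared value on any attribute is a problem.
    for a, b in combinations(photos, 2):
        for attr in ("angle", "lighting", "expression"):
            if getattr(a, attr) == getattr(b, attr):
                problems.append(
                    f"{a.filename} and {b.filename} share {attr}: {getattr(a, attr)}"
                )
    return problems
```

An empty list means the set passes. Twelve near-identical headshots would produce a long list of shared lighting and angle values, which is the memorization risk described above made visible.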
The two-pipeline setup: a LoRA and a fallback
The LoRA did not retire the other tools; it took its proper place beside them. The site keeps two main generation paths because not every hero needs a face. When the author should appear, the Replicate Flux LoRA renders the identity into a mood scene. When the article calls for an abstract or technical image with no person at all (a still life, a macro shot of hardware), a plain text-to-image call through Imagen is the cheaper, freer tool, at a few cents per image and no reference to manage. A third, rarely used path is also retained: image-to-image with a face reference, for the rare case where a specific real photo, not the generalized identity, is the right starting point.
The principle behind keeping all three is that each solves a different shape of problem, and forcing one tool to cover all of them is what produced the earlier failures. The decision tree is short: person needed and consistency matters, use the LoRA; no person, use text-to-image; one specific real frame, use image-to-image. Routing the request to the right path is most of the quality.
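The decision tree is small enough to write down as a function. This is a sketch under assumptions: the enum, the function name, and the two boolean inputs are invented for illustration, and the order of the checks (no person first, then a specific real photo, then the LoRA) is one reasonable reading of the rule rather than something the text specifies.

```python
from enum import Enum


class Pipeline(Enum):
    LORA = "flux-lora"
    TEXT_TO_IMAGE = "text-to-image"
    IMAGE_TO_IMAGE = "image-to-image"


def choose_pipeline(needs_person: bool, has_specific_photo: bool = False) -> Pipeline:
    """Route a hero request to one of the three generation paths."""
    if not needs_person:
        # Abstract still life or hardware macro: no identity to preserve.
        return Pipeline.TEXT_TO_IMAGE
    if has_specific_photo:
        # Rare case: one particular real frame is the right starting point.
        return Pipeline.IMAGE_TO_IMAGE
    # Default for the author's face across the series.
    return Pipeline.LORA
```

For example, `choose_pipeline(needs_person=True)` returns `Pipeline.LORA`, while `choose_pipeline(needs_person=False)` returns `Pipeline.TEXT_TO_IMAGE`.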
When a LoRA is not worth it
The honest counterweight: training a face LoRA is overkill for low volume. If you need a handful of images a year, the per-image face-reference tools are simpler, need no training run, and the drift across three or four images is tolerable. The LoRA earns its training cost only when two conditions hold together: enough volume that per-image reference management becomes a chore, and a real need for one consistent identity across a series. An editorial blog with hundreds of posts and a single author face meets both. A landing page with three illustrations does not.
The general lesson outlasts the specific tools. Each failed approach failed at a different layer: text-to-image at identity, image-to-image at generalization, the first LoRA at the limits of what diffusion can draw. Picking the right tool meant naming which layer the requirement actually lived in. The requirement was never “make a nice image.” It was “the same person, in a believable scene, two hundred times,” and only the last approach was built for that sentence.



