Training a Flux LoRA for blog heroes: three approaches that failed first

Last verified: May 25, 2026

Every blog post on this site needs a hero image, and for a while every hero image was a small argument with a diffusion model. The brief sounds trivial: one recognizable person, the site’s author, in a clean editorial scene that matches the article’s mood, 16:9, hundreds of times, in a recognizably consistent look. It took three failed approaches before the brief was actually met, and each failure taught something specific about where image generation breaks.

This is a build log, not a tutorial. The point is not “use a LoRA”; it is why the obvious cheaper options collapse exactly at the requirement that matters for an editorial blog: the same face, post after post.

#Consistent AI hero images: TL;DR in 4 points

  • A base text-to-image model gives you a competent stranger who changes face every render. Fine for one image, useless for a consistent author across a series.
  • A single-photo face reference (image-to-image) keeps the likeness but drifts under new lighting and angles, because it is interpolating from one example.
  • A Flux LoRA trained on six varied photos generalizes the identity and renders it into any scene from a trigger word. That is what finally held.
  • The bigger lesson is not about identity at all: heroes must convey mood, not literally illustrate the article. The moment a prompt asks for a laptop screen, the model produces uncanny pseudo-text. Stop asking.

#Glossary: diffusion, LoRA, trigger word, image-to-image

A few terms carry the whole story.

  • Diffusion model - the class of image generator (Flux, Imagen, Stable Diffusion) that starts from noise and denoises toward an image matching the prompt.
  • Text-to-image - generation from a prompt alone, no input image. Maximum freedom, zero identity control.
  • Image-to-image - generation conditioned on an input image, used here to carry a face from a reference photo into a new scene.
  • LoRA (low-rank adaptation) - a small trained add-on to a base model that teaches it one concept (here, a specific face) without retraining the whole model. Invoked with a trigger word.
  • Trigger word - a rare token (ours is MRZSZ) placed at the start of the prompt to activate the LoRA’s learned identity.
  • Aspect ratio - the hero slot is 16:9, so every image is generated at that ratio rather than cropped from a square.

#Approach one that failed: text-to-image gives you a stranger

The first instinct is the cheapest: describe the scene, let a text-to-image model render it. Google Imagen and Flux base both do this well at the level of a single picture. A man at a desk in warm light, shallow depth of field, looks professional and clean.

It fails the instant you generate the second one. The face is different. Not stylistically, structurally: a different person. Across a blog where the same author should anchor the visual identity from post to post, a gallery of competent strangers is worse than no people at all, because the inconsistency reads as carelessness. Text-to-image has no mechanism to hold an identity it was never given. This approach is still useful, but only for heroes that need no person at all: an abstract still life, a technical macro shot. For those, a text-to-image call is the right tool and nothing more is needed.

The requirement that killed it was never “a good image.” It was “the same person, two hundred times.”

#Approach two that failed: a face reference drifts

The obvious next step is image-to-image with a reference photo. Modern multimodal image models (Gemini’s image mode among them) take a photo of the subject and a scene prompt, and generate the new scene while trying to preserve the face. This is a real improvement: the likeness is broadly there.

It drifts. With one reference frame, the model is interpolating from a single example, so as the prompt pushes the lighting, angle, or distance away from that frame, the face quietly slides. Warm side-lighting subtly reshapes the jaw; a three-quarter angle softens features the reference never showed. Each individual image looks fine. Side by side across a series, the person is not quite the same person, and the uncanny near-miss is more distracting than an honest difference would be. You end up fighting the reference image on every generation, tuning strength values to trade likeness against scene freedom, and never fully winning either.

The lesson: one example preserves a likeness; it does not generalize an identity.

#Approach three that failed: a LoRA that renders screens

Training a dedicated LoRA fixed the identity problem cleanly. The model, mariusz-face-lora on Replicate, was trained on 2026-05-24 on six real photos chosen for variety in angle, light, and expression, all with clean backgrounds. It is invoked with the trigger word MRZSZ at the start of every prompt. Six varied photos generalize the face better than a larger, monotonous set would, because variety is what teaches the model the identity rather than the look of one room.

With identity solved, the third failure appeared, and it had nothing to do with faces. The early prompts tried to illustrate each article literally: the author at a laptop showing a security dashboard, a screen full of code, a chart on a monitor. Flux rendered the person perfectly and the screen as a hallucination. Diffusion models cannot produce coherent screen content; what comes out is glyph-shaped pseudo-text and charts with impossible geometry, and the eye catches it instantly. No prompt engineering fixes this, because the model has no concept of legible UI; it only knows what screens look like as texture.

So the literal-illustration instinct was the third thing to abandon.

#What actually worked: identity from a LoRA, scenes built on mood

The working formula has two halves. Identity comes from the LoRA: trigger word first, 16:9, one output per call, no reference image to manage. Scenes are built on mood, not literal keywords. A security article does not get a security dashboard; it gets a calm, analytical desk portrait in warm focused light. A performance article gets a different atmosphere, not a Lighthouse score on a screen. The props are chosen for what the model can render reliably: a closed laptop, a notebook, a coffee mug, a pen. Open screens, phones displaying apps, anything with text on a surface are kept out of frame.
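
The two halves of that formula can be sketched as a small prompt builder. This is a minimal illustration, not the site's actual script: the function names and the deny-list are hypothetical, and the payload keys (`prompt`, `aspect_ratio`, `num_outputs`) are assumptions that should be checked against the input schema of whichever Flux LoRA endpoint is used.

```python
import re

TRIGGER_WORD = "MRZSZ"  # trigger word named in the article

# Hypothetical deny-list: surfaces the model tends to fill with pseudo-text.
BANNED = re.compile(r"\b(screen|dashboard|monitor|chart|phone)s?\b", re.IGNORECASE)


def build_hero_prompt(mood_scene: str) -> str:
    """Put the trigger word first and reject scenes that ask for screen content."""
    hits = sorted({m.lower() for m in BANNED.findall(mood_scene)})
    if hits:
        raise ValueError(f"scene asks for unreliable content: {', '.join(hits)}")
    return f"{TRIGGER_WORD} {mood_scene.strip()}"


def build_input(mood_scene: str) -> dict:
    """Assemble one request payload: 16:9, a single output, no reference image."""
    return {
        "prompt": build_hero_prompt(mood_scene),
        "aspect_ratio": "16:9",  # assumed key name
        "num_outputs": 1,        # assumed key name
    }


payload = build_input(
    "calm analytical desk portrait, warm focused light, closed laptop, notebook"
)
```

Failing loudly on a banned prop, rather than silently rewriting the scene, keeps the "stop asking" rule visible at the point where a prompt is authored.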

This also made the pipeline programmatic. Articles are bucketed into clusters (ai, security, performance, headless, plugins, seo, tutorial, strategy), each cluster mapped to a mood scene template, and a backfill script can generate a consistent hero for any post from its cluster and the trigger word. Identity is constant by construction; mood varies by topic; nothing in the frame asks the model to do something it cannot. More build notes from this site live on the wppoland blog.
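
A rough sketch of that cluster-to-template mapping follows. The cluster names come from the list above; everything else is assumed for illustration: the `Post` shape, the `has_hero` flag, and the scene wording for each cluster are placeholders, not the templates the site actually uses.

```python
from dataclasses import dataclass

TRIGGER_WORD = "MRZSZ"

# Cluster names are from the article; the scene text is illustrative only.
MOOD_SCENES = {
    "ai": "thoughtful portrait by a window, soft daylight, notebook on the desk",
    "security": "calm analytical desk portrait, warm focused light, closed laptop",
    "performance": "energetic standing portrait, crisp morning light, coffee mug",
    "headless": "minimal studio portrait, cool neutral light, empty desk",
    "plugins": "workshop-style desk portrait, warm lamp light, pen and notebook",
    "seo": "relaxed cafe portrait, diffuse daylight, closed laptop",
    "tutorial": "friendly desk portrait, bright even light, open notebook and pen",
    "strategy": "contemplative portrait, late afternoon light, coffee mug",
}


@dataclass(frozen=True)
class Post:
    slug: str
    cluster: str
    has_hero: bool = False


def hero_prompt(post: Post) -> str:
    """Constant identity (trigger word) plus the mood scene for the post's cluster."""
    scene = MOOD_SCENES[post.cluster]  # an unknown cluster raises KeyError on purpose
    return f"{TRIGGER_WORD} {scene}"


def backfill(posts: list[Post]) -> dict[str, str]:
    """Return {slug: prompt} for every post that still lacks a hero image."""
    return {p.slug: hero_prompt(p) for p in posts if not p.has_hero}
```

Because the prompt is a pure function of the cluster, re-running the backfill for a post always yields the same scene description; only the model's sampling varies.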

#How six photos become a stable identity

The surprising part of the training run was how few photos it took, and how much more the selection mattered than the count. The set was six images, chosen so that no two shared the same angle, lighting, or expression, all with uncluttered backgrounds. The variety is the actual teaching signal: it tells the model which features are the person and which are incidental to one photo. A set of twelve near-identical headshots would have taught the model less, because it would have had no way to separate identity from the lighting of that single setup, and the face would bind to one room.

Two smaller choices carried weight. The trigger word MRZSZ is deliberately not a real word in any of the blog’s six languages; a rare token avoids colliding with vocabulary the base model already associates with other concepts, so activating the identity does not drag in unrelated associations. And clean backgrounds in the training photos keep the LoRA from learning a setting along with the face, which is what frees the prompt to place the same person in any scene afterward. None of this is exotic. It is the difference between a LoRA that generalizes and one that memorizes.
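
The selection rules in this section can be turned into a pre-flight check before a training run. The sketch below assumes hand-assigned labels for each photo (the `Photo` fields and label values are hypothetical) and reads the variety rule strictly: no two photos may share an angle, a lighting setup, or an expression. The trigger-word check compares against a plain word list, which is only a rough proxy for what the base model's tokenizer already knows.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Photo:
    filename: str
    angle: str
    lighting: str
    expression: str
    clean_background: bool


def variety_problems(photos: list[Photo]) -> list[str]:
    """List every rule violation; an empty list means the set passes."""
    problems = [
        f"{p.filename}: cluttered background"
        for p in photos
        if not p.clean_background
    ]
    # Compare every pair once and flag any attribute they have in common.
    for a, b in combinations(photos, 2):
        for attr in ("angle", "lighting", "expression"):
            if getattr(a, attr) == getattr(b, attr):
                problems.append(f"{a.filename} / {b.filename}: same {attr}")
    return problems


def trigger_word_ok(token: str, known_words: list[str]) -> bool:
    """True if the token does not collide (case-insensitively) with a known word."""
    return token.lower() not in {w.lower() for w in known_words}
```

For example, adding a second front-facing photo with a busy background to an otherwise valid set would produce two entries: one for the background and one for the repeated angle.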

#The two-pipeline setup: a LoRA and a fallback

The LoRA did not retire the other tools; it took its proper place beside them. The site keeps two main generation paths because not every hero needs a face. When the author should appear, the Replicate Flux LoRA renders the identity into a mood scene. When the article calls for an abstract or technical image with no person at all (a still life, a macro shot of hardware), a plain text-to-image call through Imagen is the cheaper, freer tool, at a few cents per image and with no reference to manage. A third path, image-to-image with a face reference, is retained for the rare case where a specific real photo, not the generalized identity, is the right starting point.

The principle behind keeping all three is that each solves a different shape of problem, and forcing one tool to cover all of them is what produced the earlier failures. The decision tree is short: person needed and consistency matters, use the LoRA; no person, use text-to-image; one specific real frame, use image-to-image. Routing the request to the right path is most of the quality.
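
That decision tree is small enough to write down directly. This is a sketch with hypothetical names; the order of the checks, and sending every remaining "person needed" case to the LoRA, are assumptions where the rule of thumb above leaves the choice open.

```python
from enum import Enum


class Pipeline(Enum):
    LORA = "flux-lora"
    TEXT_TO_IMAGE = "text-to-image"
    IMAGE_TO_IMAGE = "image-to-image"


def choose_pipeline(needs_person: bool, specific_photo: bool = False) -> Pipeline:
    """Route a hero request to one of the three generation paths."""
    if not needs_person:
        # Abstract or technical image: no identity to hold.
        return Pipeline.TEXT_TO_IMAGE
    if specific_photo:
        # One particular real frame is the starting point.
        return Pipeline.IMAGE_TO_IMAGE
    # The author appears and must look the same across the series.
    return Pipeline.LORA
```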

#When a LoRA is not worth it

The honest counterweight: training a face LoRA is overkill for low volume. If you need a handful of images a year, the per-image face-reference tools are simpler, need no training run, and the drift across three or four images is tolerable. The LoRA earns its training cost only when two conditions hold together: enough volume that per-image reference management becomes a chore, and a real need for one consistent identity across a series. An editorial blog with hundreds of posts and a single author face meets both. A landing page with three illustrations does not.

The general lesson outlasts the specific tools. Each failed approach failed at a different layer: text-to-image at identity, image-to-image at generalization, the first LoRA at the limits of what diffusion can draw. Picking the right tool meant naming which layer the requirement actually lived in. The requirement was never “make a nice image.” It was “the same person, in a believable scene, two hundred times,” and only the last approach was built for that sentence.


Why not just use text-to-image for blog hero images?
Text-to-image with no reference produces a competent but generic person who changes face from one image to the next. For a one-off illustration that is fine. For an editorial blog where the same author should appear across hundreds of posts, identity consistency is the whole point, and a base text-to-image model cannot hold a single face across a series. You get a different stranger every time.
What is a Flux LoRA and why does it beat a face reference?
A LoRA (low-rank adaptation) is a small set of trained weights that teaches a base diffusion model a specific concept, here one person's face, without retraining the whole model. Once trained, you invoke it with a trigger word and the model renders that face in any scene you prompt. A single-image face reference (image-to-image) preserves likeness from one photo but drifts under new lighting and angles, because it is interpolating from one example. A LoRA trained on several photos generalizes the identity instead of copying one frame.
How many photos do you need to train a face LoRA?
The model behind this blog's heroes was trained on six real photos. The decisive factor is not raw count but variety: different angles, lighting, and expressions, with clean backgrounds so the training does not bind the identity to one room. Six varied photos held identity better than a larger but monotonous set would have.
Why do AI hero images look uncanny when they show laptop screens?
Because diffusion models cannot render coherent screen content. Asked for a laptop showing a security dashboard, the model invents glyph-like pseudo-text and impossible chart shapes that read as wrong at a glance. The fix is not a better prompt; it is to stop asking. Heroes should convey mood, not literally illustrate the article. Closed laptops, notebooks, a coffee mug, and a pen are reliable props; open screens are not.
Is a trained LoRA worth it over per-image face-reference tools?
For a large content operation, yes. The training is a one-time cost and every subsequent image is a single API call with a trigger word and a mood prompt, with no reference image to manage and no per-image likeness fight. For a handful of images a year, a face-reference image-to-image tool is simpler and cheaper. The break-even is volume and the need for a consistent identity across a series.
