Image Generation
Text-to-image
Generating images from a written prompt. Almost all modern image generation is built on diffusion models.
Image generation means the model creates a new image, either from scratch or by transforming an existing one. Common variants:
- Text-to-image: give a prompt → get an image ("cat in rain…")
- Image-to-image: convert one image's style (photo → oil painting)
- Inpainting: change/fill a specific region of an image (see the sketch after this list)
- Outpainting: extend image borders (wider angle)
- ControlNet: guide generation with structural inputs like pose, edge maps, depth
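To make inpainting concrete, here is a minimal sketch using the Hugging Face diffusers library. The model id, `photo.png`, and `mask.png` are illustrative assumptions, not fixed choices; the mask marks the region to repaint in white.

```python
import torch
from diffusers import StableDiffusionInpaintPipeline
from PIL import Image

# Illustrative checkpoint; any SD inpainting checkpoint works the same way.
pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-inpainting", torch_dtype=torch.float16
).to("cuda")

init = Image.open("photo.png").convert("RGB")  # source image (assumed path)
mask = Image.open("mask.png").convert("RGB")   # white pixels = repaint here

result = pipe(
    prompt="a red vintage armchair",           # what to paint into the mask
    image=init,
    mask_image=mask,
).images[0]
result.save("inpainted.png")
```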
Modern stack: a text encoder (CLIP or T5) converts the prompt into an embedding; a diffusion model conditioned on that embedding denoises random noise into an image. Common tools: Midjourney, DALL·E 3, Stable Diffusion, Flux, Imagen, Recraft.
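In code, the whole stack collapses into a few lines. A minimal text-to-image sketch with diffusers, assuming a CUDA GPU; the checkpoint name is just one possible choice:

```python
import torch
from diffusers import StableDiffusionPipeline

# One pipeline bundles the text encoder, the U-Net, and the VAE decoder.
pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to("cuda")

# The text encoder embeds the prompt; the U-Net denoises random latents
# conditioned on that embedding; the VAE decodes the latents into pixels.
image = pipe("a cat sitting in the rain, photorealistic").images[0]
image.save("cat_in_rain.png")
```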
"Write a prompt instead of using Photoshop" — now standard in design, marketing, and content production.
Like giving a recipe to a cook. "Mildly spicy, sour, red, hot soup" — the cook interprets and produces a dish in their own style. Ask twice and you get two dishes, both fitting the recipe. Image generation works the same way: prompt = recipe, model = cook.
You need a "winter sale" banner for an e-commerce site. Before: find a stock photo (licensed, generic), edit it in Photoshop, roughly 2 hours. Now:
Midjourney: "snowy storefront window, warm yellow lights, sale sign, photorealistic, 16:9 banner, soft cinematic lighting"
30 seconds, 4 variants. Pick one, upscale, fix small details with inpainting. Total: 5 minutes, $0.10.
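The same workflow can be scripted. A sketch of the banner job reusing the `pipe` from the earlier example; `num_images_per_prompt` requests several candidates in one call, and the size/count here are illustrative:

```python
prompt = ("snowy storefront window, warm yellow lights, sale sign, "
          "photorealistic, soft cinematic lighting")

# Four candidates at a 16:9 banner size; pick one, then upscale/inpaint it.
images = pipe(prompt, num_images_per_prompt=4, width=1024, height=576).images
for i, img in enumerate(images):
    img.save(f"banner_variant_{i}.png")
```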
The same job was effectively impossible in 2020 (GANs weren't good enough), barely doable in 2023, and standard practice by 2025.
Where it works well:
- Ads, banners, social — quick variant generation
- Concept design — generate 50 variants in idea phase, pick the best
- Stock photo replacement — no licensing, exact framing you want
- Photo retouch via inpainting — change background, add/remove objects
- Character/avatar generation — games, profile pics, illustration
Where it still struggles:
- Consistency (the same character in different poses) — models are still weak at this
- Images with accurate text — diffusion text rendering is poor
- Precise anatomy — hands, fingers, face proportions are tricky
- Real-time apps — one image takes ~5-30s
- A specific person or brand logo — the base model doesn't know it (needs a LoRA fine-tune; see the sketch after this list)
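A LoRA is a small adapter trained on your own images (a specific character, product, or brand style) and loaded on top of the base model. A sketch with diffusers; the adapter path is hypothetical:

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to("cuda")

# Hypothetical adapter, trained beforehand on your brand/character images.
pipe.load_lora_weights("path/to/brand-style-lora")

image = pipe("product shot of a ceramic mug, studio lighting").images[0]
image.save("brand_mug.png")
```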
Prompt = recipe, not a wish
Just "cat" → generic output. "Photorealistic black cat sitting on a velvet armchair, soft afternoon window light, 50mm lens, shallow depth of field" → professional output. Always specify lighting, composition, style, and camera angle.
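One way to build that habit is to assemble prompts from named components instead of a bare subject. A tiny sketch; the field names are just a convention, and `pipe` comes from the earlier example:

```python
parts = {
    "subject":  "photorealistic black cat sitting on a velvet armchair",
    "lighting": "soft afternoon window light",
    "camera":   "50mm lens, shallow depth of field",
}
prompt = ", ".join(parts.values())  # every component makes it into the prompt
image = pipe(prompt).images[0]
```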
Copyright question
Output copyright is murky: training data may contain copyrighted material. Read the terms of service before commercial use — Midjourney and Adobe Firefly, for example, differ.
Don't expect identical results
Same prompt + different seed = different image. Pin the seed for reproducibility. For brand consistency use LoRA fine-tunes or reference images.
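Pinning the seed in code means passing an explicit generator; the same prompt, seed, and model version then reproduce the same image. A sketch reusing the earlier `pipe`:

```python
import torch

# Fixed seed -> the same initial noise -> the same image on reruns.
generator = torch.Generator(device="cuda").manual_seed(42)
image = pipe(
    "photorealistic black cat on a velvet armchair",
    generator=generator,
).images[0]
```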