Cyclops X-ray: DALL-E 3 failed at what DALL-E 2 had already achieved: the single eye [prompts: Nei Bomfim/Immersera; post-production: ReNascimento]
How Do AIs Generate Visuals? Know the Essentials and Face the 0.1-Sec-Positioning-or-Scrolling Challenge
BMW proves it: ignoring how Gen AIs create scroll-stopping visuals (some are based on cat vision, for example) is as unthinkable as ignoring the ABCs of the internet
Hooking audiences within 0.1 seconds (University of Missouri) has never been such a life-or-death matter as in our era, which we might call AImmersive. Brands online face a relentless challenge: the 0.1-Sec-Positioning-or-Scrolling, my name for the task of capturing a desired audience within a tenth of a second (or being ignored, the far likelier alternative).
And what better brake do we have for this 0.1-Sec-Scrolling than visuals? Our brain deciphers the elements of an image simultaneously, while text and audio proceed phoneme after phoneme; and about 70% of the body's sensory receptors are concentrated in our eyes. Relying on Gen AIs to generate visuals calibrated for 0.1-Sec-Positioning is therefore very, very attractive.
However, Gen AIs rely on different architectures, based on everything from the cat's visual cortex to the intentional scrambling of images. These architectures, together with the other technologies in each model, give Gen AIs very different strengths and weaknesses, and equally varied profiles in their visual deliveries.
All of this makes ignoring the essentials of a Gen AI as unthinkable as ignoring the essentials of the internet. That ignorance drives a tragic equation, F = C × I × E: the growing Complexity (C) of Gen AI architectures, multiplied by Ignorance (I) of their essentials, yields a Frustration (F) rate directly proportional to the Extraordinariness (E) of the very possibilities that Complexity generates.
1. We Are a Visual Species
Mastering the essentials of the base architectures isn't just nerd charm: it's a key to creativity and optimization (and, yes, the architectures are charming in themselves, whether you're a nerd or not). It's no coincidence that a BMW campaign to envision futuristic cars opted not for a specific Gen AI but for one of these base architectures (a GAN; more below).
Starting from the beginning, here is the data I've gathered on why and how we humans are visually oriented:
- Our eyes concentrate about 70% of all body sensors;
- 40% of our cortex is involved in processing/understanding visual information;
- 80% of all information reaches us through our eyes;
- We react first to visuals, and retain them; 93% of communication is non-verbal;
- Images go straight to long-term memory; words go to short-term memory, which holds only about seven items;
- Our brain deciphers the elements of an image simultaneously; with text or audio, it's phoneme after phoneme, word after word, or, with music, note after note;
- Cognitively, visuals accelerate and increase comprehension, recall, retention, and decoding of text;
- Emotionally, they stimulate other areas of the brain, generating deeper and more accurate understanding — and decisions.
2. Generating Visuals: Step by Step
Our being such a visual species led Gen AIs to prioritize visuals. Let's follow a typical visual-generation process in a Gen AI.
I gathered this information both in practice (e.g., testing nine Gen AI apps, free and paid, to create Immersera's Visual ID) and via certifications in AI and related areas, such as visual design, plus a great deal of research and cross-checking across AIs, leading and lesser-known alike.
This article alone required many prompts and replies, several of them long, across several Gen AIs, as well as web checks. (Even if you are a Gen AI expert, the stitching with 0.1-Sec-Positioning should interest you; and if you find errors, please point them out.)
2.1 Start
- Prompt: The user enters their text prompt. (Tools like DALL-E and Midjourney are known for their flexibility here: they accept anything from succinct instructions to elaborate descriptions, and replies, for more detailed or creative results.)
- Tokenization: Words are converted into numbers (tokens, from the Old English "tācen": physical tokens once proved identity or authorization, roughly from the 14th to the 16th centuries). Tokenization involves three main steps:
- Fragmentation: Breaks down the text into individual words or pieces of them.
- Transformation: Each of these pieces of text is converted into a token, a unique identifier.
- Numerical assignment: The tokens are then translated into numbers the machine can understand and manipulate, much as words were once translated into Morse code. Illustrative examples:
- “Passion” could be tokenized as [645, 330, 789];
- “Challenge” would become [456, 234, 567];
- “Artificial Intelligence” would turn into [123, 789, 456, 890].
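The three steps above can be sketched in a few lines of Python. The vocabulary and token IDs below are invented for illustration; real models use learned subword vocabularies with tens of thousands of entries.

```python
# Minimal tokenizer sketch: fragmentation, transformation, numerical
# assignment. The vocabulary and IDs are illustrative, not any model's.

def tokenize(text, vocab):
    # 1. Fragmentation: break the text into individual words
    #    (real tokenizers also split words into subword pieces).
    pieces = text.lower().split()
    # 2. Transformation + 3. Numerical assignment: map each piece to
    #    its unique numeric identifier; unknown pieces get 0 (<unk>).
    return [vocab.get(piece, 0) for piece in pieces]

vocab = {"passion": 645, "artificial": 123, "intelligence": 789}
print(tokenize("Artificial Intelligence", vocab))  # [123, 789]
```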
2.2 Processing
Search for visual patterns: the Gen AI then scours its visual data library, whose vastness varies by model, for relevant similarities and patterns it can use to create the requested image. DALL-E 3, for example, reportedly has about 12 billion parameters. Technologies such as GANs or CNNs (details below) can be used to identify and generate visual patterns from the data.
Refinement via iterative adjustments: the AI makes successive adjustments to the generated image to gradually increase accuracy and realism, in proportions that vary by Gen AI. Diffusion Models (as in Stable Diffusion) or Transformers (DALL-E 3) often come in at this point: the former smooth and detail the images; the latter adjust contextual coherence.
2.3 Delivery: Visual Generation
The Gen AI generates and refines details in the visual, guided by the prompt, seeking alignment and relevance. Transformers or Diffusion handle the final synthesis, ensuring the image aligns with the prompt's context.
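The whole 2.1–2.3 flow can be compressed into a toy end-to-end sketch: tokenize the prompt, start from noise, and iteratively refine toward a "visual". Every component below is an invented stand-in for the real trained networks, kept only to show the shape of the pipeline.

```python
import random

# Toy end-to-end pipeline: prompt -> tokens -> iterative refinement.
# The tokenizer, "latent", and refinement target are all illustrative
# stand-ins for the learned components of a real Gen AI.

def tokenize(prompt):
    # Deterministic toy token IDs (real models use learned vocabularies).
    return [sum(ord(c) for c in w) % 1000 for w in prompt.lower().split()]

def refine(latent, tokens, steps=20):
    # Iterative adjustment (section 2.2): each pass nudges the noisy
    # latent toward a target derived from the tokens, standing in for
    # the model's learned text-to-visual-pattern mapping.
    target = [(t % 256) / 255 for t in tokens]
    for _ in range(steps):
        latent = [0.7 * l + 0.3 * t for l, t in zip(latent, target)]
    return latent

def generate(prompt):
    tokens = tokenize(prompt)
    latent = [random.random() for _ in tokens]  # start from pure noise
    return refine(latent, tokens)

pixels = generate("futuristic car at dawn")
print(len(pixels))  # one "pixel" per prompt word in this toy setup
```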
Futuristic BMW: opting for a base architecture (GAN) whose realism is crucial to materialize still-intangible assets (image: Auto Discoveries)
3. Get to Know Gen AI Architectures
Let's now look at the variety of AI model architectures. They are the core of image generation, acting as the initial and main building blocks in development and execution. The best known are CNNs, Diffusion Models, GANs, and Transformers.
3.1 CNN (Convolutional Neural Networks)
CNNs are inspired by the visual cortex of animals. This is where the cats come in: in 1962, neurophysiologists Hubel and Wiesel showed that neurons in the feline visual cortex respond selectively to edges and contrasts. That's why CNNs are mainly used to recognize patterns and edges.
Explanation: Neural networks are so called because they mimic the brain, via artificial "neurons" that pass information among themselves across multiple layers. "Convolutional" comes from the Latin convolvere, "to roll together": these networks "roll" two mathematical functions together to generate a third.
Analogy: Imagine CNN as a series of filters in a camera that highlights edges and shapes to compose an image, just as our eyes do every time we look at something in the world.
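As a concrete illustration of that filtering, here is a bare-bones 2-D convolution in pure Python, applied to a tiny made-up "image" containing a vertical edge. The kernel values are illustrative; trained CNNs learn their kernels from data.

```python
# Bare-bones 2-D convolution: slide a small kernel over the image,
# multiply overlapping cells, and sum. The image and kernel values
# below are invented to show edge detection, not taken from any model.

def convolve2d(image, kernel):
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # "Rolling" two functions together: the discrete version of
            # the convolution that gives CNNs their name.
            row.append(sum(image[i + a][j + b] * kernel[a][b]
                           for a in range(kh) for b in range(kw)))
        out.append(row)
    return out

# A tiny image with a vertical edge: dark (0) left, bright (1) right.
image = [[0, 0, 1, 1]] * 4
edge_kernel = [[-1, 1]]  # fires where brightness jumps left-to-right
print(convolve2d(image, edge_kernel)[0])  # [0, 1, 0]: the edge found
```

The output is highest exactly where the dark-to-bright transition sits, which is all an edge-detecting filter does.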
3.2 Diffusion Models
Diffusion Models, such as the one behind Stable Diffusion, recover the image from noise that they themselves added to it, as part of a gradual refinement process. The effort to detect the image amid the noise improves the model's ability to detect, elaborate, and refine images, allowing both creative latitude and more precise adjustments.
Analogy: Think of Diffusion Models as a marble sculptor. They start with a block full of "brutality" (noise). Gradually chipping flakes off the block, hammer and chisel in hand, is a dynamic self-education for the Gen AI: through it, the model progressively refines its search for, and recovery of, the image.
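The noise-then-chip-away dynamic can be mimicked in a toy sketch. One big caveat: a real diffusion model trains a network to predict and remove the noise; below, an "oracle" denoiser that already knows the clean signal stands in for that trained network, purely to show the two processes.

```python
import random

# Toy sketch of the diffusion idea: noise is added step by step
# (forward process), then removed step by step (reverse process).
# CAVEAT: the denoiser below is an oracle that already knows the
# clean signal; in a real model, a trained network plays that role.

random.seed(0)

def add_noise(x, steps, sigma=0.5):
    # Forward process: corrupt the signal a little at each step.
    for _ in range(steps):
        x = [v + random.gauss(0, sigma) for v in x]
    return x

def denoise(x, clean, steps):
    # Reverse process: each step moves partway back toward the clean
    # signal, mimicking the gradual "chipping away" of the noise.
    for _ in range(steps):
        x = [0.5 * v + 0.5 * c for v, c in zip(x, clean)]
    return x

clean = [0.0, 1.0, 0.0, 1.0]   # the "image" hidden under the noise
noisy = add_noise(clean, steps=10)
restored = denoise(noisy, clean, steps=10)
error = max(abs(r - c) for r, c in zip(restored, clean))
print(error < 0.05)  # nearly all of the added noise has been removed
```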
3.3 Generative Adversarial Networks (GANs)
GANs consist of a pair of networks: a generator and a discriminator. They work in dispute: the generator creates images; the discriminator scrutinizes their authenticity. It is this "adversariality" between them that improves both reciprocally and thus produces images that aim, above all, at realism.
Analogy: Imagine a painter (generator network) trying to deceive an art critic (discriminator) with his works. The critic seeks to distinguish between real, valid works and fakes. This instigates the painter to continuously refine his technique; the same goes for the critic.
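The painter-versus-critic loop can be caricatured in one dimension. A real GAN trains two neural networks with backpropagation; here the "artwork" is a single number, the critic is a fixed closeness score, and a finite difference stands in for gradients; all of this is invented for illustration.

```python
# Toy 1-D caricature of the GAN dispute. The "artwork" is a single
# number, the critic a fixed closeness score, and a finite difference
# stands in for backpropagated gradients: all purely illustrative.

REAL_VALUE = 5.0  # the "authentic artwork" the generator must imitate

def discriminator(x):
    # Realism score in (0, 1]: 1.0 means indistinguishable from real.
    return 1.0 / (1.0 + abs(x - REAL_VALUE))

def train_generator(g, rounds=50, lr=0.4):
    for _ in range(rounds):
        # Probe which direction raises the realism score and step there
        # (a crude stand-in for gradient ascent against the critic).
        grad = discriminator(g + 0.01) - discriminator(g - 0.01)
        g += lr if grad > 0 else -lr
    return g

fake = train_generator(0.0)  # start far from anything realistic
print(abs(fake - REAL_VALUE) < 0.3)  # the forgery now hugs the real value
```

Each round, the "painter" adjusts in whichever direction fools the "critic" a little more, which is the reciprocal-improvement dynamic described above.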
3.4 Transformers
Transformers use attention mechanisms (which focus on specific parts of the input for better understanding) to process and generate data sequences. They are very effective at capturing complex contexts and long-range relationships: connections between distant elements, such as images distributed throughout a book yet not disconnected from one another.
Explanation: The name comes from their ability to transform input information into context-rich outputs. As for long-range relationships: when generating images for that book, if a character wears a necklace, the connection and consistency between distant images is obtained by inserting the necklace even in scenes that don't mention it.
Analogy: Think of Transformers as a book editor who highlights the most important parts of a text to summarize the complete plot, ensuring understanding and emphasis on the crucial parts.
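The attention mechanism itself fits in a few lines: each element scores itself against every other, and the scores (after a softmax) weight a blend of values. The 2-D vectors below are toy stand-ins for learned embeddings.

```python
import math

# Minimal scaled dot-product attention in pure Python. The vectors
# are toy values, not learned embeddings from any real model.

def softmax(scores):
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    dim = len(keys[0])
    out = []
    for q in queries:
        # Score this query against every key (scaled dot product).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
                  for k in keys]
        weights = softmax(scores)
        # Blend all values according to the attention weights.
        out.append([sum(w * v[d] for w, v in zip(weights, values))
                    for d in range(len(values[0]))])
    return out

# Three "tokens": the first and third are similar, so each attends
# strongly to the other despite the unrelated token between them.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
result = attention(x, x, x)
print([round(v, 2) for v in result[0]])  # [0.8, 0.2]
```

Note how the first token's output leans heavily toward its distant twin: that is the long-range consistency (the recurring necklace) in miniature.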
4. What Each Architecture Yields
Therefore,
- CNN: Ideal for precision in image recognition.
- Diffusion: Excellent balance between creativity and precision.
- GANs: Specialists in realism, with limited potential for creativity.
- Transformers: Excellent at capturing complex nuances and at contextual coherence.
Diffusion injected realism into Heinz's effort to exploit its universal familiarity in the "What's a Ketchup" campaign: 850 million impressions
5. Ranking: Six Gen AIs
Here I combine concrete experience (the nine-app test described above, full-time use of other Gen AI apps, plus a lot of research and certifications) with information provided directly by various Gen AIs, across a variety of profiles.
This information has been refined through exhaustive replies and checking, but it may still contain errors; if you spot any, please point them out.
Below I rank six of the better-known Gen AIs by their visual delivery profiles, as determined by their architectures, their strengths and weaknesses, creativity, precision, and the number of parameters used, when public:
DALL-E 3
Technology: Transformers, which are part of OpenAI’s GPT architecture.
Parameters: About 12 billion.
Strength: Creative, capable of generating imaginative visuals, sometimes with unexpected features that go beyond the prompt. Of the nine Gen AI apps I tested in 2023 (retesting some in 2024) for Immersera's Visual ID, only DALL-E delivered, and even then only in one of its two evaluated applications. Yet at the end of 2024, the same DALL-E 3 couldn't deliver me a Cyclops.
Weakness: Sometimes sacrifices precision for creativity, with unexpected results.
Midjourney V6
Technology: Mainly Diffusion, possibly combined with Transformer components.
Parameters: Not disclosed.
Strength: Performs well in textures and artistic styles. Flexible with prompts; suitable for varied needs.
Weakness: May not strictly adhere to prompts; sometimes delivers less relevant or even disappointing results.
Stable Diffusion 3.5
Technology: Diffusion, used to create detailed, high-quality images.
Parameters: About 8.1 billion in the Large version.
Strength: Balance between creativity and prompt fidelity. Open source model: allows extensive customization and community improvements.
Weakness: Requires more technical knowledge to maximize its potential.
Runway ML / Gen-3 Alpha
Technology: Offers a variety of Deep Learning models, including GANs and CNNs, to create images and videos.
Parameters: Not disclosed.
Strength: User-friendly interface designed for creatives and marketers.
Excellent for generating visuals quickly and efficiently. Seamlessly integrates into creative workflows.
Weakness: May lack the depth of features found in more complex systems.
DeepArt
Technology: CNNs, used to transform photos into artworks in the style of famous artists.
Parameters: Not disclosed.
Strength: Transforms existing images into artistic interpretations. Strong for branding purposes, allowing the integration of brand styles into visuals.
Weakness: Better for adaptations, i.e., limited in the ability to generate completely new visuals.
Artbreeder
Technology: GANs — allows mixing and modifying images to create visuals.
Parameters: Not disclosed.
Strength: Facilitates collaborative creation and image mixing, enhancing creativity. Users can directly manipulate images, leading to unique visual results.
Weakness: Its complexity and range of options can be hard to handle.
Nutella turns to CNN to help consumers customize jar labels: 7 million unique packaging designs using Gen AI
6. Focusing on Base Architecture: BMW etc.
Three cases of global icons demonstrate how focusing on a base architecture, and not on a Gen AI, determined the success of their campaigns:
1. BMW (GANs)
- Gen AI used: BMW adopted not a ready-made Gen AI but an architecture (GAN), which was then customized in its own design labs or in partnership with AI companies. As a bare framework, the GAN had to be trained with specific data, hyperparameters, feature engineering, etc. This confirms the centrality of architecture in Gen AIs.
- Technology used: Generative Adversarial Networks (GANs) were chosen precisely because they generate highly realistic images: ideal for concretely visualizing intangible assets such as futuristic projects.
- Campaign success: the use of GANs allowed the creation of innovative visuals, even more impactful because they were realistic. This captured consumers’ imagination with a tangible vision of the future.
- Impact: Although engagement metrics were not found, the campaign was noted for its innovative use of AI and its impact on brand perception.
2. Nutella (CNNs)
- Gen AI used: Not disclosed.
- Technology used: Convolutional Neural Networks (CNNs), for pattern recognition and generation of personalized designs.
- Campaign success: CNN-driven personalization allowed direct and individualized interaction with the consumer, resulting in a more engaging purchase experience.
- Impact: Nutella’s campaign created 7 million unique jar designs using AI. It was a success, selling out in one month.
3. Heinz Ketchup (Diffusion: Stable Diffusion)
- Technology used: Diffusion, which allows the creation of realistic images.
- Campaign success: Stable Diffusion’s ability to create highly detailed visuals consistent with the brand message generated significant engagement, which strengthened Heinz’s branding.
- Impact: Generated over 850 million global impressions, which yielded over 2500% return on media investment. Social media engagement was 38% higher than in previous campaigns.
Conclusion
I hope this grasp of the essentials of how Gen AIs generate images, synthesized in the main instance responsible for it (the base architecture) and in the step-by-step, helps you match tools with expectations, optimize time, and thus escape that Frustration equation. That's at the level of schedules, of the rational.
But there's another level, more difficult and solitary, even if you're part of a team: that of imagination, and therefore of risk. It includes the choice of the Gen AI itself, which implies betting on one architecture and not another and, therefore, on one delivery profile or another: one more realistic, another more creative, and so on.
This, added to the fragilities of Gen AIs, still an immature class of tools, echoes the solitary gesture of the Renaissance artist who handled brushes, mallets, chisels, and gouges until something nestled in their hand. Only then did they set off toward the block of marble, the canvas, whatever it was.
Something similar happens to us as we begin to understand a little more about the architectures of Gen AIs. Their different ingenuities, however technical they may seem, light a spark in us, as they did with BMW, amid the mind-boggling weaknesses of these models. That spark is vital in this era of 0.1-Sec-Positioning-or-Scrolling.
At Immersera, drawing on my verbal-visual synergy (I was Comms Head for three Government Ministers and a Comms Head consultant for UN-Women), on certifications including AI (USP and Exame), and on our network's expertise, we provide consultancy and mentoring on 0.1-Sec-Positioning, positioned content, SEO, branding, websites, cases, and crisis management.