The vector behind your prompt



What are weighted prompts?

When you set a weight in a prompt, e.g. '(banana:0.3)', you are scaling down the magnitude of the token vector's sine-wave components by a factor of 0.3.
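To make that concrete, here is a minimal sketch (in PyTorch, with made-up toy tensors) of what a UI does under the hood when it applies a weight: it scales the token's embedding vector before handing it to the rest of the pipeline.

```python
import torch

def scale_weighted_tokens(token_embeddings, weights):
    # token_embeddings: (num_tokens, dim) vectors from the encoder's embedding table
    # weights: one float per token; 1.0 = unweighted, 0.3 = '(word:0.3)'
    w = torch.tensor(weights).unsqueeze(-1)   # shape (num_tokens, 1)
    return token_embeddings * w               # shrinks each vector's magnitude

# toy example: 3 tokens of dimension 4; the middle token is '(banana:0.3)'
emb = torch.randn(3, 4)
weighted = scale_weighted_tokens(emb, [1.0, 0.3, 1.0])
```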

Weighted prompts are NOT part of the text encoder's intended design, and you are discouraged from relying on them for results!

A better strategy than using weighted prompts

Use the knowledge from this article that a prompt is a set of sine waves at descending frequencies, determined by each word's position in the prompt. Know that the better strategy is to instead REPEAT the key concepts at different positions in the prompt!

Repeating a word at different positions in the prompt ensures the 'sound wave' will carry the concept in both the low-frequency and the high-frequency sine-wave ranges.
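As a hypothetical illustration of the repetition strategy (the exact phrasing here is made up):

```python
# the key concept 'banana' appears early, in the middle, and at the end,
# so it is carried by both the low- and the high-frequency ranges
weak_prompt   = "a banana on a kitchen table, morning light"
strong_prompt = "a banana on a kitchen table, the banana in morning light, banana"
```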

Conversely, concepts that rarely appear together in the training data will blend more easily the closer they are placed to each other in the prompt.

A quirk of the T5 is that the blank space " " is a strong discriminator between concepts.

Removing the blank space " " separator between concepts increases the likelihood of a good concept blend in the output encoding, e.g. writing "carbananatree" instead of "car banana tree".
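You can check this yourself with the Hugging Face transformers tokenizer (a sketch; any T5 checkpoint's tokenizer will do, here I assume google/t5-v1_1-xxl):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("google/t5-v1_1-xxl")
print(tok.tokenize("car banana tree"))   # three cleanly separated concepts
print(tok.tokenize("carbananatree"))     # subword pieces that blend the concepts
```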

What is Guidance? And what is a negative prompt?

The negative prompt encoding 'neg_prompt' is subtracted from the final positive prompt encoding vector 'pos_prompt'

using the equation

conditional = guidance_scale * pos_prompt - neg_prompt

Most ComfyUI setups use the CFG system shown on the left side of this diagram.

The guidance scale 'alpha' is drawn as the symbol Phi in this diagram.

However, there is an unseen parameter here called the 'CFG parameter', which sets the ratio between conditional generation ('make this thing!') and unconditional generation ('fill in the gaps in this image based on adjacent pixels!').

If the CFG parameter is called 'x', then the AI model's output for our prompt, 'result', will be

result = (1+x) * unconditional + x * conditional
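Restating the two equations above as code (a sketch of this article's formulation only; real CFG implementations differ in details such as where the subtraction and the scaling happen):

```python
import torch

def combine_prompts(pos_prompt, neg_prompt, guidance_scale):
    # the article's equation: conditional = guidance_scale * pos_prompt - neg_prompt
    return guidance_scale * pos_prompt - neg_prompt

def mix_cfg(unconditional, conditional, x):
    # the article's equation: result = (1+x) * unconditional + x * conditional
    return (1 + x) * unconditional + x * conditional

# toy tensors standing in for real prompt encodings and model outputs
conditional = combine_prompts(torch.randn(77, 768), torch.randn(77, 768), 7.5)
result = mix_cfg(torch.randn(4, 64, 64), torch.randn(4, 64, 64), 0.7)
```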

The vector 'result' is decoded by the variational autoencoder (VAE) into the 'desired image'.

Strictly speaking, the 'Variational' part of the variational autoencoder describes how it was trained; the different image outputs you get for different seeds come from the random starting noise that the seed selects.
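For reference, a minimal decode sketch with the diffusers library (using a Stable Diffusion VAE checkpoint as an example; FLUX and Chroma ship their own VAEs with different latent shapes):

```python
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")
latents = torch.randn(1, 4, 64, 64)      # a random latent standing in for 'result'
with torch.no_grad():
    image = vae.decode(latents).sample   # (1, 3, 512, 512) RGB tensor
```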

This 'desired image' is then recreated by the sampler (actually an algorithm for solving a differential equation) over N steps.

For example, here is the Heun sampler's differential equation solver: https://en.m.wikipedia.org/wiki/Heun%27s_method
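A single Heun step, following the Wikipedia description (a generic ODE sketch, not the exact code any particular sampler ships):

```python
def heun_step(f, t, y, h):
    # Heun's method: an Euler 'predictor' followed by a trapezoidal 'corrector'
    k1 = f(t, y)                     # slope at the current point
    y_euler = y + h * k1             # Euler prediction one step ahead
    k2 = f(t + h, y_euler)           # slope at the predicted point
    return y + h * (k1 + k2) / 2.0   # average the two slopes

# example: solve y' = -y from y(0) = 1 over 10 steps
y, t, h = 1.0, 0.0, 0.1
for _ in range(10):
    y = heun_step(lambda t, y: -y, t, y, h)
    t += h
```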

More generation steps = better result, right?

In this example I run Chroma Heun, https://tensor.art/models/895078034153975868?source_id=njuypVDrnECwpvAqYH718xIg , at 5 steps, 10 steps and 15 steps.

While this demonstration is not conclusive, for Chroma models in particular the 'aesthetic quality' improves at LOWER step counts, with the ideal at ~10 steps (0.4 Tensor credits per image).

However, text legibility is better at 15 steps (0.6 Tensor credits per image).

The composition of the image is already settled at 5 steps (0.2 Tensor credits per image).

What is a prompt actually?

Your prompt is actually a sound wave. The prompt is several sound waves in fact, but let's assume it is a single 'noise' built from a set of sine waves at different amplitudes.

What decides the amplitude of the sine waves? The token vector sets the amplitude.

What decides the frequencies of the sine waves? The position the word has in the prompt!

A token vector is like a letter in the alphabet A-Z, where each letter has its own amplitude characteristics.

Except this alphabet has been expanded to include common English words, making the 'alphabet' about ~50K components in size.

For example, the word 'banana' is its own token vector in this token-vector alphabet, as are 'blueberry', the letter 'A', the number '8', and the emoji '👍'.

You can test how your text is tokenized into token vectors here:

You can browse the vocab.json token vector 'alphabet' here: https://huggingface.co/black-forest-labs/FLUX.1-dev/blob/main/tokenizer/vocab.json

Hence every sentence or word combination you can think of is converted into its respective 'sound wave'.
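A sketch of that conversion with the transformers library (assuming the google/t5-v1_1-xxl checkpoint; the encoder output is the per-token vector sequence the diffusion model consumes):

```python
import torch
from transformers import AutoTokenizer, T5EncoderModel

tok = AutoTokenizer.from_pretrained("google/t5-v1_1-xxl")
enc = T5EncoderModel.from_pretrained("google/t5-v1_1-xxl")

inputs = tok("car banana tree", return_tensors="pt")
with torch.no_grad():
    encoding = enc(**inputs).last_hidden_state   # (1, num_tokens, 4096)
print(encoding.shape)
```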

To determine the frequencies of the sine waves used in the positional encodings for the T5 (Text-to-Text Transfer Transformer) model, we need to examine how positional encodings are constructed in transformer models, focusing on the sinusoidal positional encoding scheme commonly used in transformer architectures.

Positional Encoding in T5

The T5 model, like many transformer-based models, uses positional encodings to incorporate the position of each token in the input sequence, as transformers do not inherently capture sequence order.

For sinusoidal positional encodings, the frequencies of the sine (and cosine) waves are determined by a formula that assigns a unique encoding to each position in the sequence based on its position index and the dimensionality of the embedding.

Sinusoidal Positional Encoding Formula
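For a token at position pos and embedding dimension pair i (out of d_model dimensions), the standard transformer formula ('Attention Is All You Need') is:

PE(pos, 2i) = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))

Each dimension pair i therefore oscillates at angular frequency 1 / 10000^(2i / d_model) as pos moves through the prompt: the first dimension pairs oscillate quickly, the last ones very slowly.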

T5-Specific Parameters
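A caveat: the original T5 actually replaces these fixed sinusoids with learned relative position biases, so the formula above is the generic transformer scheme rather than something you will find verbatim in the T5 weights. For a rough calculation we can still plug in the parameters of the T5-XXL encoder used by FLUX: an embedding width of d_model = 4096 and a typical maximum input length of 512 tokens.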

Calculating the Frequencies
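Plugging those assumed parameters into the formula (a sketch; d_model = 4096 is the T5-XXL width from the previous section):

```python
import math

d_model = 4096   # assumed embedding width (T5-XXL)

def pe_frequency(i, d_model=d_model):
    # angular frequency of sine/cosine pair i in the standard scheme
    return 1.0 / (10000 ** (2 * i / d_model))

for i in [0, 1, 512, 2047]:
    f = pe_frequency(i)
    print(f"pair {i:4d}: frequency {f:.3e}, wavelength {2 * math.pi / f:9.1f} positions")
```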

Summary of frequency of sine wave to positional encoding
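In short: the fastest sine wave (dimension pair 0) completes a full cycle every 2*pi ≈ 6.3 token positions, and the frequencies fall off geometrically from there, approaching a wavelength of 10000 * 2*pi ≈ 62,832 positions for the slowest pair. Early dimensions therefore encode fine positional detail, while late dimensions encode only coarse position, which is why nearby words share most of their positional signal.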
