Adcom's Profile

Difference between embedding and LoRa

Embeddings is what you actually feed to the AI models , its a buncha numbersNormally the tokenizer converts the prompt text to an embeddingBut its possible to keep embeddings as .safetensor files . Being numbers , they do not have to correspond to written textOne can train by adjusting an embedding to create specific images for a given AI model. This is called textual inversionEmbeddings are not used alot with the natural language models , but were popular for SD1.5 and SDXL / Illustrious modelsIf the image output is y , the the checkpoint is like a square matrix A , multiply a matrix with a vector x , and you get another vector y , which is the output Ax = y The embedding is the vector x you multiply the matrix with.The LoRa , is a matrix delta_A which you add to the matrix A in the model so with LoRa , its (A + delta_A) * x = y//---//LoRas are more popular , as the prompt x can be anything you want it to be//---//With embeddings , one can combine it into a written prompt , so if the keyword to trigger the embedding is 'beckham' , them I can write in prompt 'a beckham on a football field'The 'beckham' is no longer a single token word , but is replaced by the 8x716 embedding , or however large it isEmbeddings are usually 8-16 token positions long//---//In summary , you can use LoRa and embeddings in combination . They act separately on the output

Adcom

1

The Attention Mechanism , for use in T2i and Qwen Edit

If you have 2 images to use as reference for Qwen Edit , write stuff in prompt that will put attention on the target imagePrompt attention applies to reference images but also applies to images created from pure noise , as the generation happens over multiple steps. Since in T2i you start from pure noise (no stuff in image to reference) using terms like 'left' , 'right' , 'up' or 'down' will still place attention in the cardinal directions of the image. Here are more examples of prompt attention.//---//Back to more practical matters , consider we wish to merge these two images with character replacement using Qwen EditImage 1To be placed in image 2//---//So low frequency noise = remove white background , she has a blue bodysuit , and she has yellow shoes , she has long hairHigh frequency noise = she sits by a car in a parking lot at nightAnd usually top off with ' she has light hair 'so prompt is ``` remove white background , she has a blue bodysuit , and she has yellow shoes , she has long hair , she sits by a car on the right in a parking lot , the sky is dark , she has light hair ```You can set as 512x1024 for 0.9 credits cost at 8 stepsCosts twice as much otherwise at 1.8 creditsSo not great initially cuz I didnt describe the phonk girl by the car enuff in promptBut that can be fixed by straight up captioning the phonk girl oarking lot image and appending it to the end to bring more attention to it``` the night sky is black , parking lot , remove white background , she has a blue bodysuit , and she has yellow shoes , she has long hair , she sits by a car on the right in a parking lot , the sky is dark , standing next to a sleek, dark-colored sports car. The car has a distinctive front grille and is parked in a parking lot at night, illuminated by streetlights. The background features several other cars parked in a lot, with some trees and bushes visible. The person appears to be posing for the camera, with one hand resting on the hood of the car. she has light hair ```Now I brought too much attention the the parking lot imageLikely cuz I decided to start with ``` the night sky is black ``` which brought attention to the girl too as she is also dressed in blackFinal prompt after tinkering can be``` remove white background , she has a blue bodysuit , she has yellow shoes , she has light hair , she sits by a car on the right in a parking lot , the sky is dark , standing next to a sleek, dark-colored sports car. The car has a distinctive front grille and is parked in a parking lot at night, illuminated by streetlights. The background features several other cars parked in a lot, with some trees and bushes visible. she has a blue bodysuit , she has yellow shoes , The person appears to be posing for the camera, with one hand resting on the hood of the car. she has light hair ```Post: https://tensor.art/images/954799554404853126?post_id=954799012165218733This Qwen Edit LoRa is optimized for 3 image merging: https://tensor.art/models/954755878882418264Thats it for the tutorial on using prompt attention for Qwen edit generation./Cheers!

Adcom

1

Qwen Edit converts image to latents. Use it for LoRa training

I realized that Qwen Edit acts upon the latent instead of the image which is awesum for lora training. If you have a cool image at some weird resolution like 673x512 then you can rescale it as a 1024x1024 frame at full resolutionLike tron sirens for example from youtubeFrames are pretty bad for lora ut the VAE can convert it to a latent and it cares very little with regards to resolution or background contrastDo frame extract , and caption each image and run the latent thru Qwen editRaw result:The image turn to latent by the VAE seems to inherit some of the youtube blur. Lets see how things work is we just screenshot the yt video itself on phone , going from 600x500 size to 1024x2000 size (but same blur)Screenshot a segment of the youtube video as patterns for latent encoding are always relative to image dimensions. So projecting rectangle => square image via VAE is tricky , better try to use aquare ish shapes as referenceScreenshot and recreated 1024x1024 belowBenefits: more color contrast , less artifacting , and you can crop out desirable patterns for training into a collage as desired.Post: https://tensor.art/images/946858389655272002?post_id=946858359590500831When using multiple images as reference it helps to know that the prompt has location in imageSo if you have two images , if you want the output to focus on a specific image , describe the backround or objects around itUnlike text to image , Qwen Edit uses an existing image , so you need not describe anything at all , just where in the image it existse.g ``` the girl is underwater at the bottom. the water is blue. the top is a building. blue ceramic tiles are at the bottom. she has light hair and she holds her hands at either side , the left side is a green forest ```So its all the stuff you want in the final image , but you always specify a locationIts useful to use reference photos for Qwen Edit with colored backgrounds , that way you can just say 'the background is green' in the prompt to specify which image you want to focus onUse case examples for Qwen edit https://huggingface.co/blog/MonsterMMORPG/qwen-image-edit-full-tutorial-26-different-demoFor outfits I can highly reconmend using some 3DCG artist paired up with a real photoImage 1Image 2So if I want a photoreal image of the girl holding the weapon ``` give her a black dress with bare shoulders , her face is from the side , she has blonde hair on her left side , her skin should be like her bare arms and legs , she is holding a rifle with a scope , the floor on the right has a black and white square pattern , the rifle has an american flag on it , the floor has grated holes , the red carpet on both the left and the right side , she is leaning against a marble pillar , the top left has people in the background , the top right is a car and a cityscape ```As long as enuff stuff matches the photoreal image , that will be the reference used. I don't have to describe the character at all , just the environmentThe 'left and right' wordings works thanks to the attention stuffTaking the best parts of image 1 and image 2 (your choice what to choose ofc)Post : https://tensor.art/images/950083206793658962?post_id=950083189613789685Cheers!

Adcom

2

Collage training for Chroma and Illustrious

Consider this;  if you have 100 training images , why are there not 100 image outputs for every epoch when training the lora , to match against the 'target' training image? Reason: LoRa training is done entirely in latent space.The training image us converted to a vector using Variational Auto Encoder , the VAE.Have you done reverse image search? Reverse image search also converts the input image to its latent representation.Try doing a reverse image search on composite of two images  , i.e two images side by side like a woman in a dress and a sunflower.Results are images with dresses , and images with sunflowers , or a mix inbetween (if such images exist) Conclusion:  The VAE representation can hold two images at once , or more.  By using composites in a 1024x1024 frame you can train on two images at once.   However , when putting two images in a single 1024x1024 frame the learned pixel pattern will be relative to the image bounds. Example :   single full body person in 1024x1024 image  takes up the full 1024 pixel height.Put two people next to one another in the 1024x1024 frame , and both people will still take up the full 1024 pixel height.Put 4 people in a 1024 x 1024 frame in a grid ,  and each person takes up half the image size at 512 pixel height. The AI model cannot scale up or down trained pixel patterns relative to image dimensions.If you want image output to only be full length people , ensure the trained patterns are the full 1024 pixel pattern height.Granted;  the same principle applies in the x-axis.If you have a landscape photo ,  and the pixel pattern has a pleasant composition along the x-axis  ,  then you can place two landscape photos on top of one another to train the horizontal pattern i.e  2 landscape images each 1024x512 in size to build the 1024x1024 frame.Verify by doing reverse image search on the frame. Try doing a reverse image search on blurry images versus high resolution images.You will find that blurry images are added to VAE but only up to a certain point.One cannot fit more pixels into a 1024x1024 frame than what already exists.You will find that based on the reverse image results how much the image can impact the latent representation.Why can an AI model create images that are not 1:1 to its training data?How come when you prompt a sword with AI , it sticks out at both ends of the handle?Reason:  The AI model learns localized patterns.  Unconditional prompting.The AI model also learns to associate patterns with text.  Conditional prompting.The input X to the AI model is a mixed ratio of conditional prompting and unconditional prompting set by the CFGGiven as X = X_unconditional (1-CFG)  + x_conditional CFGYou can train the lora so that the model learns purely from unconditional prompting by not having any caption text at all.Or , you can make the model learn conditional prompting that describes all the pleasant looking stuff in the training images you have.What is a prompt?  The prompt text is also transformed into an encoding using the text encoder.This is done by converting each common word  or common word segment of your prompt into a numerical vector.    For example;  CLIP_L has dimension 768 and the batch size is 75 tokens (excluding the 2 delimiter tokens at the start end of the encoding , the real batch size is actually 77) .So any text you write in CLIP less than 75 'words'  in length can be expressed as a 75x768 matrixThis 75x768 matrix is then expressed as a 1x768 text encoding. How is this done?   Lets look at a single element , a 1x75 part of the text encoding.Each of these 75 positions are a sine wave at fixed frequencies ,  75 fixed frequencies in total in descending order. The frequencies are alternating ,  so all the even positions have +0 degrees offset and all the odd positins have +90 degrees offset. The token vector element sets the amplitude of the sine waves. What is a soundwave?   It is a sum of sinewaves with different frequencies at a given amplitude. Ergo:  Your 1x75 element row is a soundwave. The 1x768 text encoding are all the 768 1x75  soundwaves played at once.  The text encoding is a soundwave.The way the text in your prompt impacts the text_encoding , is analgous to components within soundwaves like music.How to make stuff in music more prominent?First method , at a given freqency magnify the amplitude of the noise.This is how weights work , they magnify the token vector by a given factor  , e.g  (banana : 1.3)   is the token vector for banana multipled by the factor 1.3   , and consequentially the amplitude of the soundwave at whichever position banana is locates at will be amplified as well The second method to engance sound presence is to reoeat it at different frequencies.You know that sound with closely matching frequencies will interefere with one another.But sound at low frequency and the sane sound played at high frequency is harmonious.Ergo;   to enhance presence of a concept in a prompt you can either magnify it with weights  or you can repeat the exact same word or phrase further down in the batch encoding.How does this relate to captioning in LoRa training?  If you want the conditional prompting training to focus on a specific thing in the image , repeating a description at different section in the prompt is good.This is especially useful in natural langauge text encoder with a large batch size if 512 tokens.This also means that as long as the 'vibe' of the captioned text matches whats in the image , the LoRa effects will trigger on prompts close to that 'vibe' as well.It really is up to how you plan on using the lora with the AI model and what prompts you generally use that decides the captioning.Third part. Have you noticed how AI models can create realistic depictions of anime characters or anime depictions or real celebrities?The AI model is built like a car factory ,  that has a conveyor belt on one end ,  multiple stations within the factory that assembles stuff , and the stuff that pops out on the other side of the conveyor belt is some kind of car.You can throw absolutely anything onto the conveyor belt at the stations will turn it into a car.  A tin can , a wrench , a banana or something else.The stuff you put on the conveyor belt is the prompt.The stations are the layers in the AI model.You will find that each layer in the AI model is responsible for one task to create the 'car'  ie the ginished image.One layer can set the general outline of shapes in the image.Another layer might add all the red pixels to the image.A third layer might add shadows.A fourth might add grain effects or reflective surfaces.  It all depends on AI model but all if these layers are usually very 'task specific'So when training a lora , you are actually training all if these stations in the car factory , seperately , to build the 'car'  the image.Shape matters the most prior to creating an image concept. There it is well advised to have a clear contrast between all relevant shaoes in the lora training images.A woman against a beige wall is a poor choice , since human skin blends well into beige and white surfaces.But a woman against a blue surface that clearly contrasts the shape is excellent.Consider that you can create AI images from existing training imagesThat means that the AI model learns patterns from the training image , and uses those different patterns it knows to make new imagesIf you zoom in on the pixel pattern in any AI image you can find jpeg artifacts withinthe AI image , these normally only appear at the edges in normal imagesSuffice to say this method works 😀Practically , if you make a sword in illustrious , you see how the sword sometimes goes out both ends?The CFG parameter in the AI model is a blend between what you prompt , and 'adjecent pixels in the image'The patterns which the AI model learns are 'localized' within the training imageSo the reason why the collage method works is thanks to unconditional prompting parameterThe prompt you actually feed the AI model isnt purely the text promptAsk GROK on Twitterwhat is the relationship between cfg conditional prompting and unconditional prompting?The relation is X = CFG x_conditional + (1-CFG) x_unconditionalThe x_unconditional uses the image created thus far as the input argumentSo it 'fills in patterns' where it is likely for those patterns to beBut the gist is that this relationship is all about pixel to pixel adjecencySo location in the training image for specific pattern don't matterThe collages is like a difficult math problem you task the AI model to solve , so it can become more adept at solving easier math problems later onSo the AI model will usually never be able to recreate the collage training images 1:1 , but the AI model will become very adept at recreating the patterns within the training image in the attemptThe tool to build collages doesn't matterThe most important thing to know is that the trained pattern will always be relative to the imageSo ideally one should have at least one pattern that goes end to end in the imageExample image hereThe second thing to know is that AI images learn patterns that have good contrast to backgroundimageThe 'shape layer' as I call itBenefit of collage training that you can easily crop out bad patterns. Condensing the set to only good patterns.One more thing; if the background of the collage is #181818 gray it will perfectly match the gray background on TensorThis creates a cool 3D effect in the gallery 😀Syntax has examples too in illustriousI always link this video as a source if you wanna know absolutely all of the theory. Its from the SD1.5 days but still is true for present models https://youtu.be/sFztPP9qPRc?si=B_B353yktSpeKeicThis one has a lot of nonsense overly dense information but the gradient illustration is very cool https://youtu.be/NrO20Jb-hy0?si=6us5FRM7qhmD_auHCheers!

Adcom

5

Chroma Models - The Stuff I Know

Chroma is a finetune of FLUX Schnell created by Lodestones.HF Repo: https://huggingface.co/lodestones/ChromaAllegedly a technical report on Chroma may be released in the future.Don't hold your breath though. From my own personal experience and others , Lodestone is not keen on explaining exactly what he is doing or what he is planning on doing with the Chroma models.TLDR; There is no documentation for Chroma. We just have to figure it out ourselves. I'm writing this guide despite having nothing close to factual information with regards to exact training data , recommended use and background information on the Chroma models.Aside from the total lack of documentation , the Chroma models are an excellent upgrade to base FLUX model and Lodestones deserve full credit for his efforts.The cost for training the Chroma models is allegedly (at present) over 200K USD for running the GPU. Model Architecture for ChromaKey feature is that this model has been pruned from the FLUX Schnell model , i.e the architecture is different.The Keys for the .safetensor file of FLUX Dev fp8 (B) , FLUX KREA (K) and FLUX Chroma (C)As such , don't expect good results from running FLUX Dev Trained LoRa on Chroma.Another minor changes in archetecture include the removal of the CLIP_L encoder. Chroma relies solely on the T5 encoder.Architecture (Sub-models)The Chroma models have different versionsChroma V49 - The latest trained checkpoint on Chroma. It has undergone 'high resolution training'. Unlike V48 , it is assumed Chroma V49 has undergone 'hi-res training' like the V50 , but not confirmed due to lack of documentation. https://tensor.art/models/895059076168345887Chroma V50 Annealed - A checkpoint merge for the last 10 Chroma checkpoints V49-V39 , from which it has undergone 'hi res training'. https://tensor.art/models/895041239169116458'Annealed' I have been told on Discord means that the model has undergone one final round of training through all 5 million images in training data at a very low learning rate. Plans are to make the V50 Annealed the 'official' FLUX Chroma model under the name 'Chroma1-HD'Chroma V50 - A bulls1t checkpoint merge created to secure funding for training the other checkpoint models. Don't use it.Chroma V50 Heun - An 'accidental' checkpoint off-shoot that arose when training the Chroma model. It works surprisingly well for photorealism at 'Heun' or 'Euler' sampler with 'Beta' Scheduler at 10 steps 1 CFG , hence the model name. https://tensor.art/models/895078034153975868Chroma V46 Flash - Another 'accidental' offshoot in training that boasts the highest stability in output of all the Chroma checkpoints. Try running at Euler Sampler with SGM Uniform sampler at 10 Steps , 1 CFG. An excellent model! https://tensor.art/models/889032308265331973What model should I use for LoRa training? Either V49 or V50 Annealed are excellent choices in my opinion.The V49 and V50 Annealed models can both run at 10 steps with Beta Scheduler at CFG = 1 and Guidance Scale = 5 , at the cost of 0.4 credits per image generation here on Tensor. TrainingThe Chroma model can do anime , furry and photorealistic content alike , including NSFW , using both natural language captions and danbooru tags.The training data has been captioned using Google Gemma 12B model. A repo assembled by me has a collection of training text-image pairs used to train Chroma , which are stored as parquet files accessible via Jupyter Notebook in the same repo:https://huggingface.co/datasets/codeShare/chroma_prompts/blob/main/parquet_explorer.ipynbYou'll need to download parquet to your Google Drive to read the prompts:Example output from the E621 setLodestones repo's (⬆️ items from these sets are included in my Chroma prompts repo for ease of use) :https://huggingface.co/datasets/lodestones/pixelprosehttps://huggingface.co/datasets/lodestones/e621-captions/tree/mainTip; Ask GROK on Twitter for Google Colab code to read items in these sets.//---//The Redcaps datasetA peculiar thing is that Chroma is trained on the redcaps dataset redcaps.xyzThese are text - image pairs where the image is a image found on reddit and the trxt prompt is the title of the reddit post!If you want to have fun time prompting Chroma; copy paste a reddit title either off the redcaps.xyz page , or from the chroma_prompts repo parquet files , and see for yourself.Example of a redcaps prompt:I found this blue thing in my backyard. Can someone tell me what it is? The 'Aesthetic' tagsThe pixelprose dataset used to train Chroma has an 'aesthetic' score assigned to each image as a float valueThis value has been rounded down as 'aesthetic 1, aesthetic 2, .... , aesthetic 10'Additionally , all AI images used to train Chroma has been tagged as 'aesthetic 11'(more later)Anime testPromptwhat is the aesthetic 0 style type of art? anime screencap with a title in red text Fox-like girl holding a wrench and a knife, dressed in futuristic armor, looking fierce with yellow eyes. Her outfit is a dark green cropped jacket and a skirt-like bottom. \: title the aesthetic 0 style poster "Aesthetic ZERO" CaptioningGemma 12B model was used to caption Chroma prompts , however this model dies not run on free tier T4 Colab GPU like the well established Joycaptions.To mitigate this ; I'm training the Gemma 4B model to specialize in captioning images in the same format as the Chroma training data.More info on the project here: https://huggingface.co/codeShare/flux_chroma_image_captionerFinding PromptsI recommend you visit the AI generator at perchance.org for Chroma prompts. They have had the Chroma model for their T2i generator for awhile and there are lots of users posting to the galleries.Its hard to browse old posts on perchance so it will do you well to 'rescue' some prompts and post them here to Tensor Art.ResolutionsRefer to the standard values for Chroma and SDXL models//---//

Adcom

6

4

Adcom

Models

Qwen 2511 Image Edit AIO Rapid-V19-NSFW

LightingRemap Qwen-Edit-2511-Alpha0.2

ICEdit LoRa (Qwen Edit 2511)-model

[BASE] Anypose Qwen 2511 imade edit-base

[HELPER] Anypose Qwen 2511 Image Edit-helper

Tellurion-e101

Grunge Maids -r1

Made in Abyss-e52

Claymore-e52

Tsutomo Nihei-e81

Kuvshinov w. scifi manga (Chroma HD trained)-e49

40K Painterly Style-40K

Workflows

From To Steps

From to Steps

Articles

Difference between embedding and LoRa

The Attention Mechanism , for use in T2i and Qwen Edit

Qwen Edit converts image to latents. Use it for LoRa training

Collage training for Chroma and Illustrious

Chroma Models - The Stuff I Know

Posts