Adcom

🤘All LoRas free to download
Discord for DMs : discord.gg/exBKyyrbtG

Articles

Collage training for Chroma and Illustrious

Consider this: if you have 100 training images, why are there not 100 image outputs for every epoch when training the LoRa, to match against the 'target' training image?

Reason: LoRa training is done entirely in latent space. The training image is converted to a latent vector using the Variational Auto Encoder, the VAE.

Have you done a reverse image search? Reverse image search also converts the input image to its latent representation. Try doing a reverse image search on a composite of two images, i.e. two images side by side like a woman in a dress and a sunflower. The results are images with dresses, images with sunflowers, or a mix in between (if such images exist).

Conclusion: the VAE representation can hold two images at once, or more. By using composites in a 1024x1024 frame you can train on two images at once.

However, when putting two images in a single 1024x1024 frame, the learned pixel pattern will be relative to the image bounds. Example: a single full-body person in a 1024x1024 image takes up the full 1024 pixel height. Put two people next to one another in the 1024x1024 frame, and both people will still take up the full 1024 pixel height. Put 4 people in a 1024x1024 frame in a grid, and each person takes up half the image size at 512 pixel height. The AI model cannot scale trained pixel patterns up or down relative to the image dimensions. If you want the image output to only be full-length people, ensure the trained patterns span the full 1024 pixel height.

Granted, the same principle applies on the x-axis. If you have a landscape photo, and the pixel pattern has a pleasant composition along the x-axis, then you can place two landscape photos on top of one another to train the horizontal pattern, i.e. two landscape images each 1024x512 in size to build the 1024x1024 frame. Verify by doing a reverse image search on the frame.

Try doing a reverse image search on blurry images versus high resolution images. You will find that blurry images are added to the VAE representation, but only up to a certain point. One cannot fit more pixels into a 1024x1024 frame than what already exists. Based on the reverse image search results you can gauge how much the image can impact the latent representation.

Why can an AI model create images that are not 1:1 to its training data? How come when you prompt a sword with AI, it sticks out at both ends of the handle?

Reason: the AI model learns localized patterns. Unconditional prompting. The AI model also learns to associate patterns with text. Conditional prompting. The input X to the AI model is a mixed ratio of conditional prompting and unconditional prompting set by the CFG, given as

X = (1 - CFG) * X_unconditional + CFG * X_conditional

You can train the LoRa so that the model learns purely from unconditional prompting by not having any caption text at all. Or, you can make the model learn conditional prompting that describes all the pleasant looking stuff in the training images you have.

What is a prompt? The prompt text is also transformed into an encoding using the text encoder. This is done by converting each common word, or common word segment, of your prompt into a numerical vector. For example, CLIP_L has dimension 768 and the batch size is 75 tokens (excluding the 2 delimiter tokens at the start and end of the encoding; the real batch size is actually 77). So any text you write in CLIP less than 75 'words' in length can be expressed as a 75x768 matrix. This 75x768 matrix is then expressed as a 1x768 text encoding. How is this done?
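Before the answer, here is a minimal sketch of those shapes in practice, assuming the Hugging Face transformers library and the openai/clip-vit-large-patch14 checkpoint as a stand-in for the CLIP_L encoder:

```python
# Sketch: inspecting the CLIP_L text-encoding shapes described above.
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

prompt = "a woman in a dress standing next to a sunflower"
tokens = tokenizer(prompt, padding="max_length", max_length=77, return_tensors="pt")

with torch.no_grad():
    out = text_encoder(**tokens)

# 77 token positions (75 usable + 2 delimiters), each a 768-dim vector:
print(out.last_hidden_state.shape)  # torch.Size([1, 77, 768])
# Pooled 1x768 summary of the whole prompt:
print(out.pooler_output.shape)      # torch.Size([1, 768])
```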
Let's look at a single element, a 1x75 part of the text encoding. Each of these 75 positions is a sine wave at a fixed frequency, 75 fixed frequencies in total in descending order. The frequencies alternate, so all the even positions have a +0 degree offset and all the odd positions have a +90 degree offset. The token vector element sets the amplitude of the sine waves.

What is a soundwave? It is a sum of sine waves at different frequencies with given amplitudes. Ergo: your 1x75 element row is a soundwave. The 1x768 text encoding is all 768 of these 1x75 soundwaves played at once. The text encoding is a soundwave. The way the text in your prompt impacts the text encoding is analogous to components within soundwaves, like music.

How do you make stuff in music more prominent?

First method: at a given frequency, magnify the amplitude. This is how weights work; they magnify the token vector by a given factor. E.g. (banana:1.3) is the token vector for banana multiplied by the factor 1.3, and consequently the amplitude of the soundwave at whichever position banana is located will be amplified as well.

The second method to enhance sound presence is to repeat it at different frequencies. You know that sounds with closely matching frequencies will interfere with one another. But a sound at a low frequency and the same sound played at a high frequency is harmonious. Ergo, to enhance the presence of a concept in a prompt you can either magnify it with weights or you can repeat the exact same word or phrase further down in the batch encoding.

How does this relate to captioning in LoRa training? If you want the conditional prompting training to focus on a specific thing in the image, repeating a description at different sections of the prompt is good. This is especially useful in a natural language text encoder with a large batch size of 512 tokens. This also means that as long as the 'vibe' of the captioned text matches what's in the image, the LoRa effects will trigger on prompts close to that 'vibe' as well. It really is up to how you plan on using the LoRa with the AI model, and what prompts you generally use, that decides the captioning.

Third part. Have you noticed how AI models can create realistic depictions of anime characters, or anime depictions of real celebrities? The AI model is built like a car factory: there is a conveyor belt on one end, multiple stations within the factory that assemble stuff, and the stuff that pops out on the other side of the conveyor belt is some kind of car. You can throw absolutely anything onto the conveyor belt and the stations will turn it into a car: a tin can, a wrench, a banana or something else. The stuff you put on the conveyor belt is the prompt. The stations are the layers in the AI model.

You will find that each layer in the AI model is responsible for one task to create the 'car', i.e. the finished image. One layer can set the general outline of shapes in the image. Another layer might add all the red pixels to the image. A third layer might add shadows. A fourth might add grain effects or reflective surfaces. It all depends on the AI model, but all of these layers are usually very 'task specific'. So when training a LoRa, you are actually training all of these stations in the car factory, separately, to build the 'car', the image. Shape matters the most prior to creating an image concept.
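As a rough illustration of the weighting idea above (not the exact code of any particular UI), this sketch scales the embedding rows of a weighted token by the given factor; the token position index is hypothetical:

```python
# Sketch: (banana:1.3)-style weighting as amplitude scaling of the token vector.
import torch

# Pretend 77x768 prompt encoding (batch of 1), e.g. from the CLIP sketch above.
prompt_embedding = torch.randn(1, 77, 768)

def apply_weight(embedding: torch.Tensor, token_positions: list[int], weight: float) -> torch.Tensor:
    """Multiply the vectors at the given token positions by the weight factor."""
    weighted = embedding.clone()
    weighted[:, token_positions, :] *= weight   # amplify the 'amplitude' of those rows
    return weighted

# Suppose 'banana' landed at token position 5 in the encoding (hypothetical index):
weighted_embedding = apply_weight(prompt_embedding, [5], 1.3)
```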
Therefore it is well advised to have a clear contrast between all relevant shapes in the LoRa training images. A woman against a beige wall is a poor choice, since human skin blends well into beige and white surfaces. But a woman against a blue surface that clearly contrasts the shape is excellent.

Consider that you can create AI images from existing training images. That means the AI model learns patterns from the training image, and uses those different patterns it knows to make new images. If you zoom in on the pixel pattern in any AI image you can find JPEG artifacts within the AI image; these normally only appear at the edges in normal images. Suffice to say this method works 😀

Practically, if you make a sword in Illustrious, you see how the sword sometimes goes out both ends? The CFG parameter in the AI model is a blend between what you prompt and 'adjacent pixels in the image'. The patterns which the AI model learns are 'localized' within the training image. So the reason why the collage method works is thanks to the unconditional prompting parameter. The prompt you actually feed the AI model isn't purely the text prompt.

Ask GROK on Twitter: what is the relationship between CFG, conditional prompting and unconditional prompting? The relation is

X = CFG * X_conditional + (1 - CFG) * X_unconditional

The X_unconditional term uses the image created thus far as its input argument, so it 'fills in patterns' where it is likely for those patterns to be. The gist is that this relationship is all about pixel-to-pixel adjacency, so the location of a specific pattern in the training image doesn't matter.

The collages are like a difficult math problem you task the AI model to solve, so it can become more adept at solving easier math problems later on. The AI model will usually never be able to recreate the collage training images 1:1, but it will become very adept at recreating the patterns within the training image in the attempt.

The tool to build collages doesn't matter (see the small collage-builder sketch after this article). The most important thing to know is that the trained pattern will always be relative to the image, so ideally one should have at least one pattern that goes end to end in the image. Example image here.

The second thing to know is that AI models learn patterns that have good contrast to the background, the 'shape layer' as I call it.

A benefit of collage training is that you can easily crop out bad patterns, condensing the set to only good patterns.

One more thing: if the background of the collage is #181818 gray it will perfectly match the gray background on Tensor. This creates a cool 3D effect in the gallery 😀 Syntax has examples too in Illustrious.

I always link this video as a source if you wanna know absolutely all of the theory. It's from the SD1.5 days but still holds true for present models: https://youtu.be/sFztPP9qPRc?si=B_B353yktSpeKeic

This one has a lot of overly dense information, but the gradient illustration is very cool: https://youtu.be/NrO20Jb-hy0?si=6us5FRM7qhmD_auH

Cheers!
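Since the tool doesn't matter, here is one possible collage-builder sketch using Pillow. The file names are placeholders; it simply stacks two 1024x512 landscape crops into a 1024x1024 frame on the #181818 background described above:

```python
# Sketch: stack two landscape images into a single 1024x1024 training frame.
from PIL import Image

FRAME = (1024, 1024)
BACKGROUND = "#181818"          # matches the gray gallery background on Tensor

def stack_landscapes(top_path: str, bottom_path: str, out_path: str) -> None:
    frame = Image.new("RGB", FRAME, BACKGROUND)
    top = Image.open(top_path).convert("RGB").resize((1024, 512))
    bottom = Image.open(bottom_path).convert("RGB").resize((1024, 512))
    frame.paste(top, (0, 0))        # upper half
    frame.paste(bottom, (0, 512))   # lower half
    frame.save(out_path)

# Placeholder file names:
stack_landscapes("landscape_a.jpg", "landscape_b.jpg", "collage_0001.png")
```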
Photoreal in Chroma - Things you can do

Pixelprose is (likely) part of the Chroma photoreal set: https://huggingface.co/datasets/lodestones/pixelprose

Since you are using training data, clear the negatives completely. I'm using Chroma V49 at Heun 10 steps with Beta Scheduler.

CC12M Dataset (Chroma Training Data)

Excerpts from CC12M in the pixelprose 'vlm_caption' field (without negatives):

PROMPT: a group of people on a cruise ship. There are approximately 25 people in the image. They are all wearing casual clothes and are standing around the pool on the ship. There is one person in the center of the image who is dressed up in a costume. They are wearing a pink and green tutu, a lei, and a large pair of sunglasses. They are also holding a tambourine. All of the people in the image are smiling and appear to be enjoying themselves. The background of the image is a blue sky with white clouds. The floor is made of wood and there are several chairs and tables around the pool. The image is a photograph. It is taken from a low angle and the people in the image are all in focus. The colors in the image are vibrant and the lighting is bright.
NEG: (none)

//----//

PROMPT: A post-apocalyptic woman holding a crossbow. She is crouched on a pile of rubble. She is wearing a tattered gray cloak and a pair of goggles. Her face is dirty and she has a scar on her left cheek. Her hair is long and white. She is holding the crossbow in her right hand and it is pointed at the viewer. She has a knife in her left hand. The knife has a long, curved blade. The background is a blur of gray rubble. The image is in a realistic style and the woman's expression is one of determination.
NEG: (none)

(For photoreal one might add terms to this prompt, or specify it using the 'aesthetic' tag.)

Trying again (with fixes):

PROMPT: A post-apocalyptic real photo aesthetic woman holding a crossbow. She is crouched on a pile of rubble. She is wearing a tattered gray cloak and a pair of goggles. Her face is dirty and she has a scar on her left cheek. Her hair is long and white. She is holding the crossbow in her right hand and it is pointed at the viewer. She has a knife in her left hand. The knife has a long, curved blade. The background is a blur of gray rubble. The image is in a realistic style and the woman's expression is one of determination.
NEG: fantasy_illustration gray_illustration

(Negatives are tokenized one by one, separated by whitespace, hence the underscore '_'.)

//----//

PROMPT: A scene from the movie Planet of the Apes, where a group of monkeys are driving cars on a bridge. In the foreground, a monkey is standing on the roof of a car, while another is sitting in the driver's seat. In the background, several other monkeys are driving cars, and one is standing on the roof of a car, holding a gun. The background is a destroyed city.
NEG: (none)

//----//

PROMPT: A man and a woman walking and talking. The man is on the left side of the image, and the woman is on the right side. They are both smiling. The man is wearing a dark blue suit jacket, pants, and shoes. The woman is wearing a white dress and matching shoes with a red clutch in her right hand. They are walking on a stone path lined with trees and grass on either side. In the background, there is a building with large windows. The image is a photograph taken from a slightly elevated angle.
NEG: (none)

//----//

Redcaps Dataset (Chroma Training Data)

A peculiar set within pixelprose is the Redcaps set: https://redcaps.xyz/

TLDR: prompt like a reddit title without negatives and get photoreal results. Refer to redcaps.xyz for examples.

Prompts from redcaps without negatives:

PROMPT: leaves in an alley
NEG: (none)

PROMPT: i swear, his color just shines in the mornings.
NEG: (none)

PROMPT: advice for a new owner? canon t7i, 24mm , f8./200s, 100 iso , r/beardeddragons , spiro the dragon
NEG: (none)

The reason why `canon t7i, 24mm , f8./200s, 100 iso` works is that these are the actual titles people use at r/amateurphotography (weirdos), that subreddit is part of the redcaps set, and that's why such nonsense terminology can be useful in Chroma.

Finally, photoreal NSFW:

We don't know what photoreal NSFW sets are used. But writing prompts like a th0t on r/gonewild works for photoreal.

PROMPT: elf girl fundays. just got this high collared black bodysuit off amazon. gorgeous green background. Here is my white bed. real photo aesthetic. showing off my braids and nerd glasses. any love for an eighteen blonde elf ….🤔💕(f) ?
NEG: onlyfans_footage casual_illustration

Similarly, I reckon writing pr0n video titles ought to work well for photorealistic NSFW. Feel free to match the CC12M style against the collection of NSFW story excerpts 1-30, with 1K paragraphs in each generator: https://perchance.org/fusion-t2i-nsfw-stories-1

Batch encoding size for the T5 is 512 tokens. Verify the size here: https://sd-tokenizer.rocker.boo/

I'll leave that as something people can try for themselves, with the above tips as a guide.

Getty Images

Getty Images hosts captions for their photos: https://www.gettyimages.com — copy-paste for easy photoreal results.

PROMPT: 2012 Monaco Grand Prix - Saturday 2012 Monaco Grand Prix - Saturday Monte Carlo, Monaco 26th May 2012 Force India girls. Photo by Andrew Ferraro/LAT Images
Negatives: television_screen plastic_wig gray_3D_blur

PROMPT: Mel C performs at the V99 festival in Chelmsford on August 21st 1999 CHELMSFORD, ENGLAND - AUGUST 21: Former spice girl Melanie Chisholm performs her first major solo gig at the V99 festival in Chelmsford on August 21, 1999. (Photo by Dave Hogan/Getty Images)
Negatives: television_screen plastic_wig gray_3D_blur

Fangrowth Generator

For NSFW try this generator: https://www.fangrowth.io/onlyfans-caption-generator/ It works well in combination with: https://perchance.org/fusion-t2i-phototitle-1

For example, the tag 'Amateur' =>

I can't think of a few things we could do to make this pool more fun
I don't even know why I put a bathing suit on ;)
Everything about this moment felt right
Jiggly in all the best places
Now I'm a tanned milf lol

//---//

Finally, the conclusion I draw from Gonkee's video on embeddings in SD models ( https://youtu.be/sFztPP9qPRc?si=dckBPPpLeUMAoTnl ): repetition of concepts at various places in the prompt is better than adding weights, since stuff like (blah blah:1.2) was never the intended use for the FLUX / Chroma model.
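If you'd rather count T5 tokens locally than use the web tokenizer linked above, a minimal sketch could look like this, assuming the transformers library and the google/t5-v1_1-xxl tokenizer as a stand-in for Chroma's T5 encoder (the 512-token budget is the figure quoted above):

```python
# Sketch: count T5 tokens for a prompt against the 512-token budget.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/t5-v1_1-xxl")

prompt = "A post-apocalyptic real photo aesthetic woman holding a crossbow."
token_count = len(tokenizer(prompt).input_ids)
print(f"{token_count} / 512 tokens used")
```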
Chroma Models - The Stuff I Know

Chroma is a finetune of FLUX Schnell created by Lodestones.

HF Repo: https://huggingface.co/lodestones/Chroma

Allegedly a technical report on Chroma may be released in the future. Don't hold your breath though. From my own personal experience and others', Lodestone is not keen on explaining exactly what he is doing, or what he is planning on doing, with the Chroma models.

TLDR: There is no documentation for Chroma. We just have to figure it out ourselves. I'm writing this guide despite having nothing close to factual information with regards to exact training data, recommended use and background information on the Chroma models.

Aside from the total lack of documentation, the Chroma models are an excellent upgrade to the base FLUX model and Lodestones deserves full credit for his efforts. The cost of training the Chroma models is allegedly (at present) over 200K USD in GPU time.

Model Architecture for Chroma

The key feature is that this model has been pruned from the FLUX Schnell model, i.e. the architecture is different. Compare the keys of the .safetensors files of FLUX Dev fp8 (B), FLUX KREA (K) and FLUX Chroma (C); a small key-inspection sketch follows below. As such, don't expect good results from running a FLUX Dev trained LoRa on Chroma. Another minor change in architecture is the removal of the CLIP_L encoder: Chroma relies solely on the T5 encoder.

Architecture (Sub-models)

The Chroma models have different versions:

Chroma V49 - The latest trained checkpoint of Chroma. Unlike V48, it is assumed that V49 has undergone 'hi-res training' like the V50, but this is not confirmed due to the lack of documentation. https://tensor.art/models/895059076168345887

Chroma V50 Annealed - A checkpoint merge of the last 10 Chroma checkpoints, V49-V39, which has then undergone 'hi-res training'. 'Annealed', I have been told on Discord, means that the model has undergone one final round of training through all 5 million images in the training data at a very low learning rate. Plans are to make the V50 Annealed the 'official' FLUX Chroma model under the name 'Chroma1-HD'. https://tensor.art/models/895041239169116458

Chroma V50 - A bulls1t checkpoint merge created to secure funding for training the other checkpoint models. Don't use it.

Chroma V50 Heun - An 'accidental' checkpoint off-shoot that arose while training the Chroma model. It works surprisingly well for photorealism with the 'Heun' or 'Euler' sampler and the 'Beta' scheduler at 10 steps, 1 CFG, hence the model name. https://tensor.art/models/895078034153975868

Chroma V46 Flash - Another 'accidental' off-shoot in training that boasts the highest stability in output of all the Chroma checkpoints. Try running it with the Euler sampler and SGM Uniform scheduler at 10 steps, 1 CFG. An excellent model! https://tensor.art/models/889032308265331973

What model should I use for LoRa training? Either V49 or V50 Annealed are excellent choices in my opinion. The V49 and V50 Annealed models can both run at 10 steps with the Beta Scheduler at CFG = 1 and Guidance Scale = 5, at a cost of 0.4 credits per image generation here on Tensor.

Training

The Chroma model can do anime, furry and photorealistic content alike, including NSFW, using both natural language captions and danbooru tags. The training data has been captioned using the Google Gemma 12B model.
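For the key comparison mentioned in the architecture section, a minimal sketch using the safetensors library could look like this; the checkpoint file names are placeholders for locally downloaded files:

```python
# Sketch: compare the tensor keys of a FLUX Dev checkpoint and a Chroma checkpoint
# to see how the pruned architecture differs.
from safetensors import safe_open

def list_keys(path: str) -> set[str]:
    with safe_open(path, framework="pt") as f:
        return set(f.keys())

flux_keys = list_keys("flux1-dev-fp8.safetensors")     # placeholder path
chroma_keys = list_keys("chroma-v49.safetensors")      # placeholder path

print("Only in FLUX Dev:", sorted(flux_keys - chroma_keys)[:20])
print("Only in Chroma:  ", sorted(chroma_keys - flux_keys)[:20])
```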
A repo assembled by me has a collection of training text-image pairs used to train Chroma, stored as parquet files accessible via the Jupyter Notebook in the same repo: https://huggingface.co/datasets/codeShare/chroma_prompts/blob/main/parquet_explorer.ipynb

You'll need to download the parquet files to your Google Drive to read the prompts. Example output from the E621 set.

Lodestones' repos (items from these sets are included in my chroma_prompts repo for ease of use):

https://huggingface.co/datasets/lodestones/pixelprose
https://huggingface.co/datasets/lodestones/e621-captions/tree/main

Tip: ask GROK on Twitter for Google Colab code to read items in these sets.

//---//

The Redcaps dataset

A peculiar thing is that Chroma is trained on the redcaps dataset, redcaps.xyz. These are text-image pairs where the image is an image found on reddit and the text prompt is the title of the reddit post! If you want to have a fun time prompting Chroma, copy-paste a reddit title either off the redcaps.xyz page or from the chroma_prompts repo parquet files, and see for yourself.

Example of a redcaps prompt: I found this blue thing in my backyard. Can someone tell me what it is?

The 'Aesthetic' tags

The pixelprose dataset used to train Chroma has an 'aesthetic' score assigned to each image as a float value. This value has been rounded down to 'aesthetic 1, aesthetic 2, ..., aesthetic 10'. Additionally, all AI images used to train Chroma have been tagged as 'aesthetic 11'. (more later)

Anime test

Prompt: what is the aesthetic 0 style type of art? anime screencap with a title in red text Fox-like girl holding a wrench and a knife, dressed in futuristic armor, looking fierce with yellow eyes. Her outfit is a dark green cropped jacket and a skirt-like bottom. \: title the aesthetic 0 style poster "Aesthetic ZERO"

Captioning

The Gemma 12B model was used to caption the Chroma training data; however, this model does not run on a free-tier T4 Colab GPU like the well-established Joycaptions. To mitigate this, I'm training the Gemma 4B model to specialize in captioning images in the same format as the Chroma training data. More info on the project here: https://huggingface.co/codeShare/flux_chroma_image_captioner

Finding Prompts

I recommend you visit the AI generator at perchance.org for Chroma prompts. They have had the Chroma model in their T2I generator for a while and there are lots of users posting to the galleries. It's hard to browse old posts on perchance, so it will do you well to 'rescue' some prompts and post them here to Tensor Art.

Resolutions

Refer to the standard values for Chroma and SDXL models.

//---//
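A minimal sketch of what the parquet exploration can look like outside the notebook, assuming pandas and a locally downloaded pixelprose parquet file (the file name is a placeholder; the 'vlm_caption' column is the field quoted in the photoreal article):

```python
# Sketch: read captions from a downloaded chroma_prompts parquet file.
import pandas as pd

df = pd.read_parquet("pixelprose_part_001.parquet")   # placeholder file name
print(df.columns.tolist())                            # inspect available fields
for caption in df["vlm_caption"].head(5):
    print(caption, "\n---")
```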