This LoRA is intended for use with https://huggingface.co/alibaba-pai/Wan2.1-Fun-1.3B-InP. Other 1.3B Wan img2vid models might be supported, but only if they use the same weight names; otherwise it will only partially work. Download diffusion_pytorch_model.safetensors and place it in your ComfyUI checkpoints folder. The other model files are the same as the 14B's i2v files, so 14B i2v workflows should work if you switch the model.
I've also reuploaded it to civitai now, https://civitai.com/models/1450534?modelVersionId=1640053
The 1.3B model isn't bad for nsfw content; people are likely just training it wrong. This LoRA was trained on a large variety of content and can output a similarly wide variety. Both furry and realistic content are supported.
Human characters
While this LoRA was made for anthro furry characters, a significant amount of human content was included in the last few epochs of v2 to make the motions, including physics, look more realistic. Human content was tagged with "realistic" at the end of the caption; furry content was tagged with "furry animation" at the start.
Prompting guide
Theoretically, most natural-language prompts should work as well as tag prompts, since I varied them throughout the dataset. All videos were cut and captioned by hand (by me), using a tool to make it more convenient; I might consider uploading the tool once it's more convenient to use. Not providing a prompt usually leads to very little movement.
"the woman" and "her" are interchangable, same with "the man" and "he"
Trained prompt structure (don't copy it verbatim; the parts in braces and brackets are just examples): furry animation, {character description} is {action description}, {additional descriptions}, [realistic|the scene is depicted with a detailed 2d drawing|the scene is depicted in 3d]
Character description examples (it usually doesn't need to be very specific, since this is i2v): "an anthro furry fox woman"; for human characters you can usually just put "a woman".
The action description describes the position. A few currently working options are: cowgirl position, reverse cowgirl position, doggystyle position, missionary position, teasing with her tongue, [a woman] uses her breasts to stroke a man's penis. There are probably a few more.
Additional descriptions consist of perspective (pov is going to work best), speed, depth, pulling out (doesn't work well currently), and cumshots (also don't work very well).
Perspective is written as natural language; pov was mostly tagged as "viewed from a first-person pov perspective". Since this is i2v you don't need to worry much about it, and just tagging "pov" should also work.
Speed is described in natural language, and the exact words do make a difference: "{speed} [thrusting|riding|sucking]".
Depth is described similarly to speed, just with depth words instead.
Movement of the woman can be prompted with: "she moves up and down as she rides his cock".
Movement of the man can be prompted with: "he thrusts into her *****" and similar; speed can be included here as well, and I've noticed it still works.
Additionally, you can add things like "the woman's ass jiggles with each thrust". I can't really give a full list here.
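To tie this together, here are a couple of fully assembled prompts following the structure above (hypothetical examples built from the pieces described here, not captions copied from the dataset):

```
furry animation, an anthro furry fox woman is riding a man in cowgirl position, she moves up and down as she rides his cock, fast riding, viewed from a first-person pov perspective, the scene is depicted in 3d
a woman is sucking a man's penis, slow sucking, viewed from the side, realistic
```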
Version readmes
v2 readme
The model has been re-trained from scratch, with a few notable changes. The img2vid results should look more fitting in nearly every case, and there should be much more motion.
Changes from v1's training:
Base model: while v1 was trained on the default Wan t2v 1.3B, the new model is trained on the actual Wan Fun 1.3B InP, which is the model this LoRA is intended to be used with.
This was achieved by simply providing the missing information in diffusion-pipe; it's technically already supported, it just needs to be activated. This PR enables that.
This not only helps the model properly use movements, it also improves consistency with img2vid
The lora's rank has been increased from 32 to 64
The dataset has had a few changes
The videos have been 16fps from the beginning
The training resolution was dropped from 400 to 256 as a tradeoff for memory usage (then raised to 480 for epoch 70, as this seemingly improves motion)
The training frame-count buckets have been improved, from v1's [1, 24] to v2's [1, 16, 24, 32, 40]. This allowed training on longer videos with more context info (see the bucketing sketch after this list)
The v2 model was trained at a higher learning rate than v1; I might consider a value in between the old and current ones
At only 12 epochs, the model has more consistent motion than v1 at 40 epochs!
The training dataset has contained human data since the switch to 480 res. This helps with movements and physics, and also reduces artifacts like random cutoffs. There are still some "stretch" artifacts in some situations.
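For the curious, here's a rough sketch of the frame-bucketing idea mentioned above (my own illustration in Python, not diffusion-pipe's actual code): each clip is assigned to the largest bucket it can fill and trimmed to that length.

```python
# Illustrative sketch of frame-count bucketing; not diffusion-pipe's actual code.
FRAME_BUCKETS = [1, 16, 24, 32, 40]  # v2 buckets; v1 only had [1, 24]

def pick_bucket(num_frames: int) -> int:
    """Return the largest bucket that does not exceed the clip's frame count."""
    fitting = [b for b in FRAME_BUCKETS if b <= num_frames]
    return max(fitting) if fitting else FRAME_BUCKETS[0]

# e.g. a 37-frame clip lands in the 32-frame bucket and is trimmed to 32 frames
print(pick_bucket(37))  # -> 32
```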
v1 readme
A model that should be better at animating furry ****; that's pretty much it. It's not good at txt2vid, so I don't recommend that; maybe this could be improved by training on images as well.
This is mostly a proof-of-concept to demonstrate that a LoRA can be made for Wan 2.1 Fun 1.3B InP, and I think it shows that this is indeed the case.
Btw, generating short videos (<1.5 sec) with img2vid at a slightly lower resolution lets you generate a video in about a minute on an RTX 3060. Doing the same with the 14B model takes me more than 10 minutes. The 1.3B deserves more love.
Usage
Most importantly, use Wan 2.1 Fun 1.3B InP with img2vid; regular txt2vid is not going to give very good results, as the LoRA is not high-rank enough, or even trained enough, for that. While some concepts will be visible, it will not produce very good quality outputs.
When testing, I noticed that just prompting naturally usually yields the best results; however, there are a few things that were tagged a few times in the dataset.
Note that neither speeds nor depths are going to have much impact, likely due to some issues described in the training section.
Positions
The model was trained on cowgirl, reverse cowgirl, missionary, blowjob, deepthroat, and some teasing as well
Perspective
Mainly "viewed from a first-person pov perspective", "viewed from the side". Other descriptions should hopefully work.
Speeds
Speeds are written like "[speed] thrusting" or "[speed] sucking"
Available speeds are: "slow", "moderate speed", "fast" and "very fast"
Depths
Depths are written like "[depth] thrusts" or "[depth] sucks"
Available depths are: "shallow", "moderate", "deep" and "balls deep"
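Combining the above, a usage prompt with position, perspective, speed, and depth might look like this (a hypothetical example assembled from the tags listed here, not a dataset caption):

```
furry animation, an anthro furry fox woman is riding a man in cowgirl position, fast riding, deep thrusts, viewed from a first-person pov perspective
```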
Features
Jiggling breasts (Seems to be pretty noticeable in generations)
Jiggling ass
This LoRA has been tested with images generated with Novafurry and Willy's Noob Realism, as shown in the preview videos. It should work on outputs generated from whatever model, though.
Training info
This model is a LoRA painstakingly trained on a single RTX 3060 for a total of 40 epochs, on a dataset of about 45 manually tagged clips of nsfw furry content.
The first ~36 epochs were trained with varying framerates; assuming diffusion-pipe doesn't normalize that itself, I then re-encoded the dataset to 16fps and trained 4 more epochs. This seems to have made the motion a little better; overall, I'm still not happy.
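If you want to do the same re-encode, here's a minimal sketch (assuming ffmpeg is on your PATH; the folder names and encoder settings are illustrative, not necessarily what I used):

```python
# Minimal sketch: re-encode every clip in a folder to a constant 16 fps with ffmpeg.
import subprocess
from pathlib import Path

SRC = Path("dataset")          # hypothetical input folder
DST = Path("dataset_16fps")    # hypothetical output folder
DST.mkdir(exist_ok=True)

for clip in SRC.glob("*.mp4"):
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(clip),
         "-vf", "fps=16",                   # resample to a constant 16 fps
         "-c:v", "libx264", "-crf", "18",   # re-encode; quality setting is a guess
         "-an",                             # training clips don't need audio
         str(DST / clip.name)],
        check=True,
    )
```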
The dataset was scaled to resolutions with pixel counts similar to a 400x400 image, at 24 frames. This still used too much VRAM, so I used a block swap of 10; with that, I was able to train at about 2 epochs per hour.
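To make that scaling concrete: the idea is to keep the total pixel area near 400x400 while preserving aspect ratio, roughly like this sketch (my illustration of the idea; rounding to multiples of 16 is an assumption, not necessarily what diffusion-pipe does internally):

```python
# Sketch: scale a video so its area is close to 400*400 while keeping aspect ratio.
def scale_to_area(width: int, height: int, target: int = 400, multiple: int = 16):
    scale = (target * target / (width * height)) ** 0.5
    w = max(multiple, round(width * scale / multiple) * multiple)
    h = max(multiple, round(height * scale / multiple) * multiple)
    return w, h

print(scale_to_area(1920, 1080))  # -> (528, 304), area ~ 400*400
```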
I used diffusion-pipe for the training; since I don't have a budget for anything, I trained locally.
The model seems underfitted for txt2vid, and the LoRA rank is only 32. If I were to train it again, there are a few things I would do differently, namely:
I would start by training on images, so the model can get a better understanding of what anthros look like
I would retag the dataset, going over each entry multiple times instead of just once, since I feel like I might have missed some things
I would use a higher rank for the lora, as I believe 32 might be a bit low for such a broad concept
I would make sure the dataset is already at the correct framerate; I noticed there was not much movement except with some less commonly used tags, which might be because high-fps videos were effectively in slow motion
While this was trained on Wan 2.1 txt2vid 1.3B, it is intended for img2vid using https://huggingface.co/alibaba-pai/Wan2.1-Fun-1.3B-InP; I have noticed that no additional training is needed, and Wan 2.1 txt2vid 1.3B LoRAs work properly on Wan2.1-Fun-1.3B-InP. I hope this information helps others in the future.
I am overall not happy with how this turned out, but I will likely retrain this model from scratch in the future, once I can put some money into a cloud GPU provider or similar and train faster without it blocking me from doing other things.
Yap yap yap, go try the model or something.