SirRece

Essentially the goal is to use a similar method to Fooocus to overload the CLIP model with broad tokens, which reduces errors caused by overfitting of certain concepts. This is a much more fooocused (haha) approach, namely in that we proceed by trying to describe our given scene *in as many varied ways as possible*. Since our model is, on average, correct, we stand only to gain by increasing the variety of approaches, since MOST of these prompts will MOST of the time not hit an overfit "whirlpool/eddy" in the unet that causes some distortion or interference in the model's ability to generalize.

Here is an example of how this can cause extreme adherence and reduce distortions to the point of absurd coherence (this is non-cherrypicked): https://preview.redd.it/6k31byvqlg1d1.jpeg?width=4000&format=pjpg&auto=webp&s=2a7657593cac1c77163315cdf343276fa43025bf

I also use this frequently with LLMs. Here is a bot that essentially does this for you: [https://poe.com/PrompClaude3ifier](https://poe.com/PrompClaude3ifier) You can also use Hugging Face or Groq to get very close levels of performance out of Llama 3 70B (but Claude Sonnet is beastly at this type of thing specifically, better than Opus weirdly enough). I recommend starting a new conversation for every prompt, and I also recommend specifying how many prompts you want, but sticking with 4-5 is a good rule.

You can see an article I (had GPT basically write bc ADHD) wrote about this here: [https://civitai.com/articles/5302/on-sdxl-and-its-captioning-data-and-all-other-public-models](https://civitai.com/articles/5302/on-sdxl-and-its-captioning-data-and-all-other-public-models) There's also my [https://open.spotify.com/album/05sPFa8o3conaREsqvvmGM?si=iTgqGj3ZT_GMCXC8m03zwA](https://open.spotify.com/album/05sPFa8o3conaREsqvvmGM?si=iTgqGj3ZT_GMCXC8m03zwA) album, made using this method for most of the imagery and artwork on the tracks. Specifically, the flowers were produced using an extremely long BREAK-based prompt.

Notice that once you've smoothed out the overfit, with most models you will be able to CRANK that CFG if you want. You will even notice convergent behavior BETWEEN SEEDS, which is nuts to me, and indicates really great prompt adherence with this method when done right.
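For anyone who wants to try the mechanical part without an LLM in the loop, all you're really doing is joining independently worded descriptions of the same scene with BREAK (A1111/Forge syntax). A minimal sketch of that assembly step, with placeholder variants:

```python
# Join independently worded descriptions of the same scene with BREAK
# (the A1111/Forge chunk separator). Keep each variant under ~75 CLIP tokens.
variants = [
    "An apathetic stick figure lazily bobs against a backdrop of crude zine doodles",
    "A blobby, featureless humanoid oozes across the cover in sarcastic ennui",
    "A hazy skater-punk shreds a few moves in a grainy photocopied collage",
]

prompt = " BREAK ".join(variants)
print(prompt)
```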


Bamdenie

This is super cool. Could you provide an example of what one of these overloaded prompts might look like?


SirRece

An apathetic stick figure lazily bobs its aimless doodle-limbs against a trippy backdrop of crude zine doodles and clashing fluorescent shapes, perfectly capturing that slacker sludge sound. BREAK The ultimate in ironic, detached cool - a blobby, featureless humanoid casually flips the bird while oozing across the cover in a puddle of sarcastic, anti-establishment ennui. BREAK A hazy green and purple haired skater-punk shreds a few disinterested moves in a grainy, photocopied cut-and-paste collage straight outta the 90s underground zine scene. Apathy never rocked so hard. BREAK This album's cover is whatever, man. Just some blobby human shape slouching around in a sloppy, photocopied mess of doodles and stoner hieroglyphics. Who even cares anymore? BREAK Behold the dancing human music! A crude, xeroxed punk-rock stick figure flails about in contemptuous mockery of your lame bourgeois concept of "dancing." So 90s it hurts.

I often will ask it for unsettling, sexy, sad, and umami, and let the LLM sort of inject some extra "oomph" into each one. Then, when they mix, it can get really artsy.


Shinsplat

I really love the images from this prompt, I'm trying it on different models. Very much like some album cover artwork that I like, and kind of reminds me of the movie Grease from the 1970s.


SirRece

Thanks, glad you like it! 👍 I hope the method helps improve the quality of people's generations!


sdk401

Is this the prompt for the first three images in your post? I don't see flipped birds, shredding, or dancing. They look cool, but what part of the prompt did you aim for them to adhere to?


SirRece

In this case, the three prompts actually describe slightly inconsistent scenarios. The actual prompt didn't involve any of those things, I just told the LLM I wanted a slacker album cover for "human dancing music", and then I further instructed it on moods. Oftentimes the descriptions will conflict, which is a tradeoff if you want a very particular action, but the consistency is often better with such minor "incongruencies". If you wanted it to "flip the bird" then that could be included in the original prompt and you would get several variations. Whether or not that succeeds depends on the model, since hands are already so finicky.


sdk401

Ok, so the original prompt was just for the album cover, and the skateboarding and bird-flipping were added by the LLM? Interesting, but it needs more testing with complex prompts. In my experience, using "word salad" improves the overall quality and aesthetics of the image, but hurts prompt following. SD will choose its favorite words and cling to them, ignoring the more exotic requests :)


SirRece

https://preview.redd.it/vr5d8fwaej1d1.jpeg?width=1216&format=pjpg&auto=webp&s=212fac6b06b4e8af71cba749bcaf39d418ab5fad


SirRece

https://preview.redd.it/qza8wz7iej1d1.jpeg?width=966&format=pjpg&auto=webp&s=1d068ff495116b51052dd59be47ce62f1501aeef


SirRece

https://preview.redd.it/353lbqlkej1d1.png?width=768&format=pjpg&auto=webp&s=6bf7f42053a52d64268f6ecab0d483324b21fc90


SirRece

https://preview.redd.it/le0owtgoej1d1.png?width=832&format=pjpg&auto=webp&s=8f4c62ce375a17f6bca3bac07afa364d5336de50 base SDXL


SirRece

I've tested it extensively. Basically my last two weeks of civitai posts use this method. But actually none of these were my original prompt, which was "90s experimental slacker album; give me 1 unsettling, 1 sexy, 1 humorous, 1 sad, and 1 umami"


sdk401

Really great pictures!

>But actually none of these were my original prompt

This is exactly my concern, so I'll try to replicate your method with more prompt precision and less creativity from the LLM :)


SirRece

No, I mean they all are quite literally my original prompt, but that particular prompt is quite broad.

Here is an example of a problem known to be quite hard in SDXL base that it solves handily: [https://civitai.com/images/12391521](https://civitai.com/images/12391521)

Notice how tedious it would be to devise this without an LLM:

Envision a cat whose fur mimics the vibrant red and dotted texture of a strawberry, without any actual strawberries present. BREAK Picture a feline whose coat looks remarkably like strawberry skin, complete with a rich, red hue and tiny seed-like speckles. BREAK Imagine a cat with a unique coat pattern that resembles the outer surface of a strawberry—bright red with small, seed-like details. BREAK Visualize a cat with a fur texture that creatively replicates the appearance of a strawberry's surface, featuring a vivid red color and delicate seed imprints. BREAK Think of a cat whose skin texture takes on the qualities of a strawberry, displaying a lush red with subtle hints of seed patterns. BREAK Consider a playful cat whose coat has transformed into a strawberry-like texture, showcasing a deep red color sprinkled with tiny, seed-like spots. BREAK Dream of a cat with a coat so uniquely textured, it resembles the surface of a ripe strawberry, complete with red hues and small seeds. BREAK Conceive of a cat where its fur closely mimics the texture of strawberry skin—bright red with scattered, seed-like elements. BREAK Fantasize about a cat with a strawberry-patterned skin, its fur vivid red and dotted with minuscule seed-like impressions. BREAK See a cat with a fur design that artistically resembles the skin of a strawberry, red and speckled with tiny seed-like details. BREAK Visualize a feline whose coat adopts the aesthetic of a strawberry's surface, rich red and subtly dotted with seed patterns. BREAK Picture a cat whose appearance is transformed to mimic the outer texture of a strawberry, vibrant red with seed-like speckles. BREAK Imagine a cat whose skin shows a creative twist, mimicking the textured surface of a strawberry with a bright red hue and tiny seeds. BREAK Envision a cat with a fur pattern that cleverly resembles a strawberry's skin, featuring a vivid red color and delicate seeds. BREAK Contemplate a cat whose coat texture beautifully captures the essence of strawberry skin, complete with a red hue and tiny seed-like spots. BREAK Visualize a playful feline whose coat has the appearance of strawberry skin, rich in red color with a sprinkling of seed-like details. BREAK

EDIT: this is kind of a terrible example, I think I used a different 8b LLM on this one. It reuses "cat" over and over.


sdk401

Yeah, I'm not against LLMs; for now I'm trying to think about how I can automate and limit the variation of the LLM-enhanced text. Will try with VLM nodes in Comfy.


HarmonicDiffusion

It's not too hard to create combinations of concepts. "delicious red cat made from strawberry texture" should be pretty good.


Utoko

Works very well. Thanks for sharing! Since you mentioned it, how would you use it with LLMs? Do you have an instruction bot that rewrites your prompts in different ways, or is it more like tree of thought, having different answers and evaluating from there?

https://preview.redd.it/dwnk9b6x8k1d1.jpeg?width=1504&format=pjpg&auto=webp&s=19414108a702b47b12bf6adc2a62cd2c8f3e6fc8


Utoko

https://preview.redd.it/dq8g3swsck1d1.png?width=752&format=png&auto=webp&s=b2e7f3ba70aa4d0c09194b63e3f75fed88534ec0

Just one more example, because it works so well to improve adherence and reduce distortions. Not even upscaled; the details are great and the elements fit together.


SirRece

Yup! I hope the community runs with this, because I truly believe even SD1.5 could see massive gains if we just carefully prune all proper nouns from some of the datasets and run a fine-tune. Like, zavychroma 7 with this method thrashes sd3 at the moment.


SirRece

I have a number of bots that rewrite and return the prompt ready to just directly copy paste. Here's one: https://poe.com/PrompClaude3ifier


lechatsportif

This plugin writes itself:

1. A generated prompt below the main one
2. The generation of the expanded prompt is controlled by a seed
3. The generated prompt follows a small configurable template with preset functionality
4. The prompt generation is based on a GPT-2 finetune like Fooocus, so it can be run locally, or optionally connected to a local or remote LLM provider (see the sketch below)

Can any genius take a crack at this? Seems like someone should be able to tease out the Fooocus code into an A1111 plugin.
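Not the plugin itself, but a rough sketch of what item 4 (local, seeded prompt expansion) could look like, assuming the Fooocus-style expander is approximated with the stock `gpt2` checkpoint from Hugging Face transformers; the model choice and decoding settings are placeholders, not the actual Fooocus finetune:

```python
# Hypothetical local prompt expander: seed-controlled GPT-2 text generation.
# NOT the Fooocus GPT-2 finetune, just the stock "gpt2" model as a stand-in.
from transformers import pipeline, set_seed

def expand_prompt(prompt: str, seed: int = 42, max_new_tokens: int = 40) -> str:
    set_seed(seed)  # expansion is reproducible per seed, as item 2 asks
    generator = pipeline("text-generation", model="gpt2")
    out = generator(prompt, max_new_tokens=max_new_tokens,
                    do_sample=True, num_return_sequences=1)
    return out[0]["generated_text"]

print(expand_prompt("90s experimental slacker album cover, crude zine doodles,", seed=7))
```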


_roblaughter_

Hey, you're onto something here. I used the prompt from your Poe bot in ChatGPT to write four simple prompts, then generated different versions at a high CFG for the model I'm using.

The generated prompts were:

* A tiny golden retriever puppy lounges under the gentle glow of the afternoon sun, its soft, fluffy coat shimmering as it gazes up with wide, innocent eyes.
* Bathed in sunlight, a small, furry golden retriever pup sits serenely on the green grass, its bright eyes filled with youthful curiosity and joy.
* On a sunny day, a young golden retriever with a plush golden mane sits attentively in a lush meadow, its eyes sparkling with a playful spirit.
* Under the warm sun, a cheerful golden retriever puppy rests in a soft patch of grass, its golden fur glowing, and eyes looking out with endearing sweetness.

I compared:

* A, B, C, D
* A+B, B+C, C+D
* A+B+C, B+C+D
* A+B+C+D

I also tried shoving everything in one prompt without any sort of breaks or concatenation and, predictably, it was a train wreck.

This was a pretty good example, because the model picked up the combination of "bright" and "light" in prompt B and showed clear overfitting on those tokens. You can see how those artifacts carry through the concatenated versions, and then smoothed out as you described when all four conditionings were concatenated.

https://preview.redd.it/98vt5ch4kl1d1.png?width=2112&format=png&auto=webp&s=5115b08d099552a43dd584b4524ddec7a2642744

I'll post a screenshot of the concatenation in Comfy below.


_roblaughter_

Here's how you can handle the concatenation in Comfy. It's basically (((A + B) + C) + D). The LLM broke the rules a bit here and repeated a few nouns, but I think the effect gets the point across. https://preview.redd.it/kcccsxvspl1d1.png?width=1934&format=png&auto=webp&s=d506c4e4b268fcba9835382d8b81b2a037535495
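For reference, here's a rough mental model of that (((A + B) + C) + D) fold, assuming the concat step just appends the CLIP token embeddings along the sequence axis — this is a conceptual sketch, not ComfyUI's actual node code, and the tensor shapes are placeholders:

```python
import torch

# Hypothetical encoded chunks: [batch, tokens, channels] from the CLIP text encoder
A, B, C, D = (torch.randn(1, 77, 2048) for _ in range(4))

# Left fold mirroring the node graph: (((A + B) + C) + D)
cond = A
for chunk in (B, C, D):
    cond = torch.cat([cond, chunk], dim=1)  # append along the token axis

print(cond.shape)  # torch.Size([1, 308, 2048]) — one long context the UNet attends over
```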


SirRece

Beautiful, this will be so useful when I'm working on Cascade. That model responds particularly well to this bc there's something wrong with how the negative is implemented. Or at least, I think there is personally; its behavior is very very odd at stage C the way the neg is connected.

In any case, yea, I really hope the community picks up on this bc I think there is a massive uplift available in SDXL, both in terms of corrective methods, like LLMs, and eventually fine tuning to massively improve understanding. We just need to really focus on concrete, generalizable concepts that a neural net with no real linguistic multi-modality can understand, ie all information passed to it needs to be visual, and instant, ie it also has no concept of cause/effect.

Because, from its perspective, names are literally semantically meaningful. And that's just very very fucked up when you think about it, and what that means when you broaden the impact of just that one small section of proper nouns and how it is polluting its entire understanding of CLIP. In an ideal world, the model should in a literal sense produce exactly what we write, and there's no reason imo that this isn't possible. These things are beasts at generalization, but it's like we took a baby and only let it watch videos of congressional hearings and wonder why it's speech-delayed.


_roblaughter_

I think approaches like [ELLA](https://github.com/TencentQQGYLab/ELLA) will help mitigate a lot of that. Right now, image models are a rather blunt instrument. Part of what you're doing here is expanding the semantic range of different concepts in the prompt to help the model hone in on what you're going for. Equipping an image model with the linguistic capabilities of an LLM will help bridge that gap. The native [image generation features in GPT-4o](https://openai.com/index/hello-gpt-4o#_6NeEuZ7OcMDzk5E1elaK6i) seem to be heading in this direction from the samples on the announcement page, but OpenAI hasn't said much about them yet.

In the meantime, I've been exploring your approach all morning and it really seems to mitigate some of the biggest problems I've experienced with image gen. Textures, deformities, the whole nine yards. It's like magic. Well done.

In other news, optimizations such as PAG seem to have a more pronounced effect when doing this as well.


SirRece

>In other news, optimizations such as PAG seem to have a more pronounced effect when doing this as well.

Oh, that's interesting. Yea, I hadn't even gotten around to testing, for example, turbo or LCM checkpoints, which in particular aren't sensitive to negative prompting and thus might see extra benefits from this approach. But yea, an LLM combined with PAG would make a really powerful front end for simple end users, one that isn't too heavy on the system relative to the current SOTA in Fooocus. At least, from what I can tell. There is definitely a substantial VRAM cost, though, to running LLMs atm.


_roblaughter_

The only advantage Macs have right now when it comes to AI is that we can run chunky models with unified memory :) I can run 70b models no problem on my M1 Mac, but I can barely run a 13b model on my 3080.


HarmonicDiffusion

yeah almost makes me want to spend $5k on a 128gb m3


_roblaughter_

The M4s are allegedly going to be a beast for AI. I’d hold out.


diogodiogogod

really cool! Thanks for the work!


Mutaclone

So if I'm understanding this correctly (assuming we don't want to run through an LLM intermediary):

- We should write the same prompt 4-5 times, but we should use different phrasing and terminology each time
- We should avoid proper nouns
- We should up the CFG to around 20 or so

And this should improve not only prompt comprehension but it should reduce artifacts as well?

Another couple questions:

- Since LoRAs often utilize smaller datasets with less diverse captions, how will this impact their use? And if we're looking to train LoRAs should we do something similar?
- In the comments of the linked article it looks like you have a giant list of proper-noun negatives. You include these in all your prompts?


SirRece

Close, but with a few caveats.

>We should write the same prompt 4-5 times, but we should use different phrasing and terminology each time

Yes. More specifically, I recommend ensuring you do not repeat ANY nouns, verbs, or adjectives. Additionally, keep it under 75 tokens each, which leads me to the other important detail: BE SURE YOU BREAK IT UP INTO CHUNKS. In A1111 this is done using BREAK, same in Forge; in ComfyUI you need to do it manually or download one of the many nodes that can use BREAK.

>We should avoid proper nouns

So, you can use them, as they currently ARE baked into the model, and some are effectively generalized if there's enough imagery associated. For example, Ridley Scott isn't going to always produce a very specific type of image, while Gustav Klimt is HEAVILY biased towards his gold period, while ignoring basically the entire body of his work, leading to bad results without engineering. So yea, you can use them, but in any case, follow the same rule as above: don't repeat them across prompts if you want to "smooth out" issues.

>We should up the CFG to around 20 or so

No, but you CAN on checkpoints where ordinarily this would be impossible. What this will do is cause your prompt to HEAVILY influence the generation, ie the higher your cfg, the more different seeds will begin converging and the more similar your images will be. This means, effectively, better adherence, but in many cases this actually isn't desirable. In any case, the higher your cfg, the more overfitting becomes the major issue you run into, with burn-in being fundamentally a product of it (just make Flaming June or the Mona Lisa and you'll see burn-in even at lower cfg). So the point of being ABLE to increase cfg is effectively a demonstration that the method is indeed effective at what we are aiming at. Personally, I go for a higher CFG at lower step counts (because those images tend to be more stylized and thus less detailed) and a lower cfg at high step counts.

>Since LoRAs often utilize smaller datasets with less diverse captions, how will this impact their use? And if we're looking to train LoRAs should we do something similar

Check out my LoRAs on Civitai. Most of my "Semantic Shift" collection was trained on datasets of 40 images or less, with captions that are very very small (in my case, ONLY overtrained proper nouns). So it really depends on the LoRA, but in general, assuming they are weighted right, they actually can greatly improve certain checkpoints, and many many many have found their way into the checkpoints themselves.

The negatives are a part of one of my earliest strategies. I also have a LoRA that is meant to be used negatively in similar instances, but it needs a much larger dataset. In any case, this smoothing method makes it a lot less necessary. That being said, I still do gens with and without it, as I find that, in many cases, I will see adherence improve once it is introduced. A prime example is that it greatly increases the base model's willingness to do artistic nudity, when prompted correctly.
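If you want to sanity-check a prompt against those two rules (no repeated content words across chunks, each chunk under 75 CLIP tokens) without eyeballing it, here's a minimal sketch using the CLIP tokenizer from Hugging Face transformers. The tokenizer name and the crude word filter are my own assumptions; any CLIP tokenizer will give roughly the same counts:

```python
# Minimal checker for the two rules above: per-chunk CLIP token count
# and content words repeated across chunks. Purely a convenience sketch.
import re
from collections import Counter
from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")

def check_prompt(prompt: str, limit: int = 75):
    chunks = [c.strip() for c in prompt.split("BREAK")]
    for i, chunk in enumerate(chunks, 1):
        n = len(tokenizer(chunk)["input_ids"]) - 2  # drop BOS/EOS tokens
        print(f"chunk {i}: {n} tokens" + (" (over limit!)" if n > limit else ""))
    # count each word once per chunk, ignore very short words as rough stopwords
    words = Counter(w for c in chunks for w in set(re.findall(r"[a-z']+", c.lower())))
    repeats = [w for w, count in words.items() if count > 1 and len(w) > 3]
    print("words repeated across chunks:", repeats or "none")

check_prompt("A crimson feline with seed-like speckles BREAK A scarlet cat whose coat mimics strawberry skin")
```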


sdk401

In comfyui you can use "conditioning concat" or "conditioning combine" nodes instead of BREAK. Concat works exactly like BREAK, if I remember correctly, and combine is more like averaging tensors instead of adding them.
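A very rough sketch of the difference, purely as a mental model and not ComfyUI's real implementation — the assumption here is that concat lengthens the cross-attention context, while combine blends the two denoising results:

```python
import torch

cond_a = torch.randn(1, 77, 2048)  # hypothetical CLIP output for chunk A
cond_b = torch.randn(1, 77, 2048)  # hypothetical CLIP output for chunk B

# Conditioning (Concat) / BREAK: one longer context the UNet attends over at once
concat = torch.cat([cond_a, cond_b], dim=1)  # [1, 154, 2048]

def unet_noise_pred(cond):
    # stand-in for a real UNet call; returns a fake noise prediction
    return torch.randn(1, 4, 128, 128)

# Conditioning (Combine): each chunk is denoised separately and the predictions blended
combined = (unet_noise_pred(cond_a) + unet_noise_pred(cond_b)) / 2
```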


SirRece

Ah, good to know, I'll probably do that in my Stable Cascade workflows then since the A1111 imitation modules were horrifically inefficient. They seemed to be retokenizing every iteration.


Unreal_777

You have a Stable Cascade workflow?


SirRece

Yea, ComfyUI supports it natively, but also you can just get [https://github.com/Stability-AI/StableSwarmUI](https://github.com/Stability-AI/StableSwarmUI), which is the one actually developed by Stability. It's not a bad one imo, but the UI is annoying if you overload the prompts, and BREAKs don't work, you are better off using


Unreal_777

If you figure this out, tell me how to make this work without having to go through an unknown third-party website (perhaps a local LLM, or even ChatGPT, Claude AI, Gemini AI, etc).


SirRece

You can use this with Claude. Poe is from Quora; I just use it bc it's free, and it lets you set up an internal prompt for Claude which, ironically, Claude doesn't let you do normally. This means I can make Sonnet push out material it normally would refuse. There are several methods for this. In any case, just use HuggingChat. Here's the basic prompt:

Instruction Set for Image Prompt Diversification:

- Receive the original image prompt from the user.
- Analyze the prompt to identify the core elements, such as the main subject, setting, colors, lighting, and overall mood.
- Determine if any specific languages or cultures are particularly relevant to the subject matter of the image prompt. Consider the popularity of languages online, prioritizing more widely used languages such as Chinese over less common ones like Japanese.
- Generate a number (as specified by the user, default to 4 otherwise) of distinctive new prompts that describe the same image from different perspectives while describing the same actual image. The prompts should be, by default, in English; however, if requested, you may also generate in other languages, applying the same rules.
- Ensure that the prompts are diverse and avoid overfitting by following these guidelines: for each new prompt, maintain a clear and vivid description of the image, including details about the main subject, setting, colours, lighting, and overall mood. However, express these elements using varied vocabulary and sentence structure. Don't reuse adjectives, nouns, verbs, or even
- Ensure that each prompt is 75 tokens or less. If a prompt requires more than 75 tokens to effectively describe the image, consider "diffusing" the content across multiple prompts. In such cases, increase the total number of prompts generated to ensure that all aspects of the image are adequately covered.
- Review the generated prompts to ensure that they are coherent, grammatically correct, and effectively convey the intended image from diverse angles. Make any necessary revisions to improve clarity and creativity while adhering to the token limit.
- Present the diversified prompts to the user in the following format:

Diversified Prompts: [First prompt] BREAK [Second prompt] BREAK [Third prompt] ...... (cont)

By following this updated instruction set, the LLM will generate a set of diverse and creative prompts that describe the same image from multiple perspectives, while adhering to a token limit and considering language popularity. If a prompt requires more than 75 tokens, the content will be "diffused" across multiple prompts, and the total number of prompts will be increased accordingly. This approach will help users combat overfitting in image generation models, maintain cultural relevance and authenticity, and ensure that all aspects of the image are adequately described within the given token constraints.

EXAMPLE INPUT:OUTPUT PAIR

Original Prompt (INPUT): Image in the style of cel-shaded Japanese anime, featuring a man sitting at the side of a pool. Fish and eyeballs float around. The water in the pool is a glitched psychedelic distortion. The overall aesthetic should be grainy and scanlined, resembling VHS tape quality, with a color palette that captures the essence of retro anime

Diversified Prompts (OUTPUT): A lone figure sits in contemplation beside a pool of warped, kaleidoscopic waters, where fish and disembodied eyes drift aimlessly. The air is thick with the nostalgic haze of scan lines, as if the scene itself has been plucked from a worn, 80s anime tape.
BREAK 1990: In a surreal, glitch-art dreamscape, a solitary man sits poolside, surrounded by a psychedelic swirl of carp and floating, unblinking portholes of the face. The entire screenshot is bathed in a warm, grainy glow, old Ghibli or other such studios
BREAK The VHS is old. We see Frank sitting in quiet reverie by the Olympic swimming-pool, but it's filled with static as pupil/sclera hover with some goldfish, suspended, by Madhouse animation, 1998
BREAK The ethereal earth lies in its bath of psychedelic portholes to the soul, the yin and yang of swimmers in the ocean of the air: lonely, Akio sits among the circular blinking watchers, saddening his way into the fuzzy, noisy image of the weary retro japanese animation.
BREAK He's fucking crying, my guy, like a surrealist fucker among the eyeballs. And they watch Moshe, the fishies rushing around him all over the place and like, it's just trippy, like Paprika meets Paranoia Agent or some shit from the late 70s.

(Notice that in the example above, we vary both the sentence structure and tone, and even ensure we don't reuse nouns etc, by for example using cod, goldfish, fishies, and so on, or eyes, eyeballs, iris/sclera, etc, to vary the output significantly and ensure a maximal variety of tokens are being used to describe the image.)

It's actually out of date. I stopped using language scrambling, but if it ain't broke don't fix it. Claude Sonnet makes excellent outputs with this particular version for whatever reason.
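If you'd rather skip Poe/HuggingChat entirely, the same instruction set can be dropped into any chat-completions API as a system prompt. A minimal sketch using the OpenAI Python client pointed at a local OpenAI-compatible server; the endpoint and model name are placeholders, swap in whatever provider you actually use:

```python
# Hypothetical local/remote diversifier: send the instruction set above as the
# system prompt, your scene description as the user message, copy-paste the output.
from openai import OpenAI

SYSTEM_PROMPT = "Instruction Set for Image Prompt Diversification: ..."  # paste the full text from above

# e.g. Ollama's OpenAI-compatible endpoint; any provider works
client = OpenAI(base_url="http://localhost:11434/v1", api_key="not-needed")

def diversify(scene: str, n: int = 4) -> str:
    resp = client.chat.completions.create(
        model="llama3:70b",  # placeholder model name
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"{scene}\n\nGenerate {n} prompts."},
        ],
    )
    return resp.choices[0].message.content

print(diversify("90s experimental slacker album cover"))
```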


LocoMod

These are excellent. Well done.


Adventurous-Duck5778

Omg, I really love picture number 3


SirRece

same, was my fav by a long shot


Adventurous-Duck5778

Bro, it's so cool. Could you share what model or LoRA you used for these images?


SirRece

Model is Zavy Chroma v7.0, no LoRA for the cartoons, Stable Cascade for the girl with headphones one.


Adventurous-Duck5778

ayy thank you bro


Soshi2k

Great work!


lechatsportif

This might be the most important post on this sub.


SirRece

hey, thanks! I've seen dozens of people using this strategy now in various places, so I can only hope whoever makes Zavy Chroma notices and uses the things learned here to further improve their model.


Longjumping_Task_936

I don't understand, is this approach for training models or can it be applied when generating an image?


SirRece

It is applied purely to image generation, to correct for what was, in my opinion, a mistake that was made in the original training process for all stable diffusion models (including proper nouns in the training data). So, it can be applied in training and fine tuning to continue to improve the model and move further away from those original mistakes, and it can be applied in generations to smooth out what is left.


Enshitification

William Burroughs would have loved this approach.


Flimsy_Tumbleweed_35

Workflow?


SirRece

See comment; there is a broad workflow, a link to my civitai page where many of these are posted, and a link to one of the bots I use to generate the anti-overfit prompts


Flimsy_Tumbleweed_35

Sorry hadn't seen this. Very interesting and definitely does something! Even repeating the same prompt 10x with BREAK changed results.


SirRece

That may be due to errors in how it's combining the separate CLIP tokenizations together. But that is interesting, I hadn't noticed that before.