💾 Archived View for sol.cities.yesterweb.org › blog › 20220907.gmi captured on 2024-07-09 at 00:05:41. Gemini links have been rewritten to link to archived content


-=-=-=-=-=-=-

initial thoughts on stable diffusion's dataset

crossposted from my website [1]

i've come across this little article about stable diffusion's training dataset [2]. unlike openAI with dall-e, stability is rather transparent about this stuff, which is great. so. i wanna talk about the dataset — or rather, the fraction of the dataset [3] that's been organized and can be browsed. read the article first and then come back so i don't have to paraphrase it all lmao

the fraction we can browse is composed of 12 million image-caption pairs, or "2% of the 600 million images used to train the most recent three checkpoints". so it's a lot but doesn't even scratch the surface.
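just to sanity-check the numbers quoted above (napkin math, not something from the article itself):

```python
# sanity-checking the dataset fractions
browsable_pairs = 12_000_000        # image-caption pairs in the browsable subset
recent_training_set = 600_000_000   # images behind the most recent checkpoints

fraction = browsable_pairs / recent_training_set
print(f"{fraction:.0%}")  # prints "2%"
```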

the way this data was collected (web scraping for img tags with alt text; captions are mostly in english) absolutely shows; you can notice a few things.

here i'm considering "stolen art" to be like. stuff contemporary artists drew and posted themselves? like if someone were to repost your art from tumblr to twitter, that'd be stolen. just clarifying terminology, not making a point yet

the article says the largest number of images comes from pinterest, and yeah, you can see that. shopping sites, stock images and stuff like blogspot and flickr are also heavy contributors. but since even the non-pinterest stuff is the kind of stuff that's also on pinterest, you could honestly just say stable diffusion is trained on pinterest soup. it's hyperbole, but ehh, not by much? that's just my opinion though!

well so now! what do i think about this? it's... kinda tricky. on the one hand, the idea of web scraping itself can seem rather scary (but it's also what makes search engines work!). it also makes for a bit of a shit dataset, i'll get there. on the other... well let's talk about copyright and permission?

for starters, here's an interesting video about copyright abolition [4]. if you've been around art social media for longer than 24 hours, and especially if you had to endure the height of NFTs, you know damn well copyright won't do anything for you. it's there to protect massive media monopolies' profits; no one gives a shit if someone's reposting your art on twitter, and then that gets pinned to pinterest, and that in turn gets reposted somewhere else, etc. hell, a bunch of stuff in this dataset is just that. maybe i'm just a nobody online artist, but what are you losing? money? clout? sure, i wouldn't like my art to be shared around without its context either, but it's interesting to interrogate why that is.

but in any case, that's neither here nor there, because that's not what stable diffusion does. you see, a model doesn't store image data. here's a great non-technical explanation [5], but essentially the images become mathematical mush, and it's a lossy process, meaning the original images comprising the training data aren't exactly in the model at all. wanna see that in practice? looking at the dataset, you can find two images captioned Two Young Women Kissing, Leopold Boilly 1790-1794 [6] (and a few extra words). here are two images generated with simple stable [7] from this prompt (as 50% quality jpegs for compression):

./img/kiss1.jpg

./img/kiss2.jpg

as you can see, these images fit the prompt rather well, but are far from being copies of the original! you see, the dataset also has several other image-text pairs including "women", "kissing" and "1790", so of course the image gets mixed with other stuff inside the black box. the prompt doesn't include anything about the background / room, so the model just focuses on the kiss instead, changing the composition accordingly. and this was not the only prompt i tried! it's basically impossible to pluck a single image out of the model. the only way to modify a specific stolen image with a GAN is by directly feeding it in as an initial image — and that's got nothing to do with the generator itself; the same thing has been done through tracing and photoshopping for as long as there's been art online.
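here's some extra napkin math on the "mathematical mush" point. the ~4 GB figure is my own assumption (roughly the size of a v1 checkpoint), not a number from the article:

```python
# upper bound on how much "storage" each training image could possibly get
checkpoint_bytes = 4 * 10**9       # assumed: roughly a v1 checkpoint's size
training_images = 600_000_000      # from the article

bytes_per_image = checkpoint_bytes / training_images
print(f"{bytes_per_image:.1f} bytes per image")  # prints "6.7 bytes per image"
```

under seven bytes per image is not even a single pixel of storage, so whatever the model keeps, it isn't the images themselves.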

so like. dataset ethics is its own can of worms, as is web scraping and the collection of all this "publicly" available data! there's a huge discussion there: archiving is acceptable use, but then can't we use the archive? do we need to use the archive? how can we find alternatives to have easily accessible generators without resorting to massive and indiscriminate data harvesting? the technological cat is out of the bag, where do we go from here? there's a lot of stuff that i'm just glossing over here; because my point right now is that GANs are not automated art thieves. all this without even having to discuss art history! (but here's a comic by jaakko pallasvuo [8] that touches on that!)

> this release is the culmination of many hours of collective effort to create a single file that compresses the visual information of humanity into a few gigabytes.

this is what the stability team says in the stable diffusion public release [9]. as i've said before, AI researchers looooove to attempt to make a map that's the size of the territory (i'm pretty sure there's a story like that in invisible cities, but i can never find it by just looking it up, and i don't have a physical copy. boo). we all know that's impossible. this model is trained on a snapshot of a particular section of the world: internet images, captioned in english, supposedly after a filter attempts to get rid of the lowest-fidelity captions. it's an intrinsically flawed dataset, because all generalist datasets have to be.

still, i think this is a way better model / dataset than the heavily censored dall-e stuff, especially when it comes to artistic freedom. the world is messy, it includes nudity and sex and celebrities' faces and blood and trademarked characters! i don't want to defend making vile stuff or whatever, but you get what i mean, right? its potential will always be shaped by the fact that it relies on random-ass internet images that were not captioned with machine learning in mind (many of them aren't even made with accessibility in mind!). it is pinterest goop! it carries biases as well, as they all do, and as the bulk of the internet does... it's a very white dataset, for starters, and focusing on english-language captions also has its impact. this stuff always needs to be addressed when it comes to machine learning, lest we tech-wash harmful worldviews. but still, nice looking art can be cajoled out of it (let's please not get into a "what is art" discussion).

i don't know how to close this off. i recommend you poke around the dataset, it's pretty interesting. i just wanted to talk about it, especially because it's allowed me to expand and rethink things i've said before about plagiarism and ML [10]. um yeah. i might come back to this later, but for now these are my thoughts on the subject!

i have, in fact, written more on the subject! don't really feel like converting it to gemtext at the moment however, so please check the first link for the follow-up

links

[1] original post, updated on december 19th 2022

[2] exploring 12 million of the 2.3 billion images used to train stable diffusion's image generator

[3] laion-aesthetic

[4] the golden calf | abolishing copyright law

[5] reachartwork: AI isn't an automatic collage machine

[6] two young women kissing, leopold boilly 1790-1794

[7] simple stable

[8] comic by jaakko pallasvuo

[9] stable diffusion public release

[10] my previous comments on plagiarism and ML

index