The Art Of Poison-Pilling Music Files | Benn Jordan
Video Transcript
If you're a musician, I have some great
news for you. I've been really busy
lately and a little bit
naughty. I've made a living as an
independent professional musician for
over 25 years now. And once tech
companies started raising millions of
venture capital dollars and scraping my
music without my consent, then
generating shittier music with it that
is inadvertently associated with my name
and then attempting to resell that in
the same economy in which I make money
from my music, I was just like, you know what? I still enjoy making music all the time, but I have entirely stopped releasing it. But what I'm
showing you today is a type of encoding
that not only makes a music file more or
less untrainable by generative AI
companies, but actually has the ability
to decrease the quality and efficiency
of their entire data set. I'm going to
be showing you a lot of demonstrations
of this technology in this video, but it
attacks AI in a whole lot of other ways,
and some of them are really scary. For
example, I could say something
completely normal here or even play you
the sound of an eagle. And that sound
would be telling Siri or your Echo or
Google Home device to unlock your doors.
The adjective vulnerable was usually
defined as capable of or susceptible to
being wounded or hurt. Or I could make
Spotify think that a bunch of sex noises
are an acoustic Christian folk song. And
we're going to be exploring all of these
things. Unethical generative AI
companies have made artists feel
incredibly powerless for quite some time
now. But all of that is about to change.
And I am extremely excited to be able to
finally tell you about it in this video.
Come on. [Music]
[Music]
The modern chapter of the AI music story
begins in 2015 with the publication and
proposal of U-Net, a frankly ingenious method of using a convolutional neural network for advanced pattern recognition in biomedical imagery. In simpler terms, U-Net invented a way to recognize things in a much, much more efficient way that
didn't even require more than a few
images of training data to work. I'd go
more into the detail of how it works,
but some of my viewers may be driving,
and I don't want to cause any accidents
by making them drowsy. All we need to
know for now is that U-Net's architecture
inspired the technology that is behind
virtually all generative AI models
today. Then in 2016, Google introduced
Magenta, which is a research project
that uses machine learning to scan
insane amounts of music to learn from it
in order to create new tools to create
music. There are a lot of neat projects
with Magenta that you can play with for
free today. You can have a piano play a
duet with you, or you can turn your
voice into a saxophone sound, or play a
harp that morphs between learned sounds,
or make a bunch of circles to make a
basic melody. It's fun stuff to play
with, but a lot of it is also just a
really expensive and inefficient way to
do what modular synths like that have
been doing for decades. There is one
thing that you should consider before we
move on. Since this era, generative AI music has, with very few exceptions, primarily been showmanship for investors and more or less a solution without any sort of problem. For example, another huge AI
music technology landmark in 2016 was
Sony unveiling a song that was made with
artificial intelligence called Daddy's
Car. Now, the song sounds pretty
mid, so imagine my complete and utter
surprise to find out that it was
performed, recorded, and mastered by
humans. And the lyrics, on the other hand, sound like some stereotypically bad early AI-generated writing. But it turns out that those were written by a human, too. Now you may be asking yourself how
exactly this qualifies as artificial
intelligence. But as the lyrics of the
song famously say, from "Taxman" until "Tomorrow Never Knows." It wouldn't be
long before the goal of these tools was
to actually generate AI music. With OpenAI releasing MuseNet and ByteDance acquiring Jukedeck in 2019, Meta and Stability followed suit. And it wasn't long before we had a bunch of voice-cloning services and consumer-targeted subscription services that offered AI-generated music, like Suno and Udio. Here's a fun social experiment I
would like you to participate in. The
next time a company announces a
generative AI feature or product or
service that is for sale or for
subscription, ask them the magic 10
words. What data did you use to train
your base model? More often than not,
those 10 words magically work as a mute button for these tech companies, because they either don't know the answer to the
question or answering the question will
make them liable for literally billions
of dollars in IP infringement damages.
That's because they just recklessly
scraped Spotify and YouTube and Audible
and virtually anywhere that they could
find data, whether it was copyrighted or
not. And remember how the Recording Industry Association of America would sue soccer
moms because their kids downloaded a few
albums? It's just like that times a few
hundred million and also while raising a
few hundred million in venture capital.
And then naturally like a swarm of
locusts, tons of opportunists started
using music generating services to
generate millions of songs to put on
streaming services like Spotify, which
then siphoned royalties away from actual
musicians. And as if all of this wasn't
enough of a slap in the face to
musicians, services like Suno literally
set up funds to pay their top AI music
creators without ever even considering
paying the actual musicians that made
the music their entire service trained
on. Starting in 2023, I had met with
various US senators and their staffers
about changing legislation to require
generative AI companies above a certain
size to start keeping a record of their
training data and requiring consent from
intellectual property holders before
using it for their products and
services. A lot of artists' advocacy
groups and unions are also doing this
and quite a few early bills were penned.
But when the presidential inauguration
was crammed full of tech titans who are
spending billions on generative AI, I
kind of realized that we're going to
have to take this into our own
hands. This is a little bit of personal
history, but it's very relevant to this
video, so stick with me. In early 2023,
I joined DJ Fresh and Nico Polarin in
co-founding Voice Swap AI, and we
expanded the team by proposing a royalty
mechanism for vocalists participating in
Generative AI that not only paid them
ongoing licensing fees, but made them
part of an equity pool in the company.
What we found is that by training an
entirely new vocal base model on
consensual data and fine-tuning it in
collaboration with the artists, the
resulting voice model sounded superior
to our competitors. As a result of this,
we quickly grew to 150,000 users without
so much as a penny in third-party or
venture capital investment. We became
busy with large business clients and
have remained the only AI company that
qualifies for BMAT certification and can
pay royalties on this new type of
intellectual property. And much more
importantly, the majority of our
vocalists were earning more annually
through their voice model than the
entirety of what they were earning on
Spotify and other streaming services.
Come on, not bad, right? Last year, one
of the projects that I put a lot of time
and research into was finding a
foolproof way to take an original music
master file and detect if the music in
that file was generated with AI or not.
If you come up with a solid idea or
maybe a new or original process to
accomplish a task and you release it to
the public for free, it's only a matter
of time before big companies will say, "I made this," and then gatekeep it behind a paywall. Fortunately, somebody
tipped me off that this was happening
and I was able to apply for a patent
myself to prevent it from happening in
the future. But I learned this expensive
lesson a few months ago. The problem
with my little AI music detector is not
in its functionality, but in the lack of
incentive to use it. The people who are
generating the AI music to put it on the
streaming services and make money have
to pay middlemen distributors to publish
it and manage their royalties. And while
for some reason I thought that these
services would jump at the chance to
refuse money and not accept AI music,
they didn't seem interested in this
proposal. Surprising, I know. But
meanwhile, I had been researching a type
of technology that actually isn't all
that new. Adversarial noise. This term
first sprouted up a decade ago when
virtually every piece of technology
included a little AI assistant that you
could talk to. The infosec industry has
been aware of this for a while now, as
the information that a neural network
gathers from a sound is very different
than what a human brain gathers. This
means that just about anything that you
can accomplish via a voice command, like
ordering something on Amazon or opening
your garage door, can presumably be
triggered by a sound that human beings
cannot identify. And this is
accomplished by using adversarial noise.
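As a rough sketch of the mechanism only, not the specific attack shown in this video, here's how a gradient-based audio perturbation can be built against an open speech-recognition model. torchaudio's wav2vec2 pipeline is used as a stand-in, and the file name, step count, and noise budget are assumptions:

```python
# Illustrative only: perturb a clip so an open ASR model (wav2vec2) no longer
# "hears" its original transcription, while keeping the added noise tiny.
import torch
import torchaudio

bundle = torchaudio.pipelines.WAV2VEC2_ASR_BASE_960H
model = bundle.get_model().eval()

waveform, sr = torchaudio.load("clip.wav")                 # hypothetical input file
waveform = waveform.mean(0, keepdim=True)                  # mono
waveform = torchaudio.functional.resample(waveform, sr, bundle.sample_rate)

with torch.no_grad():
    clean_emissions, _ = model(waveform)
    clean_labels = clean_emissions.argmax(-1)              # what the model hears right now

delta = torch.zeros_like(waveform, requires_grad=True)     # the adversarial noise
optimizer = torch.optim.Adam([delta], lr=1e-4)
eps = 0.002                                                # amplitude budget to stay near-inaudible

for _ in range(200):
    emissions, _ = model(waveform + delta)
    # Untargeted attack: push predictions away from the clean transcription.
    loss = -torch.nn.functional.cross_entropy(
        emissions.reshape(-1, emissions.size(-1)), clean_labels.reshape(-1)
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    with torch.no_grad():
        delta.clamp_(-eps, eps)

torchaudio.save("clip_adversarial.wav", (waveform + delta).detach(), bundle.sample_rate)
```

A targeted variant of the same loop, minimizing the loss toward a chosen phrase's label sequence instead of maximizing it, is what turns near-inaudible noise into a hidden command.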
Let me demonstrate. Here's an attack on
an Amazon Echo Show, which by the way is
like the worst tech device that I've
ever used in my life. Let's just play
some soft classical music in the
background. And whoopsie. Benn Jordan, born October 28th, 1979, is an American
musician operating under many
pseudonyms. Here's my attack on the AI
model that's been used for speech
recognition by Meta, Facebook,
Instagram, Oculus, and then a whole lot
of others as well. If we run it directly, we can see exactly what the AI hears. Let's move over to the white-hat side of
this attack. Giant generative AI firms
have been scraping copyrighted
audiobooks to train their voice models,
but also learning from the content
within the audiobook itself. So, let's
encode that audio file to make the AI
hear nonsense. And since it's using a
self-supervised neural network like
HuBERT, it'll reinforce a false
positive, meaning that the entire model
will be shittier after it's done
training on my adversarial noise here.
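The same recipe can be aimed at a self-supervised model's feature space rather than a transcription head. A minimal sketch, assuming HuBERT via torchaudio as a stand-in, with illustrative file names and budgets:

```python
# Illustrative only: nudge an audiobook clip so a self-supervised model (HuBERT)
# extracts features far from the clean ones, i.e. it "hears" something else.
import torch
import torchaudio

bundle = torchaudio.pipelines.HUBERT_BASE
model = bundle.get_model().eval()

waveform, sr = torchaudio.load("audiobook_clip.wav")       # hypothetical input file
waveform = waveform.mean(0, keepdim=True)                  # mono
waveform = torchaudio.functional.resample(waveform, sr, bundle.sample_rate)

with torch.no_grad():
    clean_features, _ = model(waveform)                    # what the model would learn from

delta = torch.zeros_like(waveform, requires_grad=True)
optimizer = torch.optim.Adam([delta], lr=1e-4)
eps = 0.001

for _ in range(300):
    features, _ = model(waveform + delta)
    loss = -torch.nn.functional.mse_loss(features, clean_features)   # drive the features away
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    with torch.no_grad():
        delta.clamp_(-eps, eps)

torchaudio.save("audiobook_clip_poisoned.wav", (waveform + delta).detach(), bundle.sample_rate)
```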
My alarm clock woke me that day as
always at
6:13. I went to the kitchen, made coffee
and toast. My alarm clock woke me that
day as always at
6:13. I went to the kitchen, made coffee
and toast. My friends at the University
of Tennessee Knoxville have made an art
out of these types of attacks. This one,
for example, utilizes a small physical speaker to introduce an inaudible layer to your own speech and commands in real time by mimicking small environmental sounds and real-world distortions. So even
if you have high security settings where
your AI assistant can only respond to
your voice for high-security commands
like disabling your alarm system, that
also can be manipulated. Speaking of the
University of Tennessee, I took a trip
up to steamy old Knoxville to visit
Syed Ali Meerza and Jian Liu to learn more about HarmonyCloak, one of the research
team's more recent projects that encodes
a music file with adversarial noise that
utterly breaks AI's ability to find
melody or rhythm. Here's some of their
own demos with some pretty simple, basic music. [Music] [Applause] And here is what the AI models generate based on the unencoded music files. [Music] Now, here's what they create from the music encoded with HarmonyCloak. [Music] Anyone familiar with AI training
will recognize this, by the way. It's
TensorBoard, and it functions as a
convenient visual guide to know when
training is no longer improving your
model. That way you know when to stop
the process so as not to waste time and
energy. The red line is from training on
normal audio files and the blue is files
encoded with HarmonyCloak. You'll
notice that almost immediately the model
stops being able to improve itself. Have
you tested the original versus perturbed
music again? Like have you tested it
with other students? Have people claimed
to be able to hear the difference? We
have a user study. Uh in the paper we
involved over 30 participants. I think
most of them are music lovers. We asked
them to give a rating. I think the rating we got for the unlearnable examples is pretty similar to the clean music. The model has been rigorously tested, and the team has been actively working to make it more efficient. It was then and there
in Knoxville when I suddenly had the
urge to release music again as a test
subject for widespread development. But
I had a lot more work to do and a whole
lot of training and
testing. Okay, so remember U-Net from earlier in this video. U-Net pioneered
something called diffusion. Instead of
trying to generate an image or sound by
drawing it, they start with noise and
then shape it based on what it learned
from training. Now, music generating
algorithms seem like a pretty big
technological leap from image generating
algorithms. But when you introduce a
music file to a modern neural network to
learn from and train on, it's merely looking at a spectrogram image of the audio, which can then be interpreted back as
audio. If you go back some years on this
channel, you could see some fun videos
of me playing creatively with software
that does this. So in that spectral
image, the AI looks for two classes of
characteristics. The first is the tones,
the rhythm, the melody as seen here, and
more importantly, why this chord often
seems to come after this chord in a
particular style of music or how or why
a swing beat works in a particular
situation. The second is the sounds
themselves, something that Google's
Magenta had pioneered very early on.
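To make the spectrogram-as-image idea concrete, here's a small sketch of the round trip in torchaudio (the file names are placeholders): the waveform becomes a 2-D magnitude spectrogram, and Griffin-Lim estimates the missing phase to turn that image back into audio.

```python
# Audio -> spectrogram "image" -> audio again (lossy, since phase is re-estimated).
import torchaudio

n_fft = 1024
waveform, sr = torchaudio.load("song.wav")                   # hypothetical input file

to_spec = torchaudio.transforms.Spectrogram(n_fft=n_fft, power=2.0)
spectrogram = to_spec(waveform)                               # (channels, freq_bins, time_frames)

griffin_lim = torchaudio.transforms.GriffinLim(n_fft=n_fft, power=2.0)
reconstructed = griffin_lim(spectrogram)                      # back to a waveform

torchaudio.save("song_roundtrip.wav", reconstructed, sr)
```

That 2-D array is effectively what a spectrogram-based model "sees" during training, which is why image-style tricks transfer to audio so readily.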
That sound-level recognition is why AI stem separation became
one of the first features sites and
software offered. Using these types of
algorithms, the AI can detect what part
of the spectral image has drum sounds in
it or bass or guitar. Anyway, that
second class of functionality is what I
spent an absurd amount of time trying to
break. Now, remember what I showed you
with adversarial noise, inserting
inaudible commands to say something
different to a speech recognition
engine. Well, turns out that I could do
these targeted attacks on instruments,
too. One of the most advanced and
accurate instrument classifiers is an
API offered by CAM. So, what does this
sound like to you? If you guessed
cymbal, good job. However, after encoding it with adversarial noise, instrument classifiers think it's a harmonica, or that this song has string instruments in it. This could also potentially have a snowball effect by making a generative AI model continuously fooled by false positives, so it'll be more likely to think that every similar synthesizer sound it encounters is also a string quartet. And for this reason, I've been calling my attack Poisonify. HarmonyCloak combined with Poisonify makes music not only untrainable, but it threatens to degrade the quality of the entire model.
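Mechanically, a targeted version of the earlier loop is all it takes. Because the commercial classifier isn't public or differentiable from the outside, the tiny network below is only a stand-in, and the class indices, file names, and budgets are invented for illustration; the actual Poisonify encoding isn't published.

```python
# Illustrative targeted attack: push a cymbal clip toward a chosen wrong label
# (e.g. "harmonica") for a differentiable, spectrogram-based classifier.
import torch
import torch.nn as nn
import torchaudio

N_CLASSES = 10                  # hypothetical label set
TARGET_CLASS = 3                # pretend index 3 means "harmonica"

mel = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_mels=64)

classifier = nn.Sequential(     # placeholder for a real, pretrained instrument tagger
    nn.Conv2d(1, 16, 3, stride=2), nn.ReLU(),
    nn.Conv2d(16, 32, 3, stride=2), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, N_CLASSES),
).eval()

waveform, sr = torchaudio.load("cymbal.wav")                 # hypothetical input file
waveform = waveform.mean(0, keepdim=True)                    # mono
waveform = torchaudio.functional.resample(waveform, sr, 16000)

delta = torch.zeros_like(waveform, requires_grad=True)
optimizer = torch.optim.Adam([delta], lr=1e-3)
eps = 0.002
target = torch.tensor([TARGET_CLASS])

for _ in range(500):
    logits = classifier(mel(waveform + delta).unsqueeze(1))   # (batch, 1, mels, frames)
    loss = nn.functional.cross_entropy(logits, target)        # pull toward the wrong class
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    with torch.no_grad():
        delta.clamp_(-eps, eps)                               # keep the change near-inaudible

torchaudio.save("cymbal_poisonified.wav", (waveform + delta).detach(), 16000)
```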
Suno is a generative AI company that started out in late 2023 with text-to-speech and generative music models and has now raised hundreds of millions of venture capital dollars. They are now in a legal battle with the Recording Industry Association of America for damages amounting to something like $1.5 trillion for training their models on copyrighted data without asking anyone for permission. Now, Mikey Shulman, Suno's CEO, has some really enlightening
takes. It's not really enjoyable to make
music. Now, I think the majority of
people don't enjoy the majority of the
time they spend making music. And by
enlightening, I meant that there simply
has to be something enlightening about
being so astray from reality that you
would think that musicians don't enjoy
making music. And within the delusional
reality that Michael lives in, this is
the problem that he's providing a
solution to. And to be fair, this
mindset opens up a lot of really
lucrative and creative business
opportunities. Maybe he could expand
Suno to make autonomous machines that
will roll bowling balls down a lane, so
you could just pay him a subscription
fee instead of bowling with your
friends. Aside from the Mental
Gymnastics Olympics that the company was
founded on, Suno has a really useful
feature where you can upload a song and
then the service will automatically
extend it. It doesn't seem to listen to
prompts very well and it doesn't sound
very good, but it does provide a great
test bed for my little project here. So,
here we go. We can upload my original song here. [Music] And now here is Suno's AI
extension of that song. Fragments of
light lost in data
sweep shadow memory in the glowing
rain. Okay, now let's upload my Poisonify-encoded track. [Music] And here is Suno's AI-generated extension. [Music]
I would describe this as music from an
airport spa that somebody downloaded off
of Napster in 1999. There's an even more
recent and competitive Generative AI
music outfit from China that I only know
about initially just because they blew
me up so much in my email trying to get
me to promote them on this channel. Ask
and you shall receive, I guess. It's
called MiniMax Audio, and a lot of
creators seem to be raving about it, but
I'm assuming that they're raving about
it because they're being paid to rave
about it. Uh, it doesn't really blow my
mind or anything. Anyway, if you tap
directly into the API, it has a feature
that allows you to extend a song the
same way that we did in the Suno test.
And so technically how this works is
they scan the song that you uploaded and
then fine-tune the model to generate
more of what it had heard. [Music] And now let's feed it the song that's been encoded with HarmonyCloak and Poisonify. Ultimately, we have this. [Music] Another popular model that has this extend feature is Meta's MusicGen. I had tried that out early on, but I might as well show you what that does as well.
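For reference, MusicGen's continuation mode can be driven from Meta's audiocraft package. A rough sketch, where the checkpoint size, prompt length, durations, and file names are assumptions rather than the exact setup used here:

```python
# Extend a song with MusicGen: feed it a short prompt clip and let it continue.
import torchaudio
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

model = MusicGen.get_pretrained("facebook/musicgen-small")
model.set_generation_params(duration=30)            # total length, prompt included

prompt, sr = torchaudio.load("original_song.wav")   # hypothetical input file
prompt = prompt[..., : 10 * sr]                     # keep the prompt short (~10 s)

continuation = model.generate_continuation(prompt, sr, progress=True)
audio_write("extended_song", continuation[0].cpu(), model.sample_rate)
```

Running the same script on the clean file and on the HarmonyCloak/Poisonify-encoded file is essentially the comparison being made here.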
Here is audio fine-tuned on the original, unencoded song. [Music] And then, just like MiniMax, when we upload the version that's been encoded by HarmonyCloak and Poisonify, it hangs. This is pretty, pretty good. So, as I'm
editing this video, YouTube just
announced that they will be launching a
new feature for YouTube creators where
you can generate your own AI music. So,
there's some very conflicting interests
on the same website. God damn. There are
some challenges, though. First of all, instrument classification isn't exclusively used by generative AI; it's also used by Spotify to sort out the recommendation algorithm. Now, my most abrasive music may be recommended to people who exclusively listen to barbershop quartets, which I personally
think is awesome, but I could see how
some musicians might not want that. The
other challenge is efficiency. HarmonyCloak requires a bunch of high-end
specialized GPUs currently. However, the
team is working on a much more efficient
model that they are testing as I record
this. And then here locally, when working on Poisonify, I've been using two RTX 5080 video cards, which take about 2 hours per 15,000 iterations on an 18-second file for the adversarial noise to be unnoticeable to the human ear. That means that just my upcoming album is taking around 2 weeks of non-stop GPU grinding, which comes in at around 242 kWh (roughly 700 W of GPU draw running around the clock for two weeks), which in my case can be mostly
offset by the hot Georgia sun. But it
would cost most people in the US between
$40 and $150 worth of electricity
depending on their location and if they
stagger training sessions into off- peak
hours. So, while this all works really
well, the goal is to make this much more
efficient and scale it to where it can
be offered as an API that's hosted and
processed from Topset Labs, which, by the way, is what Voice Swap will be called in the future due to the
diversification of our offerings. But
that way, a willing distributor can
offer the option to AI proof or
poisonify your music when you upload it
to streaming
services. I've made a few videos
covering this, but music distribution is
not exactly great these days. In early
2024, TuneCore had falsely accused me of
fake streaming on Spotify, removing my
entire catalog from every store and
streaming service without even notifying
me. Fortunately, this resulted in a lot
of bad press that ended up being noticed
by their parent company. And after a lot
of dick swinging, they restored most of
it, but royally screwed up the metadata.
So, in the last year, while my monthly
listeners have continued to grow on
other platforms, TuneCore's mistake
resulted in me losing close to 100,000
listeners per month from my primary
library. Since that happened, I've been
negotiating my catalog with a lot of
different companies, and I came really
close to just selling the entire thing,
but I'm glad that I didn't because I
managed to find a diamond in the rough
in one of the potential buyers. I
started meeting with and pitching my AI
music detection to Jorge Brea, the CEO
of Symphonic Distribution. He seemed
open-minded enough to hear my plan with
Poisonify and Harmony Cloak and has the
resources to potentially incorporate an
API as an optional service for other
musicians in the future should we be
able to combine them into an efficient
process. When you're uploading your
album and you're uploading your release
cover, then you tell us like how much AI
was used to create this album cover. Was
all of it done using AI? Was some of it
done using AI? Or was none of it using
AI? And the same thing at the track
level mostly for us to be able to just
have awareness of it. We're doing that
because we're trying to show responsible
uh thought processes around AI and a lot
of the DSPs haven't yet come out with legitimate guidance on what
they will do in terms of this content.
So this is kind of our way of starting
to inventory and being able to just get
a sense of how much of this is actually
happening within the ecosystem. So now
I'm doing something that I thought that
I may never do again. Finishing and
mastering a new album. My entire
discography will also be randomly encoded
with one of these poison pill methods or
a combination of them or a variation of
them. I've also encoded some of them
with inaudible random adversarial noise
that does absolutely nothing at all. And
the reason I'm not going to tell anybody
which tracks are encoded with what is to
obfuscate how this works technically so
it can't be avoided down the line as AI
music companies train new models. We've
covered a whole lot here, but another
thing that I'll cover more extensively
in a future video as I test more devices
is broad protection from AI listening
devices through targeted pressure waves.
If you haven't noticed, everywhere we
go, there's a combination of both smart
and dumb microphone equipped devices
listening to us. Let's say that you came to see me perform live at a small, intimate concert, and I want to play a song for you that I haven't really finished yet, but I just kind of want to bounce it off the audience. Actually, I don't want your Instagram followers to see it. So, uh, you'll just... I can also make it device-specific by playing very specific audio files from
my phone. Alexa, what is 2 + 2? 2 + 2 is four.
Alexa, Alexa,
Alexa,
Alexa, Alexa, Alexa,
Alexa, Alexa.
I'll definitely be exploring this stuff
a lot more in the future, but for now,
I'm just really happy that artists may
soon have a way to push back using
technology without having to depend on
copyright or IP laws, because those
things have utterly failed us in recent
years in regards to AI. And even
expanding on that, I'm glad to be
involved in technology that will someday
give you the option to physically
protect your music or even personal
conversations from being recorded or logged.
The entire generative AI industry has a
much larger existential problem than
artists or creators using the same class
of technology to defend themselves. They
have to worry about the Pareto principle, otherwise known as the 80/20 rule. It's
not meant to be precise. It's more of an
estimate. But for example, many people
spend 80% of their time only wearing 20%
of the clothing that they own. Many
businesses get about 80% of their sales
from only 20% of their products. A lot
of software developers notice that 80%
of their bugs are caused by 20% of the
code. In a lot of healthcare systems,
about 80% of resources are used by only
20% of patients. I could keep going with
this. Considering how ubiquitous and
omnipresent the Pareto principle is in
computer science and data science, think
about this. Have you noticed that AI
image generators got really good really
fast, but are still nowhere near
perfect? In the last 2 years, it would
be hard for the average person to point
out a difference at all in AI image
generators. Even with the insane amount
of hype and investment that's going into
it, they still can't seem to figure out
things like text and hands without using
special tricks or extensions. And music
isn't that much different. When you see Suno or Udio releasing these new versions with new features, most of the new features are not in the generative
quality themselves, but in features like
inpainting. And when there is an
improvement in the sound quality, many
times the customer finds themselves
trading customization for quality. For
example, the music may sound more
realistic or clear, but now it's
ignoring most of the text in your
prompt. I suspect that the reason for this is that these AI models quickly improved 80% with only 20% of the time, investment, and work of training them.
And now just getting a 5% improvement on
those models is an expensive,
complicated, and unprofitable grind.
That's why it's possible for a company
like Topset Labs or Voice Swap to be
successful without any sort of runway
investment. Concentrating on the input
data and working with vocalists and
cutting them in financially is way less
expensive and much easier than the trial
and error of retraining base models over
and over again for voices. And having
artists involved creatively in that
process is also way more efficient than
fine-tuning a vocal model for another
100,000 iterations. The biggest downside
to this business model is that paying
people doesn't seem as sexy to
investors, which is something that
should be sat on and digested for a
while. But you have to cut them in on
the profit and pay them royalties
long-term for the stretch in order for
this to work. Major record labels
figured this out about 100 years ago.
And we all know that there's no shortage
of greed in that industry. Perhaps a
solid way of thinking about any
generative AI industry in relation to
art or any other creative industry is
like a race car. You could design the
ultimate car and raise incredible
amounts of money to engineer that car
and then make it and it could be the
fastest and sexiest and most efficient
car around, but it will be a tremendous
failure and waste of money if you only
put a tiny little bit of fuel in it and
prevent it from getting to its
destination or finish line. I'm not an
AI hater, not by a long shot. You could
go back into the early days of this
channel or even earlier with my music
and see that I have been fascinated with
generative AI for a very long time
before it became integrated with neural
networks. But the business side of me
firmly believes that developing a useful
tool will pay out much higher than
developing an investment
scheme. This video is sponsored by some
of you, my viewers. In fact, a ton of
research in this video has been paid for
by my viewers through Patreon. So, thank
you for being part of that. And if you
want to join a large, healthy, inspiring
Discord community and have access to my
music and field recordings and audio
production assets, you can join for as
little as $1. Thanks for watching. Keep