The Art Of Poison-Pilling Music Files | Benn Jordan
Video Transcript
If you're a musician, I have some great
news for you. I've been really busy
lately and a little bit
naughty. I've made a living as an
independent professional musician for
over 25 years now. And once tech
companies started raising millions of
venture capital dollars and scraping my
music without my consent, then
generating shittier music with it that
is inadvertently associated with my name
and then attempting to resell that in
the same economy in which I make money
from my music, I was just like, you know what? I still enjoy making music all the time, but I have entirely stopped releasing it. But what I'm
showing you today is a type of encoding
that not only makes a music file more or
less untrainable by generative AI
companies, but actually has the ability
to decrease the quality and efficiency
of their entire data set. I'm going to
be showing you a lot of demonstrations
of this technology in this video, but it
attacks AI in a whole lot of other ways,
and some of them are really scary. For
example, I could say something
completely normal here or even play you
the sound of an eagle. And that sound
would be telling Siri or your Echo or
Google Home device to unlock your doors.
The adjective vulnerable was usually
defined as capable of or susceptible to
being wounded or hurt. Or I could make
Spotify think that a bunch of sex noises
are an acoustic Christian folk song. And
we're going to be exploring all of these
things. Unethical generative AI
companies have made artists feel
incredibly powerless for quite some time
now. But all of that is about to change.
And I am extremely excited to be able to
finally tell you about it in this video.
Come on. [Music]
[Music]
The modern chapter of the AI music story
begins in 2015 with the publication and
proposal of U-Net, a frankly ingenious method of using a convolutional neural network for advanced pattern recognition in biomedical imagery. In simpler terms, U-Net invented a way to recognize things in a much, much more efficient way that
didn't even require more than a few
images of training data to work. I'd go
more into the detail of how it works,
but some of my viewers may be driving,
and I don't want to cause any accidents
by making them drowsy. All we need to
know for now is that U-Net's architecture
inspired the technology that is behind
virtually all generative AI models
today. Then in 2016, Google introduced
Magenta, which is a research project
that uses machine learning to scan
insane amounts of music to learn from it
in order to create new tools to create
music. There are a lot of neat projects
with Magenta that you can play with for
free today. You can have a piano play a
duet with you, or you can turn your
voice into a saxophone sound, or play a
harp that morphs between learned sounds,
or make a bunch of circles to make a
basic melody. It's fun stuff to play
with, but a lot of it is also just a
really expensive and inefficient way to
do what modular synths like that have
been doing for decades. There is one
thing that you should consider before we
move on. Since this era, generative AI music has, with very few exceptions, primarily been showmanship for investors and more or less a solution without any sort of problem. For example, another huge AI
music technology landmark in 2016 was
Sony unveiling a song that was made with
artificial intelligence called Daddy's
Car. Now, the song sounds pretty
mid, so imagine my complete and utter
surprise to find out that it was
performed, recorded, and mastered by
humans. And the lyrics, on the other hand, sound like some stereotypically bad early AI-generated writing. But it turns out that those were written by a human, too. Now you may be asking yourself how
exactly this qualifies as artificial
intelligence. But as the lyrics of the
song famously say, from "Taxman" until "Tomorrow Never Knows." It wouldn't be
long before the goal of these tools was
to actually generate AI music. With OpenAI releasing MuseNet and ByteDance acquiring Jukedeck in 2019, Meta and Stability followed suit. And it wasn't long before we had a bunch of voice-cloning services and consumer-targeted subscription services that offered AI-generated music, like Suno and Udio. Here's a fun social experiment I
would like you to participate in. The
next time a company announces a
generative AI feature or product or
service that is for sale or for
subscription, ask them the magic 10
words. What data did you use to train
your base model? More often than not,
those 10 words magically work as a mute button for these tech companies, because they either don't know the answer to the
question or answering the question will
make them liable for literally billions
of dollars in IP infringement damages.
That's because they just recklessly
scraped Spotify and YouTube and Audible
and virtually anywhere that they could
find data, whether it was copyrighted or
not. And remember how the Recording Industry Association of America would sue soccer
moms because their kids downloaded a few
albums? It's just like that times a few
hundred million and also while raising a
few hundred million in venture capital.
And then naturally like a swarm of
locusts, tons of opportunists started
using music generating services to
generate millions of songs to put on
streaming services like Spotify, which
then siphoned royalties away from actual
musicians. And as if all of this wasn't
enough of a slap in the face to
musicians, services like Suno literally
set up funds to pay their top AI music
creators without ever even considering
paying the actual musicians that made
the music their entire service trained
on. Starting in 2023, I had met with
various US senators and their staffers
about changing legislation to require
generative AI companies above a certain
size to start keeping a record of their
training data and requiring consent from
intellectual property holders before
using it for their products and
services. A lot of artists' advocacy
groups and unions are also doing this
and quite a few early bills were penned.
But when the presidential inauguration
was crammed full of tech titans who are
spending billions on generative AI, I
kind of realized that we're going to
have to take this into our own
hands. This is a little bit of personal
history, but it's very relevant to this
video, so stick with me. In early 2023,
I joined DJ Fresh and Nico Polarin in
co-founding Voice Swap AI, and we
expanded the team by proposing a royalty
mechanism for vocalists participating in
Generative AI that not only paid them
ongoing licensing fees, but made them
part of an equity pool in the company.
What we found is that by training an
entirely new vocal base model on
consensual data and fine-tuning it in
collaboration with the artists, the
resulting voice model sounded superior
to our competitors. As a result of this,
we quickly grew to 150,000 users without
so much as a penny in third-party or
venture capital investment. We became
busy with large business clients and
have remained the only AI company that
qualifies for BMAT certification and can
pay royalties on this new type of
intellectual property. And much more
importantly, the majority of our
vocalists were earning more annually
through their voice model than the
entirety of what they were earning on
Spotify and other streaming services.
Come on, not bad, right? Last year, one
of the projects that I put a lot of time
and research into was finding a
foolproof way to take an original music
master file and detect if the music in
that file was generated with AI or not.
If you come up with a solid idea or
maybe a new or original process to
accomplish a task and you release it to
the public for free, it's only a matter
of time before big companies will say, "I made this," and then gatekeep it behind a paywall. Fortunately, somebody
tipped me off that this was happening
and I was able to apply for a patent
myself to prevent it from happening in
the future. But I learned this expensive
lesson a few months ago. The problem
with my little AI music detector is not
in its functionality, but in the lack of
incentive to use it. The people who are
generating the AI music to put it on the
streaming services and make money have
to pay middlemen distributors to publish
it and manage their royalties. And while
for some reason I thought that these
services would jump at the chance to
refuse money and not accept AI music,
they didn't seem interested in this
proposal. Surprising, I know. But
meanwhile, I had been researching a type
of technology that actually isn't all
that new. Adversarial noise. This term
first sprouted up a decade ago when
virtually every piece of technology
included a little AI assistant that you
could talk to. The infosec industry has
been aware of this for a while now, as
the information that a neural network
gathers from a sound is very different
than what a human brain gathers. This
means that just about anything that you
can accomplish via a voice command, like
ordering something on Amazon or opening
your garage door, can presumably be
triggered by a sound that human beings
cannot identify. And this is
accomplished by using adversarial noise.
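As a rough sketch of the mechanism only, not the specific attack shown in this video, here's how a gradient-based audio perturbation can be built against an open speech-recognition model. torchaudio's wav2vec2 pipeline is used as a stand-in, and the file name, step count, and noise budget are assumptions:

```python
# Illustrative only: perturb a clip so an open ASR model (wav2vec2) no longer
# "hears" its original transcription, while keeping the added noise tiny.
import torch
import torchaudio

bundle = torchaudio.pipelines.WAV2VEC2_ASR_BASE_960H
model = bundle.get_model().eval()

waveform, sr = torchaudio.load("clip.wav")                 # hypothetical input file
waveform = waveform.mean(0, keepdim=True)                  # mono
waveform = torchaudio.functional.resample(waveform, sr, bundle.sample_rate)

with torch.no_grad():
    clean_emissions, _ = model(waveform)
    clean_labels = clean_emissions.argmax(-1)              # what the model hears right now

delta = torch.zeros_like(waveform, requires_grad=True)     # the adversarial noise
optimizer = torch.optim.Adam([delta], lr=1e-4)
eps = 0.002                                                # amplitude budget to stay near-inaudible

for _ in range(200):
    emissions, _ = model(waveform + delta)
    # Untargeted attack: push predictions away from the clean transcription.
    loss = -torch.nn.functional.cross_entropy(
        emissions.reshape(-1, emissions.size(-1)), clean_labels.reshape(-1)
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    with torch.no_grad():
        delta.clamp_(-eps, eps)

torchaudio.save("clip_adversarial.wav", (waveform + delta).detach(), bundle.sample_rate)
```

A targeted variant of the same loop, minimizing the loss toward a chosen phrase's label sequence instead of maximizing it, is what turns near-inaudible noise into a hidden command.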
Let me demonstrate. Here's an attack on
an Amazon Echo Show, which by the way is
like the worst tech device that I've
ever used in my life. Let's just play
some soft classical music in the
background. And whoopsie. Benn Jordan, born October 28th, 1979, is an American
musician operating under many
pseudonyms. Here's my attack on the AI
model that's been used for speech
recognition by Meta, Facebook,
Instagram, Oculus, and then a whole lot
of others as well. If we run it directly, we can see exactly what the AI hears. Let's move over to the white-hat side of
this attack. Giant generative AI firms
have been scraping copyrighted
audiobooks to train their voice models,
but also learning from the content
within the audiobook itself. So, let's
encode that audio file to make the AI
hear nonsense. And since it's using a
self-supervised neural network like
HuBERT, it'll reinforce a false
positive, meaning that the entire model
will be shittier after it's done
training on my adversarial noise here.
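The same recipe can be aimed at a self-supervised model's feature space rather than a transcription head. A minimal sketch, assuming HuBERT via torchaudio as a stand-in, with illustrative file names and budgets:

```python
# Illustrative only: nudge an audiobook clip so a self-supervised model (HuBERT)
# extracts features far from the clean ones, i.e. it "hears" something else.
import torch
import torchaudio

bundle = torchaudio.pipelines.HUBERT_BASE
model = bundle.get_model().eval()

waveform, sr = torchaudio.load("audiobook_clip.wav")       # hypothetical input file
waveform = waveform.mean(0, keepdim=True)                  # mono
waveform = torchaudio.functional.resample(waveform, sr, bundle.sample_rate)

with torch.no_grad():
    clean_features, _ = model(waveform)                    # what the model would learn from

delta = torch.zeros_like(waveform, requires_grad=True)
optimizer = torch.optim.Adam([delta], lr=1e-4)
eps = 0.001

for _ in range(300):
    features, _ = model(waveform + delta)
    loss = -torch.nn.functional.mse_loss(features, clean_features)   # drive the features away
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    with torch.no_grad():
        delta.clamp_(-eps, eps)

torchaudio.save("audiobook_clip_poisoned.wav", (waveform + delta).detach(), bundle.sample_rate)
```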
My alarm clock woke me that day as
always at
6:13. I went to the kitchen, made coffee
and toast. My alarm clock woke me that
day as always at
6:13. I went to the kitchen, made coffee
and toast. My friends at the University
of Tennessee Knoxville have made an art
out of these types of attacks. This one,
for example, utilizes a small physical speaker to introduce an inaudible layer to your own speech and commands in real time by mimicking small environmental sounds and real-world distortions. So even
if you have high security settings where
your AI assistant can only respond to
your voice for high-security commands
like disabling your alarm system, that
also can be manipulated. Speaking of the
University of Tennessee, I took a trip
up to steamy old Knoxville to visit
Syed Ali Meerza and Jian Liu to learn more about HarmonyCloak, one of the research
team's more recent projects that encodes
a music file with adversarial noise that
utterly breaks AI's ability to find
melody or rhythm. Here's some of their
own demos with some pretty simple, basic music. [Music] [Applause] And here is what the AI models generate based on the unencoded music files. [Music] Now, here's what they create from the music encoded with HarmonyCloak. [Music] Anyone familiar with AI training
will recognize this, by the way. It's
TensorBoard, and it functions as a
convenient visual guide to know when
training is no longer improving your
model. That way you know when to stop
the process so as not to waste time and
energy. The red line is from training on
normal audio files and the blue is files
encoded with HarmonyCloak. You'll
notice that almost immediately the model
stops being able to improve itself. Have
you tested the original versus perturbed
music again? Like have you tested it
with other students? Have people claimed
to be able to hear the difference? We
have a user study. Uh in the paper we
involved over 30 participants. I think
most of them are music lovers. We asked
them to give a rating. I think the rating we got for the unlearnable examples is pretty similar to the clean music. The model has been rigorously tested, and the team has been actively working to make it more efficient. It was then and there
in Knoxville when I suddenly had the
urge to release music again as a test
subject for widespread development. But
I had a lot more work to do and a whole
lot of training and
testing. Okay, so remember U-Net from earlier in this video. U-Net pioneered
something called diffusion. Instead of
trying to generate an image or sound by
drawing it, they start with noise and
then shape it based on what it learned
from training. Now, music generating
algorithms seem like a pretty big
technological leap from image generating
algorithms. But when you introduce a
music file to a modern neural network to
learn from and train on, it's merely looking at a spectrogram image of the audio, which can then be interpreted back as
audio. If you go back some years on this
channel, you could see some fun videos
of me playing creatively with software
that does this. So in that spectral
image, the AI looks for two classes of
characteristics. The first is the tones,
the rhythm, the melody as seen here, and
more importantly, why this chord often
seems to come after this chord in a
particular style of music or how or why
a swing beat works in a particular
situation. The second is the sounds
themselves, something that Google's
Magenta had pioneered very early on.
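To make the spectrogram-as-image idea concrete, here's a small sketch of the round trip in torchaudio (the file names are placeholders): the waveform becomes a 2-D magnitude spectrogram, and Griffin-Lim estimates the missing phase to turn that image back into audio.

```python
# Audio -> spectrogram "image" -> audio again (lossy, since phase is re-estimated).
import torchaudio

n_fft = 1024
waveform, sr = torchaudio.load("song.wav")                   # hypothetical input file

to_spec = torchaudio.transforms.Spectrogram(n_fft=n_fft, power=2.0)
spectrogram = to_spec(waveform)                               # (channels, freq_bins, time_frames)

griffin_lim = torchaudio.transforms.GriffinLim(n_fft=n_fft, power=2.0)
reconstructed = griffin_lim(spectrogram)                      # back to a waveform

torchaudio.save("song_roundtrip.wav", reconstructed, sr)
```

That 2-D array is effectively what a spectrogram-based model "sees" during training, which is why image-style tricks transfer to audio so readily.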
That sound-level recognition is why AI stem separation became
one of the first features sites and
software offered. Using these types of
algorithms, the AI can detect what part
of the spectral image has drum sounds in
it or bass or guitar. Anyway, that
second class of functionality is what I
spent an absurd amount of time trying to
break. Now, remember what I showed you
with adversarial noise, inserting
inaudible commands to say something
different to a speech recognition
engine. Well, turns out that I could do
these targeted attacks on instruments,
too. One of the most advanced and
accurate instrument classifiers is an
API offered by CAM. So, what does this
sound like to you? If you guessed
cymbal, good job. However, after encoding it with adversarial noise, instrument classifiers think it's a harmonica, or that this song has string instruments in it. This could also potentially have a snowball effect by making a generative AI model continuously fooled by false positives, so it'll be more likely to think that every similar synthesizer sound it encounters is also a string quartet. And for this reason, I've been calling my attack Poisonify. HarmonyCloak combined with Poisonify makes music not only untrainable, but it threatens to degrade the quality of the entire model.
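Mechanically, a targeted version of the earlier loop is all it takes. Because the commercial classifier isn't public or differentiable from the outside, the tiny network below is only a stand-in, and the class indices, file names, and budgets are invented for illustration; the actual Poisonify encoding isn't published.

```python
# Illustrative targeted attack: push a cymbal clip toward a chosen wrong label
# (e.g. "harmonica") for a differentiable, spectrogram-based classifier.
import torch
import torch.nn as nn
import torchaudio

N_CLASSES = 10                  # hypothetical label set
TARGET_CLASS = 3                # pretend index 3 means "harmonica"

mel = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_mels=64)

classifier = nn.Sequential(     # placeholder for a real, pretrained instrument tagger
    nn.Conv2d(1, 16, 3, stride=2), nn.ReLU(),
    nn.Conv2d(16, 32, 3, stride=2), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, N_CLASSES),
).eval()

waveform, sr = torchaudio.load("cymbal.wav")                 # hypothetical input file
waveform = waveform.mean(0, keepdim=True)                    # mono
waveform = torchaudio.functional.resample(waveform, sr, 16000)

delta = torch.zeros_like(waveform, requires_grad=True)
optimizer = torch.optim.Adam([delta], lr=1e-3)
eps = 0.002
target = torch.tensor([TARGET_CLASS])

for _ in range(500):
    logits = classifier(mel(waveform + delta).unsqueeze(1))   # (batch, 1, mels, frames)
    loss = nn.functional.cross_entropy(logits, target)        # pull toward the wrong class
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    with torch.no_grad():
        delta.clamp_(-eps, eps)                               # keep the change near-inaudible

torchaudio.save("cymbal_poisonified.wav", (waveform + delta).detach(), 16000)
```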
Suno is a generative AI company that started out in late 2023 with text-to-speech and generative music models and has now raised hundreds of millions of venture capital dollars. They are now in a legal battle with the Recording Industry Association of America for damages amounting to something like $1.5 trillion for training their models on copyrighted data without asking anyone for permission. Now, Mikey Shulman, Suno's CEO, has some really enlightening
takes. It's not really enjoyable to make
music. Now, I think the majority of
people don't enjoy the majority of the
time they spend making music. And by
enlightening, I meant that there simply
has to be something enlightening about
being so astray from reality that you
would think that musicians don't enjoy
making music. And within the delusional
reality that Michael lives in, this is
the problem that he's providing a
solution to. And to be fair, this
mindset opens up a lot of really
lucrative and creative business
opportunities. Maybe he could expand
Suno to make autonomous machines that
will roll bowling balls down a lane, so
you could just pay him a subscription
fee instead of bowling with your
friends. Aside from the Mental
Gymnastics Olympics that the company was
founded on, Suno has a really useful
feature where you can upload a song and
then the service will automatically
extend it. It doesn't seem to listen to
prompts very well and it doesn't sound
very good, but it does provide a great
test bed for my little project here. So,
here we go. We can upload my original song here. [Music] And now here is Suno's AI
extension of that song. Fragments of
light lost in data
sweep shadow memory in the glowing
rain. Okay, now let's upload my Poisonify-encoded track. [Music] And here is Suno's AI-generated extension. [Music]
I would describe this as music from an
airport spa that somebody downloaded off
of Napster in 1999. There's an even more
recent and competitive Generative AI
music outfit from China that I only know
about initially just because they blew
me up so much in my email trying to get
me to promote them on this channel. Ask
and you shall receive, I guess. It's
called MiniMax Audio, and a lot of
creators seem to be raving about it, but
I'm assuming that they're raving about
it because they're being paid to rave
about it. Uh, it doesn't really blow my
mind or anything. Anyway, if you tap
directly into the API, it has a feature
that allows you to extend a song the
same way that we did in the Suno test.
And so technically how this works is
they scan the song that you uploaded and
then fine-tune the model to generate
more of what it had heard. [Music] And now let's feed it the song that's been encoded with HarmonyCloak and Poisonify. Ultimately, we have this. [Music] Another popular model that has this extend feature is Meta's MusicGen. I had tried that out early on, but I might as well show you what that does as well.
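For reference, MusicGen's continuation mode can be driven from Meta's audiocraft package. A rough sketch, where the checkpoint size, prompt length, durations, and file names are assumptions rather than the exact setup used here:

```python
# Extend a song with MusicGen: feed it a short prompt clip and let it continue.
import torchaudio
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

model = MusicGen.get_pretrained("facebook/musicgen-small")
model.set_generation_params(duration=30)            # total length, prompt included

prompt, sr = torchaudio.load("original_song.wav")   # hypothetical input file
prompt = prompt[..., : 10 * sr]                     # keep the prompt short (~10 s)

continuation = model.generate_continuation(prompt, sr, progress=True)
audio_write("extended_song", continuation[0].cpu(), model.sample_rate)
```

Running the same script on the clean file and on the HarmonyCloak/Poisonify-encoded file is essentially the comparison being made here.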
Here is audio fine-tuned on the original, unencoded song. [Music] And then, just like MiniMax, when we upload the version that's been encoded by HarmonyCloak and Poisonify, it hangs. This is pretty, pretty good. So, as I'm
editing this video, YouTube just
announced that they will be launching a
new feature for YouTube creators where
you can generate your own AI music. So,
there's some very conflicting interests
on the same website. God damn. There are
some challenges, though. First of all, instrument classification isn't exclusively used by generative AI; it's also used by Spotify to sort out the recommendation algorithm. Now, my most abrasive music may be recommended to people who exclusively listen to barbershop quartets, which I personally
think is awesome, but I could see how
some musicians might not want that. The
other challenge is efficiency. HarmonyCloak requires a bunch of high-end
specialized GPUs currently. However, the
team is working on a much more efficient
model that they are testing as I record
this. And then here locally, when working on Poisonify, I've been using two RTX 5080 video cards, which take about 2 hours per 15,000 iterations on an 18-second file for the adversarial noise to be unnoticeable to the human ear. That means that just my upcoming album is taking around 2 weeks of non-stop GPU grinding, which comes in at around 242 kWh (roughly 700 W of GPU draw running around the clock for two weeks), which in my case can be mostly
offset by the hot Georgia sun. But it
would cost most people in the US between
$40 and $150 worth of electricity
depending on their location and if they
stagger training sessions into off- peak
hours. So, while this all works really
well, the goal is to make this much more
efficient and scale it to where it can
be offered as an API that's hosted and
processed from Topset Labs, which, by the way, is what Voice Swap will be called in the future due to the
diversification of our offerings. But
that way, a willing distributor can
offer the option to AI proof or
poisonify your music when you upload it
to streaming
services. I've made a few videos
covering this, but music distribution is
not exactly great these days. In early
2024, TuneCore had falsely accused me of
fake streaming on Spotify, removing my
entire catalog from every store and
streaming service without even notifying
me. Fortunately, this resulted in a lot
of bad press that ended up being noticed
by their parent company. And after a lot
of dick swinging, they restored most of
it, but royally screwed up the metadata.
So, in the last year, while my monthly
listeners have continued to grow on
other platforms, TuneCore's mistake
resulted in me losing close to 100,000
listeners per month from my primary
library. Since that happened, I've been
negotiating my catalog with a lot of
different companies, and I came really
close to just selling the entire thing,
but I'm glad that I didn't because I
managed to find a diamond in the rough
in one of the potential buyers. I
started meeting with and pitching my AI
music detection to Jorge Brea, the CEO
of Symphonic Distribution. He seemed
open-minded enough to hear my plan with
Poisonify and Harmony Cloak and has the
resources to potentially incorporate an
API as an optional service for other
musicians in the future should we be
able to combine them into an efficient
process. When you're uploading your
album and you're uploading your release
cover, then you tell us like how much AI
was used to create this album cover. Was
all of it done using AI? Was some of it
done using AI? Or was none of it using
AI? And the same thing at the track
level mostly for us to be able to just
have awareness of it. We're doing that
because we're trying to show responsible
uh thought processes around AI and a lot
of the DSPs haven't yet come out with legitimate guidance on what
they will do in terms of this content.
So this is kind of our way of starting
to inventory and being able to just get
a sense of how much of this is actually
happening within the ecosystem. So now
I'm doing something that I thought that
I may never do again. Finishing and
mastering a new album. My entire
discography will also be randomly encoded
with one of these poison pill methods or
a combination of them or a variation of
them. I've also encoded some of them
with inaudible random adversarial noise
that does absolutely nothing at all. And
the reason I'm not going to tell anybody
which tracks are encoded with what is to
obfuscate how this works technically so
it can't be avoided down the line as AI
music companies train new models. We've
covered a whole lot here, but another
thing that I'll cover more extensively
in a future video as I test more devices
is broad protection from AI listening
devices through targeted pressure waves.
If you haven't noticed, everywhere we
go, there's a combination of both smart
and dumb microphone equipped devices
listening to us. Let's say that you came to see me perform live at a small, intimate concert, and I want to play a song for you that I haven't really finished yet, but I just kind of want to bounce it off the audience. Actually, I don't want your Instagram followers to see it. So, uh, you'll just... I can also make it device-specific by playing very specific audio files from
my phone. Alexa, what is 2 + 2? 2 + 2 is four.
Alexa, Alexa,
Alexa,
Alexa, Alexa, Alexa,
Alexa, Alexa.
I'll definitely be exploring this stuff
a lot more in the future, but for now,
I'm just really happy that artists may
soon have a way to push back using
technology without having to depend on
copyright or IP laws, because those
things have utterly failed us in recent
years in regards to AI. And even
expanding on that, I'm glad to be
involved in technology that will someday
give you the option to physically
protect your music or even personal
conversations from being recorded or logged.
The entire generative AI industry has a
much larger existential problem than
artists or creators using the same class
of technology to defend themselves. They
have to worry about the Pareto principle, otherwise known as the 80/20 rule. It's
not meant to be precise. It's more of an
estimate. But for example, many people
spend 80% of their time only wearing 20%
of the clothing that they own. Many
businesses get about 80% of their sales
from only 20% of their products. A lot
of software developers notice that 80%
of their bugs are caused by 20% of the
code. In a lot of healthcare systems,
about 80% of resources are used by only
20% of patients. I could keep going with
this. Considering how ubiquitous and
omnipresent the Pareto principle is in
computer science and data science, think
about this. Have you noticed that AI
image generators got really good really
fast, but are still nowhere near
perfect? In the last 2 years, it would
be hard for the average person to point
out a difference at all in AI image
generators. Even with the insane amount
of hype and investment that's going into
it, they still can't seem to figure out
things like text and hands without using
special tricks or extensions. And music
isn't that much different. When you see Suno or Udio releasing these new versions with new features, most of the new features are not in the generative
quality themselves, but in features like
inpainting. And when there is an
improvement in the sound quality, many
times the customer finds themselves
trading customization for quality. For
example, the music may sound more
realistic or clear, but now it's
ignoring most of the text in your
prompt. I suspect that the reason for this is that these AI models quickly improved 80% with only 20% of the time, investment, and work of training them.
And now just getting a 5% improvement on
those models is an expensive,
complicated, and unprofitable grind.
That's why it's possible for a company
like Topset Labs or Voice Swap to be
successful without any sort of runway
investment. Concentrating on the input
data and working with vocalists and
cutting them in financially is way less
expensive and much easier than the trial
and error of retraining base models over
and over again for voices. And having
artists involved creatively in that
process is also way more efficient than
fine-tuning a vocal model for another
100,000 iterations. The biggest downside
to this business model is that paying
people doesn't seem as sexy to
investors, which is something that
should be sat on and digested for a
while. But you have to cut them in on
the profit and pay them royalties
long-term for the stretch in order for
this to work. Major record labels
figured this out about 100 years ago.
And we all know that there's no shortage
of greed in that industry. Perhaps a
solid way of thinking about any
generative AI industry in relation to
art or any other creative industry is
like a race car. You could design the
ultimate car and raise incredible
amounts of money to engineer that car
and then make it and it could be the
fastest and sexiest and most efficient
car around, but it will be a tremendous
failure and waste of money if you only
put a tiny little bit of fuel in it and
prevent it from getting to its
destination or finish line. I'm not an
AI hater, not by a long shot. You could
go back into the early days of this
channel or even earlier with my music
and see that I have been fascinated with
generative AI for a very long time
before it became integrated with neural
networks. But the business side of me
firmly believes that developing a useful
tool will pay out much higher than
developing an investment
scheme. This video is sponsored by some
of you, my viewers. In fact, a ton of
research in this video has been paid for
by my viewers through Patreon. So, thank
you for being part of that. And if you
want to join a large, healthy, inspiring
Discord community and have access to my
music and field recordings and audio
production assets, you can join for as
little as $1. Thanks for watching. Keep