George Hotz | Programming | coding in UOps on AMD MI350X | tickets.comma-con.com/events/comma/1859964 | george hotz archive
Summary
Core Theme
This is the live chat log from a George Hotz programming stream, primarily featuring discussion of GPU programming, AI development, and the tinygrad framework, interspersed with general chat and viewer interactions.
Transcript
dexterr145: hey
swimming_dragon: oh shit
cacaosteve: hi
ravroid: Good evening sir
tboah: helloooooo
0xTomOstSec: long time no seen
userismale: Yoo
ouroboredos: No sub mode plox
kristofferhelle: welcome back
m3ntors: Hey geooo hello man
endritpb: hi
leikowo: ey it’s the tiny guy
goudacheeseburgers: o7
dahorak: Oh shit
cuddlingusrl: non arc raiders stream
dantedante33: ai hotz or real one?
vox_bony: MAGA baby GoatEmotey GoatEmotey GoatEmotey elbunk4BEAST2 elbunk4BEAST2
tboah: it’s been a minute
Maoriratto: welcome back
lemmywemmy: sup og
gallium_ow: is mic a bit low for anyone else?
dexterr145: @gallium_ow yeah
einqs: Quiet
swimming_dragon: how is HK
evilfjalar: wazzza
sambusa_123: yeah, very quiet
zee133: thank you, I was able to retire because of my long amd position
jmickelly9: Smart move
ChandyNoEgo: happy halloween, george!!!!
cjcostanzo: not watching the world series?
0xTomOstSec: nice 2 see you Georgeyyy
ravroid: Sports ball?
voynich_1209: Hello, any reason why u moved to HK?
Tony_Roubino: Hong kongers were moving to london few years ago lmaooo
nuttyodin1: nuttyodin1 subscribed at Tier 1. They've subscribed for 31 months!
nuttyodin1: Nice to see you after a long time
NatelBeats: NatelBeats subscribed at Tier 1. They've subscribed for 65 months, currently on a 42 month streak!
NatelBeats: Internet really used to be so much better
sambusa_123: sambusa_123 subscribed at Tier 1.
sambusa_123: i now have a voice!
nuttyodin1: ExtropicAI says they are the ones to bankrupt Nvidia
leikowo: any async mma on these ?
leikowo: how much does the vendor blas get ?
leikowo: I meant something like tcgen05.mma
Fleetwood2048: f32?
Fleetwood2048: f32 out ah right
Fleetwood2048: everyone
sambusa_123: not me!
leikowo: 1 thread per float acc!
leikowo: lmao
leikowo: cast acc!
leikowo: is ref same type ?
leikowo: what's bufferize ?
Fleetwood2048: looking more like TVM every day SeemsGood
leikowo: what's .after() ?
Fleetwood2048: can't stump mr chen
johnathanschmidt: johnathanschmidt subscribed with Prime. They've subscribed for 8 months!
leikowo: yeah but that's because they got cute
leikowo: yes
leikowo: cute::print_latex
leikowo: cute algebra is pretty cool ngl
leikowo: from ISA would be easier to follow
leikowo: from the pdf you reference
nickhitman21: nickhitman21 subscribed at Tier 1.
nickhitman21: hi, how are you?
leikowo: I asked at the beginning about the ref dtype x)
nickhitman21: good day, sir. can you explain to me as for a 5 year old kid what is happening? NotLikeThis
nickhitman21: claude code vibecoding
leikowo: do the inputs partition and acc partition match the formula under 7.1.4.1 and 7.1.4.2 ?
leikowo: your C tensor is empty, wouldn't we get numbers even without running the custom kernel ?
leikowo: (we also got global variables named M,N,K)
leikowo: B is 1
leikowo: pain(a, b, c) is hilariously accurate
leikowo: I don't know how you got that right tbh
leikowo: there was two values something iirc
RickAndMoreTea: geez rick am i too late to halloween party here
RickAndMoreTea: lets goo
leikowo: insane that there are no mentions of async in CDNA4 ISA
leikowo: really makes you want to divide A layout by (16,32)
gclawes: Man I wish I knew how to program GPUs even a little bit, this looks cool
leikowo: yeah it's pretty sick
leikowo: that's the shortest kernel calling a TC by hand I've seen
stevethesniperr: i have no idea what the hell is happening here im just hoping to learn via osmosis
leikowo: @stevethesniperr You have two matrices A and B, you multiply them: this operation is called a GEMM. GEMMs can be efficiently computed on GPUs, nowadays GPUs include custom cores for accelerating GEMMs. mr hotz is writing a clean gpu function doing this
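For anyone learning by osmosis like stevethesniperr, the GEMM leikowo describes boils down to C[m, n] = sum over k of A[m, k] * B[k, n]. A minimal NumPy sketch of that reference computation follows; the sizes here are tiny illustrative values, not the stream's actual M, N, K.

    import numpy as np

    # Tiny illustrative sizes so the naive loop actually finishes;
    # the stream's kernel uses its own (much larger) M, N, K globals.
    M, N, K = 64, 48, 32

    A = np.random.randn(M, K).astype(np.float16)
    B = np.random.randn(K, N).astype(np.float16)

    # What a GEMM computes: C[m, n] = sum over k of A[m, k] * B[k, n],
    # with a float32 accumulator (the "f32 out" mentioned earlier in chat).
    C = np.zeros((M, N), dtype=np.float32)
    for m in range(M):
        for n in range(N):
            acc = np.float32(0.0)
            for k in range(K):
                acc += np.float32(A[m, k]) * np.float32(B[k, n])
            C[m, n] = acc

    # The library one-liner computes the same thing; tensor cores do this
    # math on small fixed-size tiles (e.g. 16x16x32) in hardware.
    assert np.allclose(C, A.astype(np.float32) @ B.astype(np.float32), atol=1e-2)

The hard part of the stream's kernel is not this math but feeding the tensor cores data fast enough, which is what the rest of the chat is about.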
somewatson: somewatson subscribed at Tier 1.
somewatson: puddot1Hi puddot1Hi
somewatson: do you have physical access to the MI350X or is this a cloud-based instance?
leikowo: @somewatson pretty sure it's one of the servers they got from AMD
leikowo: (physically)
somewatson: oh i see, nice
somewatson: happy halloween btw
somewatson: by the way, I’m subbing because I really want better driver support for AMD GPUs, so hope this sub helps
leikowo: UOp.group vectorizes stores ?
leikowo: It would take me a lot more lmao
leikowo: so now we do larger blocks ? like 2x2 TC blocks ?
somewatson: is there a huge difference between the mi300A and mi350x driver wise? I’m hoping any changes you make also help me out downstream since I’m running the mi300a
leikowo: is this calling the builtin ??
leikowo: B200 is a pain because it's all async
leikowo: yeye it's like an intrinsic
leikowo: B200 you have to do a persistent kernel, use tcgen05 (so new tmem which is annoying) and TMA for async loads & maybe async stores if needed too
fart_factory: OMG GEORGE IS BACK
leikowo: ah and I forgot thread block clusters and TMA multicast
somewatson: very nice, glad to hear about the contract. impressed that you were able to get them to focus on it
leikowo: fuck it, use MxK and NxK inputs
jmickelly9: What a trade you did back then with AMD stock. Also great work with the contract with AMD.
leikowo: 2200Mhz
leikowo: pretty cool that if you do the flops/clk on the different TC instructions the 32x32x16 is the same as 16x16x32
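One way to read that observation: a 32x32x16 MFMA does 32*32*16 = 16384 MACs while a 16x16x32 does 16*16*32 = 8192, so the per-clock rates only match if the larger instruction takes proportionally more cycles. A small sketch of the arithmetic, where the cycle counts are placeholders for illustration, not figures from the CDNA4 ISA guide:

    # MAC count per tensor-core (MFMA) instruction: M * N * K.
    def macs(m, n, k):
        return m * n * k

    # Hypothetical cycle counts purely for illustration; the real numbers
    # live in the ISA document being read on stream.
    shapes_and_cycles = {
        (32, 32, 16): 32,
        (16, 16, 32): 16,
    }

    for (m, n, k), cycles in shapes_and_cycles.items():
        print(f"{m}x{n}x{k}: {macs(m, n, k)} MACs, "
              f"{macs(m, n, k) / cycles:.0f} MACs/clk")
    # Both print 512 MACs/clk, which is the "flops/clk is the same"
    # observation (flops = 2 * MACs).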
somewatson: would you say it worth trying to set up on my mi300’s and begin to write up docs or are things still in motion?
leikowo: maybe easier to hit flops with cause you just need to dispatch less TC instructions and can use the other threads for data movement ?
leikowo: K is 32
leikowo: banger reshape
leikowo: er warp
leikowo: per block sorry
leikowo: AddrSpace.LOCAL is smem ?
leikowo: can we do vector load from gem ?
leikowo: dtype.vector from gmem*
leikowo: can we slice ?
leikowo: A[gx, :, gy, :]
leikowo: x)
leikowo: CuTe DSL uses slicing for this
leikowo: did you ever look into CuTe ? like divides, local_tile, partitions
leikowo: I am sold
leikowo: watch the MMA_traits<MMA_OP> file, they describe EVERY TC layout using cute layouts
leikowo: what's the register max per thread on amd ?
nuttyodin1: great day to hangout with you code
leikowo: all loops and K
leikowo: can you statically check that it's equivalent to matmul by unrolling all UOps and comparing with normal matmul ?
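A static check by unrolling UOps is what leikowo asks for; a simpler runtime spot-check against a reference matmul is sketched below. check_against_reference and custom_matmul are hypothetical names standing in for whatever callable wraps the hand-written UOp kernel.

    import numpy as np

    def check_against_reference(custom_matmul, M=256, N=256, K=512, atol=1e-2):
        """Runtime spot-check: run the custom kernel on random fp16 inputs and
        compare against a plain fp32 NumPy matmul. `custom_matmul` is a stand-in
        for whatever wraps the hand-written UOp kernel."""
        a = np.random.randn(M, K).astype(np.float16)
        b = np.random.randn(K, N).astype(np.float16)
        ref = a.astype(np.float32) @ b.astype(np.float32)
        out = np.asarray(custom_matmul(a, b), dtype=np.float32)
        max_err = np.max(np.abs(out - ref))
        assert max_err < atol, f"kernel diverges from matmul, max abs err {max_err}"
        return max_err

    # Usage (hypothetical): check_against_reference(my_uop_gemm)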
leikowo: this is a lot lot of registers but I'm not familiar with AMD. I know nvidia wouldn't let you
leikowo: registers spilling ?
leikowo: larger BLOCK_K maybe
leikowo: unroll arg for UOp.range ?
leikowo: no more instruction cache x)
leikowo: no vector stores ? :c
leikowo: rdna renderer ?
leikowo: (BLOCK_M//TC_M, BLOCK_N//TC_N, 4) ?
leikowo: 64,64,128
leikowo: simple loads not vectorized no ?
leikowo: reading through the rdna4 doc, apparently you can load blocks of memory from gmem to rmem
leikowo: also DS_READ_B128
dp_smh: dp_smh subscribed at Tier 1.
leikowo: ohhh tensor cores got dedicated registers on amd
ibarres: Kappa
leikowo: would be nice to copy from smem to them
leikowo: what's these 16bit global loads
leikowo: can we cast to half8 before loading ?
leikowo: lots of flops for a single warp
leikowo: this asm looks very good
leikowo: we could try multiple warps per block
leikowo: I think the smem -> rmem load layout doesn't matter much
leikowo: are we even getting the BW from global men ?
leikowo: mem*
Cos_ighn: howdy
leikowo: wait wat
leikowo: who's lidx0
leikowo: has to be first dim I think
Cos_ighn: Cos_ighn subscribed with Prime. They've subscribed for 5 months!
leikowo: are you not making 64 warps of 4 threads ??
leikowo: no there shouldn't be any acc x)
leikowo: bank conflicts aren't **this** bad I think
ibarres: thats water bro
Santiago_LHC: Very nice tea setup
leikowo: please show the generated code again
leikowo: we don't see the screen lmao
Santiago_LHC: your cam stayed too big
Santiago_LHC: power tea
leikowo: divide BLOCK_M//4//TC_M
leikowo: nah it was for the acc, you did it already now
leikowo: AMD employees seeing you get 16TF out of a 2PF card
leikowo: (I got 22TF on a B200 like two weeks into my internship)
leikowo: cause K is good
leikowo: doesn't grow accumulator
leikowo: wait As stores
leikowo: shouldn't they also be loaded using multiple warps ?!
leikowo: probably not the issue though :(
leikowo: BLOCK_M//4*BLOCK_N*BLOCK_K
leikowo: there is no way this is faster
leikowo: It should spill
leikowo: BLOCK_K should be involved
leikowo: yes, I just think we are memory bound and waiting for memory copies instead of pipelining them
leikowo: gmem
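The pipelining leikowo has in mind is the usual double-buffering pattern: prefetch the next K tile from global memory while the tensor cores work on the current one. A bare-bones Python skeleton of that loop structure, where load_tile and mma_on_tile are placeholder callables and the loads would be asynchronous on real hardware:

    # Double-buffered K-loop sketch: issue the load for tile k+1 before
    # computing on tile k, alternating between two buffers.
    def pipelined_gemm_k_loop(num_k_tiles, load_tile, mma_on_tile):
        buf = [load_tile(0), None]           # stage 0 preloaded
        for k in range(num_k_tiles):
            nxt = (k + 1) % 2
            if k + 1 < num_k_tiles:
                buf[nxt] = load_tile(k + 1)  # prefetch next tile (async on real HW)
            mma_on_tile(buf[k % 2])          # compute on the tile loaded earlier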
leikowo: can we diff fake and non fake kernel ?
leikowo: sorry, warp group 1 and warp group 4 not fake
leikowo: nonono
leikowo: I meant wg1 and wg4
leikowo: it's 6am sorry!
leikowo: just generated code
leikowo: like some loops should go away
leikowo: probably because we just duplicated the loaded data from gmem
ronmasas: How about brute forcing the best block_* combo?
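ronmasas's suggestion is the classic autotuning move: sweep the tile sizes and keep whichever combination compiles and runs fastest. A naive sketch under assumed names, where build_and_run stands in for compiling the UOp kernel with the given block sizes and returning its runtime:

    import itertools

    def brute_force_blocks(build_and_run, candidates=None):
        """Naive search over BLOCK_M/BLOCK_N/BLOCK_K, in the spirit of the
        suggestion above. `build_and_run(bm, bn, bk)` is a hypothetical stand-in
        that compiles and times the kernel, returning seconds."""
        if candidates is None:
            candidates = [16, 32, 64, 128, 256]
        best = None
        for bm, bn, bk in itertools.product(candidates, repeat=3):
            try:
                t = build_and_run(bm, bn, bk)
            except Exception:
                continue  # skip combos that spill registers or exceed smem
            if best is None or t < best[0]:
                best = (t, (bm, bn, bk))
        return best

    # Usage (hypothetical): best_time, (BM, BN, BK) = brute_force_blocks(run_gemm)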
codingfisch: The lonely knight against entropy still at work 🙌 You are not alone!
leikowo: was very cool to interact, it's 6:13am and I'm struggling to keep my eyes open. good night! (I'll still listen but likely fall asleep)
leikowo: for the gmem loads: for A, split M by WG and have threads access 8 successive halfs each; for B, split N by WG and same 8 halfs per thread
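As a rough illustration of that partitioning for the A tile: split the tile's rows across warp groups and give each lane one 8-half (16-byte) vectorized load along K. Every constant below is an assumed, illustrative value, not the stream's actual launch configuration:

    WAVE_SIZE = 64            # CDNA wavefront size
    NUM_WGS = 4               # warp groups cooperating on the tile
    BLOCK_M, BLOCK_K = 64, 128
    ELEMS_PER_LANE = 8        # 8 halfs = one 16-byte load per lane

    LANES_PER_ROW = BLOCK_K // ELEMS_PER_LANE   # 16 lanes cover one row of A
    ROWS_PER_PASS = WAVE_SIZE // LANES_PER_ROW  # 4 rows loaded per wave per pass

    def a_tile_first_offset(wg, lane):
        """(row, col) of the first half element this lane loads; the lane then
        steps row += ROWS_PER_PASS until its warp group's slice of BLOCK_M is done."""
        rows_per_wg = BLOCK_M // NUM_WGS              # 16 rows per warp group
        row = wg * rows_per_wg + lane // LANES_PER_ROW
        col = (lane % LANES_PER_ROW) * ELEMS_PER_LANE
        return row, col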
jmickelly9: What’s the bandwidth speed on the USB4 - Nvidia GPU
dp_smh: NVDA reaching out before Apple responds 😒
leikowo: you want a B200 server, a single B200 is a paperweight xd
somewatson: whatttt why no meetings?
somewatson: LUL
somewatson: are you passing out Halloween candy today? it’s super dark out here so I’m guessing you’re not in California right now
codingfisch: They will. You are so cheap!
sunnergunner: nope :)
pupscub: yay I am free
leikowo: GB200 superchip would be nice, there is no software for it (the C2C)
somewatson: HK is fun. I highly recommend Saizeriya. it’s a Japanese chain restaurant that does Italian food
somewatson: ngl but the best Italian food in the world is in Japan
somewatson: oh the spark sounds pretty fun to play with
iashyam: Good to catch you live here george. I watch your stream during coding at job. Thank you so much for inspiring me.
somewatson: yes I always go every time I visit
codingfisch: Can you reiterate? I just joined but used quite a lot PTX in the last weeks
somewatson: LUL it’s better than sbarro for sure
leikowo: fix: get an nvidia gpu
somewatson: although we all know the best NYC pizza is sbarros
leikowo: did you see how the gb200 works ?
leikowo: gpu memory is another numa node and can be used by system
leikowo: (and gpu can access system memory)
codingfisch: How much more value does your programming provide compared to an average tinycorp employee?
leikowo: ds read ?
dp_smh: why have the load_loop blocks below the compute?
lostendasauce: use ai
tbullshit: when you say we're going to assembly do you mean tinygrad generating the assembly
ardorendk: George, how are you handling quantization and scaling into MXFP6 from FP16? per-tensor, per-channel, or dynamic?
Bfgrocks: is that party ticket on the wall xD
480i: @somewatson i learned that from the office
noseynice: What is the best dish you got in hong kong so far?
GIGA10101: anyone know the expected timeframe for the 1.0 release?
somewatson: or how about using duckduckgo's AI chat?
somewatson: @480i LUL
datdoc: ill ssh to your server and finish the code
therollerblader: what is bro coding
tbullshit: ai girlfriends, just really early stage
therollerblader: wow
therollerblader: not even using ai coding for ai project wow
dp_smh: it seemed the same to me
LilTankster33: I’ve been studying your Tinygrad commits. I’m new to kernel level ML work, but I’m trying to understand how memory tiling affects perf. Respect for showing the process raw like this.
standardpointer: o/
standardpointer: woohoo, 235
ardorendk: nice
maland3r: !f
utkarshgill: what is coalescing
ardorendk: faster by +0
lostendasauce: id try grok too
noseynice: Anything outside of what the median person talks about is very scarce in any LLM dataset, so less probable results?
GENUlNlTY: LuvPeekR
dawkinsisdoper: how are you doing all this without a cursor chat? using actual brain?
Parasocializing: AI brainrot
tbullshit: twitter pilled
Aesthetic_Champ: screen frozen?
Parasocializing: @Aesthetic_Champ yeah
tbullshit: mr streamer your screen is fucked
whiterunknight: good morning
quagmire_8: is rangeify in complete working mode yet
UltraScientism: @noseynice What do you mean? They definitely ingest lots of academic sources, including obscure ones.
dawkinsisdoper: @tbullshit just transparent font
pika2u: yo <3
tbullshit: mr streamer it's frozen
UltraScientism: yes, it's frozen
utkarshgill: screen stuck
azertysdfghjk: @georgehotz we can't see what you type, check your obs settings please
tbullshit: im about to post an ascii of the austrian painter
tbullshit: nvm crisis averted
uranus4head: do u think terry davis was good programmer
want7000: want7000 subscribed with Prime.
want7000: Thoughts on humanity's progress with large language models writing software?
lostendasauce: computers have no insight
UltraScientism: @lostendasauce for a year or two
TigreDozer: AI is more coherent in the answers than 50% of human population...
jmickelly9: There was a twitter space about the sort of software LLMs clearly struggle to understand: browsers, compilers, and device drivers.
lostendasauce: unless there is a breakthrough in neuroscience where thoughts can be translated to electrical signals. still would have to figure out where the eureka moment comes from
Daniel_Ehrhardt: will there be a comma 4 soon?
Daniel_Ehrhardt: 🙈👌
utkarshgill: how to understand rangeify ? I was getting used to older abstractions and shapetracker, now VIZ=1 sometimes feels harder to grasp
UltraScientism: How fast are you currently on this chip compared to pytorch + amd's standard driver stack?
AK_ttv_: Sup
ardorendk: reducing complexity usually improves speed as well
rvx_4: hmm
Hoosier_Cruiser: what up G
dp_smh: a wise man told me this week if you want speed you need a profiler
AK_ttv_: Sorry if this is off topic: I want to create an AI thats connected to the entire Android system, the AI should be intelligent to basically do everything you need and make personalized user predictions by usage...
AK_ttv_: Just an idea.
ardorendk: george will enable sub only mode ^
AK_ttv_: Ok.
hlidskjalf77708: FeelsStrongMan
verxile: looking forward to comma con :D
AK_ttv_: My other idea is to basically make a robot that can replace or be like a doordash... lolol
AK_ttv_: Anyways geohot
jetSETcodecat: Looking forward to Comma Con as well and also looking forward to you developing those Unitree or Boston Dynamics robots. So I can fight it, of course :)
want7000: Will read that now, but as I thought: it creates awful non-functioning code at random and all encounters with ai are unwanted and waste time. Have yet to find a counter-example.
titondev: Hi SUBtember
480i: i feel like google these days is rough
whiterunknight: did you finish reading gravitys rainbow
480i: @whiterunknight pynchon!
Bfgrocks: Have you seen Grokipedia
AK_ttv_: lolol
espacevecto: chat, is he using a mac or linux with apple theme ?
UltraScientism: well there are non-llm AIs that for example solve (mechanically prove) new math Olympiad problems at a good level. That's more impressive to me than the improvements in search that llms brought.
lotr_may: No subchat? PogChamp Hiii
AK_ttv_: So geohot
AK_ttv_: Wsg
7vevex: @UltraScientism Steven Wolfram was talking about that on stream yesterday
jetSETcodecat: jetSETcodecat subscribed at Tier 1.
jetSETcodecat: Yay <3
whiterunknight: goodbye plebs
jetSETcodecat: Now I have free room to talk in depth about how I'm gonna curb stomp a clanker , that is, after I make it fold my clothes and be generally less-than-human enough to do my rote and routine tasks - Slower !!!
jetSETcodecat: I'm sorry, I'll take it back. After 10 years time and I see one capable enough to FIGHT me
jetSETcodecat: Apologies if I'm adamant about fighting robots - It's what gets me up in the morning to be frank. Maybe I'll learn tinygrad & manage to team up with Unitree to UFC fighting robots
jetSETcodecat: Nah you're good, on a personal level
jetSETcodecat: Haha, sorry, I'll learn tinygrad then
whiterunknight: worlds first ai twitch viewer
whiterunknight: bro used an em dash
whiterunknight: will it even run on a fold 7
want7000: they don't care about normal people, enthusiastically marketing bullshit to other rich people makes stock go up, i think
jmickelly9: AWS and Anthropic had an outage recently and just saw many ‘engineers’ could not do any work for the whole day without using Claude. That is beyond horrifying to see that ‘engineers’ lost the ability to code without AI these days.
whiterunknight: God damm ssh
jetSETcodecat: Huzzah
jetSETcodecat: Oh baby
whiterunknight: hell yeah
jetSETcodecat: My lord "It just works" wasn't an exaggeration or even false brother spittin' facts LETS GOOO
dp_smh: yeah own that poser
whiterunknight: thank you for the demo
MinistryOfWarHacker: @georgehotz What should I eat today for breakfast?
BRODZELi: thank you for sharing
warmrow: how much do u use ai code editors daily
AugusDogus: when do you expect to see more tinygrad adoption?
GrandeurMusic: Great Work!
FatherOfFamilyValues: @georgehotz Will you grace us again with more creative writing in your blog?
espacevecto: Congrats
born2zen: hhahaha
Tsoding: I saw that guy. He's the pathetic loser :)
pika2u: Ty George for all you do big heart big heart <3
AugusDogus: what matters?
born2zen: <3
AugusDogus: word
whiterunknight: <3
jetSETcodecat: <3
GrandeurMusic: Write in Assembly, and it becomes beautiful!
jmickelly9: The USB4 to NVIDIA GPU on a Macbook with tinygrad driving it is amazing. Can’t wait for it.
utkarshgill: I implemented sub 500 line beautiful_lander in torch. gets a 200+ score in lunar lander < 250 episodes, reliably. when I directly port it to tinygrad, it gets stuck at 10-12% and its super slow. if I work on it and submit a PR to tinygrad/examples/beautiful_lander.py, would it be useful?
MIGGOXYZ: I aspire to be as cracked as you are one day
mokert2: Hi George, I have a conjecture that has been bouncing in my head for a while, do you think fundamentally model size is just bits? Like if you made the best possible 8GB param model in FP8 and say, an FP64 model with the same memory footprint, would they be equivalent? You may need a different net arch, activation etc, but are bits ultimately just bits regardless?
GrandeurMusic: Does it work for linux the same way (tinygrad) to NVIDIA GPU
AugusDogus: What's the timeline look like for being faster than PyTorch?
GrandeurMusic: Great Stream! Thanks for the stream!
deepvision: I hate clean and fast code I get paid by the hour Kappa
parapuschel: hi geo hi chat
utkarshgill: I don't have a 4090
pika2u: I liked your recent blog posts!
Mach_34: Tea or coffee?
dp_smh: what did you think of WNY in your short time at RIT
Mach_34: Love chinese tea culture. I've been buying lots of puer
inbredcuckfromusa: is this vod available on yt after your stream?
tolpa_zevak: Did George say anything about Durov’s and the Lieberman brothers’ projects on decentralized model training?
tohrxyz: tohrxyz subscribed at Tier 1. They've subscribed for 6 months, currently on a 6 month streak!
tohrxyz: Hey and good day!
jetSETcodecat: @inbredcuckfromusa geohotarchive on YT or type george hotz archive anywhere
jetSETcodecat: Thanks for the stream. See you next stream <3
MinistryOfWarHacker: from MONAD
ibarres: when simple in order VLIW isa gen
JVPost: are you saying, the truth will set you free?
kinglouisxvii: Based
tohrxyz: write that into them books
pika2u: Have you ever read Silver Surfer: Requiem? I have recently and really recommend it
MinistryOfWarHacker: truth is love
ivyharvey: Octopus card is the truth
ardorendk: did u just say slavery is good? Kappa
Y3absira: hello george
UltraScientism: what's so good about hong kong?
Y3absira: 2 2's to my word fam
hulkemist: thanks for stream
AugusDogus: cheers
tohrxyz: happy halloweeen guys
PFASpartout: Joy comes from making sense out of this mess
ardorendk: don't kill anyone
whiterunknight: :(
tohrxyz: dont be too scary on kids
GrandeurMusic: you are in Japan?