What I find most interesting about this is that it shows they believe there is nothing unique at Meta related to AI. There is no resource, neither people nor computing power, that they can't get elsewhere for whatever they believe would be more interesting to them.
I mention this because it feels analogous to military research, where people "dream" of how advanced the military is, how far ahead it is of public research... and yet, that seems to be a recurring myth they love to sustain.
So the signal I get here is that AI "labs" in BigTech have nothing worth waiting for around the corner; it's just more of the same, and boring for the people who stick around.
Negative, what you should have taken away is that it's the people. He mentions standing up clusters. Small shops can't afford clusters. Ignore the technical aspect of this article and read it for what it is: a thank-you note to the people he has worked with on amazing projects. Research in a bubble of one isn't very useful. Research in a small team with a Meta budget is extremely useful. With the right people.
I don't think that's the read? The guy says he wants to work on something small. If you want to work on something big, you probably want to be at a big corp to have the resources to do the big thing.
Also absolutely unknown if the "new thing" is AI-related at all!
Well, he left, so whatever is coming next, AI related or not, "small" or not (small for them might be reaching just a million people; he wrote that he "lead the software layer that powers the entire AI industry", so his notion of scale is probably unlike mine, and maybe yours too), is more exciting to him than whatever he could do next with all of Meta's resources.
Edit: to be clear, I didn't mean to imply their next thing is AI related, solely that they obviously know more about AI at Meta than, e.g., XR at Meta, just because that's their expertise.
> I mention this because it feels analogous to military research, where people "dream" of how advanced the military is, how far ahead it is of public research... and yet, that seems to be a recurring myth they love to sustain.
I don't think that you can read this from the blog post at all, but it gives me a chuckle to think how the quest for AGI at Meta may be "The Men Who Stare at Goats" all over again.
I'm totally speculating. I have no extra information there.
It just makes me think of all the staff, technical staff in particular, who left OpenAI recently, while Altman was making grand claims about what was coming next.
Well, we know what followed, and I don't think any researcher who left knowing what was in the pipeline feels like they missed much in terms of access.
Just checked, BTW, and... the premise looks fun but the score is too low: https://www.rottentomatoes.com/m/men_who_stare_at_goats Was it actually good as a movie, not just the idea behind it?
It's more the idea behind it. Considering the great cast, the result could have been much better.
The non-fiction book behind it is probably a better comparison than the film adaptation, if you think Meta is doing goat-staring (I don't think they're especially bad on this issue compared to their rivals).
That man has an infectious enthusiasm. I remember the DCGAN paper inspired me to try getting the (Lua) Torch code to work, and I tried it on the Oxford flowers dataset early on. It worked surprisingly well, and Soumith Chintala even shared it around on social media, surprised at how well it worked on such a small dataset. Of course, back then we didn't really appreciate the problem of mode collapse.
PyTorch and the old Lua Torch were a pleasure to work with compared to the contemporary TensorFlow. Lots of S.C.'s code was copied around liberally; it had its quirks (I remember the DCGAN code had a pretty odd way of doing parameter passing), but it was also really easy to understand and made random people like me feel like we had suddenly stumbled onto something crazy powerful (which we had!). It was wonderfully hackable.
For anyone who's curious, the underlying Torch library is also a joy to work with, as are the many other Torch bindings. For example, Rust has tch and Burn, which both work with libtorch.
PyTorch of course has the benefit of being dynamically debuggable. I can't forget the first time I breakpointed my PyTorch model and wrote PyTorch calls in the terminal to inspect its behavior. That's still something I miss a lot now that I'm working only with "fast" compiled code.
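A minimal sketch of that workflow (the model and names here are made up, purely for illustration): drop into pdb inside forward() and poke at live tensors from the prompt.

    import pdb

    import torch
    import torch.nn as nn

    class TinyNet(nn.Module):
        # Toy model, only for illustrating interactive debugging.
        def __init__(self):
            super().__init__()
            self.fc1 = nn.Linear(8, 16)
            self.fc2 = nn.Linear(16, 2)

        def forward(self, x):
            h = torch.relu(self.fc1(x))
            # Drop into the debugger mid-forward; at the (Pdb) prompt you can
            # run arbitrary PyTorch calls: h.shape, h.mean(), h[0], ...
            pdb.set_trace()
            return self.fc2(h)

    model = TinyNet()
    out = model(torch.randn(4, 8))
    print(out.shape)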
I wrote some truly awful code back in the day because of that, but god, it was glorious.
As a loyal JAX user, I hope they can play catch-up. PyTorch has dominated the AI scene since TF1 fumbled the ball at the 10-yard line. What Matt Johnson has done turning Autograd into JAX is hopefully going to be worthy of as much praise as what Soumith has received.
> PyTorch has dominated the AI scene since TF1 fumbled the ball at the 10-yard line
Can you explain why you think TensorFlow fumbled?
For me it was about eight years ago. Back then TF was already bloated but had two weaknesses. Its bet on static compute graphs made writing code verbose and debugging difficult.
The few people I knew back then used Keras instead. I switched to PyTorch for my next project, since it was more "batteries included".
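For anyone who never used it, a rough sketch of the TF1 static-graph style (written from memory against the tf.compat.v1 shim, so it should still run on a TF2 install):

    import numpy as np
    import tensorflow.compat.v1 as tf  # TF1-style API via the compat shim

    tf.disable_eager_execution()

    # First, build the graph symbolically; nothing runs yet.
    x = tf.placeholder(tf.float32, shape=[None, 8], name="x")
    w = tf.Variable(tf.random_normal([8, 1]), name="w")
    loss = tf.reduce_mean(tf.square(tf.matmul(x, w)))
    train_op = tf.train.GradientDescentOptimizer(0.1).minimize(loss)

    # Then execute it inside a session. Shape errors and the like often
    # surface here, far from the line where the graph was defined.
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        batch = np.random.randn(4, 8).astype(np.float32)
        _, l = sess.run([train_op, loss], feed_dict={x: batch})
        print(l)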
I personally believe TF1 was serving the needs of its core users. It provided a compilable compute graph with autodiff, and you got very efficient training and inference from it. There was a steep learning curve, but if you got past it, things worked very, very well. Distributed TF never really took off: it was buggy, and I think they made some wrong early bets in the design, for performance reasons, that should have been sacrificed in favor of simplicity.
I believe that some years after the TF1 release, they realized the learning curve was too steep and they were losing users to PyTorch. I think the Cloud team was also attempting to sell customers on their amazing DL tech, which was falling flat. So they tried to keep the TF brand while totally changing the product under the hood, introducing imperative programming and gradient tapes. They killed TF1, upsetting those users, while not having a fully functioning TF2, all the while having plenty of documentation pointing to TF1 references that didn't work. Any new grad student made the simple choice of using a tool that was user-friendly and worked, which was PyTorch. And most old TF1 users hopped on the bandwagon.
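For contrast, the "imperative programming and gradient tapes" style that TF2 switched to looks roughly like this (plain TF2 eager API; the toy shapes are made up):

    import tensorflow as tf

    w = tf.Variable(tf.random.normal([8, 1]))
    x = tf.random.normal([4, 8])

    # Eager execution: ops run immediately; gradients are recorded on a
    # tape rather than baked into a separate static graph.
    with tf.GradientTape() as tape:
        loss = tf.reduce_mean(tf.square(tf.matmul(x, w)))

    grad = tape.gradient(loss, w)
    w.assign_sub(0.1 * grad)
    print(float(loss))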
Imagine a total newbie trying to fine-tune an image classifier, reusing some open source example code, about a decade ago.
If their folder of 10,000 labelled images contains one image that's a different size to the others, the training job will fail with an error about unexpected dimensions while concatenating.
But it won't be able to say the file's name, or that the problem is an input image of the wrong size. It'll just say it can't concatenate tensors of different sizes.
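That failure mode still exists today. A hypothetical PyTorch reproduction (the dataset and sizes are made up): the error comes from deep inside the default collate function with no hint of which file is at fault.

    import torch
    from torch.utils.data import DataLoader, Dataset

    class FakeImageFolder(Dataset):
        # Stand-in for a folder of images where one file has the wrong size.
        def __len__(self):
            return 100

        def __getitem__(self, idx):
            size = 250 if idx == 37 else 224  # item 37 is the odd one out
            return torch.randn(3, size, size), 0

    loader = DataLoader(FakeImageFolder(), batch_size=32)
    for images, labels in loader:
        pass
    # RuntimeError: stack expects each tensor to be equal size, but got
    # [3, 224, 224] at entry 0 and [3, 250, 250] at entry 5
    # -- no filename, no dataset index, nothing for a newbie to grep for.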
An experienced user will recognise the error immediately, and will have run a data cleansing script beforehand anyway. But it's not experienced users who bounce from frameworks, it's newbies.
> An experienced user will recognise the error immediately, and will have run a data cleansing script beforehand anyway. But it's not experienced users who bounce from frameworks, it's newbies.
Even seasoned developers will bounce away from frameworks or libraries - no matter if they're old dogs or the next hot thing - if the documentation isn't up to speed or if simple, common tasks require wading through dozens of pages of documentation.
Writing good documentation is hard enough, writing relevant "common usage examples" is even harder... but keeping them up to date and working is a rarely seen art.
And the greatest art of all is logging. Soooo many libraries refuse to implement detailed, structured logging in their internal classes (despite Java and PHP in particular offering very powerful mechanisms), making it much more difficult to troubleshoot problems in the field.
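A minimal sketch of the idea in Python, since that's the language in this thread (all module and path names are hypothetical). The point is that the library logs what it is doing and names the offending input, not just the bad shape:

    import logging

    logger = logging.getLogger("mylib.loader")  # hypothetical library module

    def load_image(path, expected_size):
        # Toy "loader": the point is that errors carry actionable context.
        logger.debug("loading %s (expecting %s)", path, expected_size)
        size = (250, 250) if path.endswith("0037.jpg") else (224, 224)  # fake file read
        if size != expected_size:
            logger.error("unexpected image size in %s: got %s, expected %s",
                         path, size, expected_size)
            raise ValueError(f"{path}: got {size}, expected {expected_size}")
        return size

    logging.basicConfig(level=logging.DEBUG)
    load_image("data/flowers/0001.jpg", (224, 224))
    load_image("data/flowers/0037.jpg", (224, 224))  # logs and raises, naming the file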
Do you have experience in both JAX and PyTorch? Why do you prefer JAX?
His homepage says he wants to build a robot. So he is probably going to work with robots for his next role.
He is an investor in Anthropic; I didn't know you could do that while working for Meta.
>>Every major AI company and hardware vendor are on a speed dial. This kind of power is really hard to give up. But curiosity ultimately won out in my head.
A simple feeling has such power. May he get an opportunity to create one more powerful tool before retiring.
This is the end of an era. Amazing work, Soumith.
Very proud as a Swiss that Soumith has a .ch domain!
I read one post on his blog and found that Adam Paszke reached out to the author and got an internship. I wonder if it was really that easy to get an internship at FAIR. I thought they hired only PhDs.
Nice, that is the dream career!
Counterfactual Regret Minimization irl
PyTorch is one of those tools that’s so simple and easy to take apart that you feel like you might’ve been able to make it yourself. I can’t imagine how much engineering effort was behind all those moments where I thought to myself, “of course it should work like that, how can it be any other way?”
Can anyone recommend a technical overview describing the design decisions PyTorch made that led it to win out?
PyTorch's choice of a dynamic computation graph [1] made it easier to debug and implement, leading to higher adoption, even though running speed was initially slower (and therefore training cost higher).
Other decisions follow from this one.
TensorFlow started with static graphs and had to move to dynamic ones in version 2.0, which broke everything and left fragmentation between TensorFlow 1, TensorFlow 2, Keras, and JAX.
PyTorch's later compilation of this computation graph erased TensorFlow's remaining edge (a small sketch follows below the footnote).
Is the battle over? From a purely computational point of view, the PyTorch solution is very far from optimal, and billions of dollars of electricity and GPUs are burned every year, but major players are happy with circular deals to entrench their positions. So at the pace of current AI code development, it's probably one or two years before PyTorch is old history.
[1] https://www.geeksforgeeks.org/deep-learning/dynamic-vs-stati...
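To make the dynamic-vs-compiled point concrete, a minimal sketch (needs PyTorch 2.x for torch.compile; the toy model is made up): the forward pass is ordinary eager Python, and torch.compile can later trace and optimize the same code without changing the programming model.

    import torch
    import torch.nn as nn

    model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
    x = torch.randn(4, 8)

    # Dynamic graph: every op runs immediately, so you can print, branch,
    # or drop into a debugger anywhere in the forward pass.
    y = model(x)
    print(y.shape)

    # PyTorch 2.x: torch.compile traces the same Python code into an
    # optimized graph while keeping the eager feel.
    compiled = torch.compile(model)
    print(torch.allclose(y, compiled(x), atol=1e-5))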
> at the pace of current AI code development, it's probably one or two years before PyTorch is old history
Ehhh, I don’t know about that.
Sure, new AI techniques and new models are coming out pretty fast, but when I go to work with a new AI project, they’re often using a version of PyTorch or CUDA from when the project began a year or two ago. It’s been super annoying having to update projects to PyTorch 2.7.0 and CUDA 12.8 so I can run them on RTX 5000 series GPUs.
All this to say: If PyTorch was going to be replaced in a year or two, we’d know the name of its killer by now, and they’d be the talk of HN. Not to mention that at this point all of the PhDs flooding into AI startups wrote their grad work in PyTorch, it has a lot of network lock-in that an upstart would have to overcome by being way better at something PyTorch can never be good at. I don’t even know what that would be.
Bear in mind that it took a few years for Tensorflow to die out due to lock in, and we all knew about PyTorch that whole time.
I don't know the full list, but back when it came out, TF felt like a crude set of bindings to the underlying C++/CUDA workhorse. PyTorch, in contrast, felt Pythonic. It was much closer in feeling to NumPy.
I think it was mostly the eager evaluation that made it possible to debug every step of the network's forward/backward passes. TensorFlow didn't have that at the time, which made debugging practically impossible.
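A small illustration of what that looks like in eager PyTorch (toy model, made-up shapes): intermediate activations and per-parameter gradients are ordinary tensors you can poke at directly.

    import torch
    import torch.nn as nn

    model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
    x = torch.randn(4, 8)

    # Forward: intermediate values are ordinary tensors you can inspect.
    h = model[0](x)
    print("pre-activation mean/std:", h.mean().item(), h.std().item())

    # Backward: after .backward(), every parameter's gradient is right there.
    loss = model(x).pow(2).mean()
    loss.backward()
    for name, p in model.named_parameters():
        print(name, "grad norm:", p.grad.norm().item())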
I'm not sure such an overview exists, but back when Caffe2 was still a thing and JAX was a big contender, dynamic vs. static computation graphs seemed to be a major focus for people ranking the frameworks.
The last few years must have been incredibly exhausting. Thanks for your work, good luck, and 73.
Respect.
Sounds like you had a momentous run.
If you take advice from reformed Internet trolls, consider turning off all your devices and trying to give yourself at least a week, but ideally a month offline staring at your new baby. You'll never get that time back and there's nothing your brain will appreciate more than loading up those memories as they grow.
Good luck.