exo optimally splits up models based on the current network topology and the device resources available. This lets you run larger models than you could run on any single device.
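Something like the following sketch captures the idea, though it isn't exo's actual partitioning code — the device names and memory figures here are made up for illustration:

```python
# Rough sketch of topology-aware partitioning (not exo's real algorithm):
# split a model's layers across devices in proportion to available memory.

def partition_layers(num_layers: int, device_memory: dict[str, int]) -> dict[str, range]:
    """Assign contiguous layer ranges to devices, proportional to memory (GB)."""
    total = sum(device_memory.values())
    shards, start = {}, 0
    for i, (device, mem) in enumerate(device_memory.items()):
        # last device absorbs any rounding remainder
        count = num_layers - start if i == len(device_memory) - 1 \
            else round(num_layers * mem / total)
        shards[device] = range(start, start + count)
        start += count
    return shards

# e.g. an 80-layer model (Llama-3-70B) over three uneven devices:
print(partition_layers(80, {"macbook": 16, "mac-studio": 64, "raspberry-pi": 4}))
# {'macbook': range(0, 15), 'mac-studio': range(15, 76), 'raspberry-pi': range(76, 80)}
```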
The embeddings for Llama-3-8B are around 8KB-10KB; for Llama-3-70B they're around 32KB. These are small enough to send between devices on a local network.
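Those figures line up with back-of-envelope arithmetic: what crosses the wire per token is roughly hidden_size × bytes per element. The hidden sizes (4096 and 8192) are the published model dimensions; the fp16/fp32 choices below are my assumptions to match the stated figures:

```python
# Per-token activation size: hidden_size * bytes per element.
def activation_kb(hidden_size: int, bytes_per_elem: int = 2) -> float:
    return hidden_size * bytes_per_elem / 1024

print(activation_kb(4096))     # 8.0  -> ~8KB at fp16 for Llama-3-8B
print(activation_kb(8192, 4))  # 32.0 -> ~32KB at fp32 for Llama-3-70B
```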
This kind of swarm compute is so cool and should be more common. Definitely gets us closer to frugal and salvage computing and permacomputing.
AIHorde is another example (though it uses peer compute over the internet rather than devices on a local network).