Hi! We are working on an application where our model training could benefit from some of our partner’s datasets, but they can’t share them with us due to privacy concerns. Are there AI methods and frameworks to execute distributed training without sharing data centrally?
Hey! If you want to avoid the need to share raw data between participants, Federated Learning (FL) is definitely worth looking into.
FL is a method where multiple parties (clients) can collaboratively train a machine learning model without ever sharing their local data. Instead of sending data to a central server, each participant trains a model on their own dataset locally, then only shares the model updates (like weights or gradients) with a central aggregator. This is super helpful when working with sensitive data (think medical, personal, or proprietary datasets) where privacy is a top concern.
To break it down simply:

- Each client trains on its own local data.
- Only the model updates are shared—not the actual data.
- A central server aggregates these updates to improve the global model.
- The improved model is then sent back to the clients.
- Repeat until the model converges.
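The loop above can be sketched in a few lines of plain Python/NumPy. This is a toy illustration, not any framework's API: `local_train` is a hypothetical stand-in for a real local gradient step, while `fedavg` is the standard dataset-size-weighted averaging (FedAvg-style aggregation) done on the server.

```python
import numpy as np

def local_train(weights, data, lr=0.1):
    """Hypothetical local step: nudge the model toward the mean of the
    client's own data (a stand-in for real gradient descent)."""
    return weights + lr * (data.mean(axis=0) - weights)

def fedavg(client_weights, client_sizes):
    """Server-side aggregation: average the clients' updated weights,
    weighted by each client's dataset size (FedAvg-style)."""
    total = sum(client_sizes)
    return sum(w * (n / total) for w, n in zip(client_weights, client_sizes))

# Two clients with private datasets that never leave their machines.
rng = np.random.default_rng(0)
client_data = [rng.normal(1.0, 0.1, size=(100, 2)),
               rng.normal(3.0, 0.1, size=(50, 2))]

global_model = np.zeros(2)
for _ in range(20):  # federated rounds
    # Each client trains locally; only the resulting weights are shared.
    updates = [local_train(global_model, d) for d in client_data]
    # The server aggregates the updates and broadcasts the new global model.
    global_model = fedavg(updates, [len(d) for d in client_data])
```

Note that only `updates` (model weights) ever cross machine boundaries; the raw arrays in `client_data` stay local, which is the whole point.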
There’s also Swarm Learning (SL), a decentralized variation—great if you want to eliminate reliance on a central server. It shifts coordination to the clients themselves, which removes the single point of failure and gives every participant an equal role.
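To see why no central server is needed, here's a minimal sketch of decentralized (gossip-style) averaging, the kind of peer-to-peer aggregation pattern SL builds on. The ring topology and uniform averaging rule here are illustrative assumptions, not the actual Swarm Learning protocol.

```python
import numpy as np

# Each peer holds its own locally trained model; no server ever collects them.
peer_models = [np.array([1.0, 2.0]),
               np.array([3.0, 0.0]),
               np.array([2.0, 4.0]),
               np.array([6.0, 2.0])]

def gossip_round(models):
    """One round of peer-to-peer averaging on a ring: each peer averages
    its own model with those of its two neighbors."""
    n = len(models)
    return [(models[i] + models[(i - 1) % n] + models[(i + 1) % n]) / 3
            for i in range(n)]

models = peer_models
for _ in range(30):  # repeated gossip rounds
    models = gossip_round(models)

# After enough rounds, every peer converges to the global average
# without any central coordinator.
```

Because each averaging step is symmetric (a doubly stochastic mixing matrix), repeated rounds drive all peers to the same consensus model—the same result FedAvg would produce, just without the server.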
If you’re working within the AI4EOSC platform, you can set up FL training using either NVIDIA’s NVFLARE or the Flower framework. Both tools make it easy to spin up a federated setup across different environments—be it local machines, AI4EOSC deployments, or even external cloud providers.
I've put some references in here for you to take a look at: