Federated Learning in Data Mining: How Models Learn Without Touching Your Data

Introduction

Your phone keyboard predicts the next word as you type. It gets better the more you use it. Yet if you check Google's privacy policy, you will find that your messages are not uploaded to a server to train that model.

So how is it learning?

That question is what Federated Learning answers. It is a way to train machine learning models across many separate devices or servers — where each one keeps its data locally and only sends back what it learned, not what it saw.

This post covers the concept from scratch: the architecture, the formulas, a solved numerical example, and a full Python implementation using only NumPy. No special FL library needed.

───────────────────────────────────────────────────

What Is Federated Learning?

In standard machine learning, you collect data from all sources into one central server, train your model there, and deploy it. Simple, but it requires everyone to hand over their raw data.

Federated Learning (FL) removes that requirement. Introduced by McMahan et al. at Google in 2017, the approach trains a single global model across many clients — phones, hospitals, banks — without any raw data leaving those clients.

What travels between client and server is only model weights: a set of numbers representing what the model learned. Not the underlying records that produced that learning.

Three properties define every FL system:

Decentralized — data stays on the device or institution where it was generated.

Collaborative — all participating clients contribute to one shared global model.

Privacy-preserving — communication is always weights or gradients, never raw data.

───────────────────────────────────────────────────

How a Training Round Works

FL runs in repeated rounds. Here is what happens in each one:

Server side:

Pick a random fraction of available clients (say 10 out of 100)
Send them the current global model weights w_t

Client side (each selected client, running in parallel):

Download w_t
Train it on their local private data for a few epochs
Send the updated weights w_k back to the server

Server side again:

Receive all the local updates
Combine them into a new global model using a weighted average
That new model becomes w_{t+1} and the next round begins

The server never sees a single data point. It only sees weight matrices.

───────────────────────────────────────────────────

The Math

Global Objective Function

The server is minimizing this across all clients:

F(w) = Σ_{k=1}^{K}  (n_k / n)  ·  F_k(w)

K = total number of clients
n_k = number of data samples on client k
n = total samples across all clients (Σ n_k)
F_k(w) = the loss on client k's local data

The global loss is a weighted average of local losses. Clients with more data influence the global model more.

Local Client Loss

Each client k is independently minimizing:

F_k(w) = (1 / n_k)  ·  Σ_{i ∈ D_k}  ℓ(x_i, y_i ; w)

ℓ = per-sample loss function (cross-entropy, MSE, etc.)
D_k = that client's local private dataset
w = the model weights being optimized

This is standard empirical risk minimization, run locally on each device.
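As a quick sanity check on the two formulas above, the sketch below computes each local loss F_k(w), combines them with the n_k / n weights, and compares the result with the loss on the pooled data. It assumes a linear model with squared-error loss and synthetic data. The helper names local_loss and global_loss and the client sizes are illustrative, not part of any library.

```python
import numpy as np

def local_loss(w, X, y):
    """F_k(w): mean squared error of the linear model X @ w on one client's data."""
    residual = X @ w - y
    return float(np.mean(residual ** 2))

def global_loss(w, client_data):
    """F(w): local losses weighted by each client's share of the samples, n_k / n."""
    n = sum(len(y) for _, y in client_data)
    return sum(len(y) / n * local_loss(w, X, y) for X, y in client_data)

rng = np.random.default_rng(0)
client_data = []
for n_k in (200, 300, 100):                      # hypothetical client sizes
    X = rng.normal(size=(n_k, 2))
    y = X @ np.array([3.0, -1.0]) + 0.1 * rng.normal(size=n_k)
    client_data.append((X, y))

w = np.array([2.5, -0.5])                        # any candidate weights
X_all = np.vstack([X for X, _ in client_data])
y_all = np.concatenate([y for _, y in client_data])

print(global_loss(w, client_data))               # weighted average of local losses
print(local_loss(w, X_all, y_all))               # loss on the pooled data
```

The two printed values agree up to floating-point rounding. That is the point of the n_k / n weighting: minimizing F(w) targets the same objective as training on the pooled data, without the data ever being pooled.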

FedAvg — The Aggregation Step

After each round, the server computes:

w_{t+1} = Σ_{k=1}^{K}  (n_k / n)  ·  w_k^{t+1}

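This is the core of Federated Learning. The new global model is a weighted average of the locally trained models, so clients that trained on more data have proportionally more influence on the result. When only a subset of clients is selected in a round, the sum runs over that subset and n is the total number of samples held by the selected clients.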

Local SGD Update (What Each Client Runs)

w_k  ←  w_k  −  η · ∇F_k(w_k)

η = learning rate
∇F_k = gradient of the local loss with respect to weights
This step repeats for E local epochs before the client sends anything back.
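The implementation later in this post takes one full-batch gradient step per epoch. A mini-batch variant of the local update, closer to what "SGD" usually means, might look like the sketch below. The function name client_sgd, the batch size, and the learning rate are illustrative assumptions. The loss is taken to be half the mean squared error of a linear model.

```python
import numpy as np

def client_sgd(w, X, y, lr=0.05, epochs=3, batch_size=16, seed=0):
    """Run `epochs` passes of mini-batch SGD over one client's data."""
    rng = np.random.default_rng(seed)
    w = w.copy()                                   # leave the downloaded weights untouched
    n = len(y)
    for _ in range(epochs):
        order = rng.permutation(n)                 # reshuffle the samples every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            residual = X[idx] @ w - y[idx]
            grad = X[idx].T @ residual / len(idx)  # gradient of 0.5 * MSE on the batch
            w = w - lr * grad                      # w_k <- w_k - eta * grad
    return w

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 1))
y = 3.0 * X[:, 0] + 0.1 * rng.normal(size=200)

w_start = np.zeros(1)
w_local = client_sgd(w_start, X, y)
print(w_start, w_local)
```

Only w_local would be sent back to the server. The data arrays X and y never leave the function's caller.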

───────────────────────────────────────────────────

Solved Example — Step by Step

Setup: Three hospital clients. Simple linear model ŷ = w · x with a single weight w. Find the global model after one round of FedAvg.

Client	Description	Samples (n_k)	Local Weight w_k	Data fraction (n_k / n)
Client 1	Hospital A	200	0.80	200/600 = 1/3
Client 2	Hospital B	300	0.60	300/600 = 1/2
Client 3	Hospital C	100	1.20	100/600 = 1/6
Total		n = 600

Apply the FedAvg formula:

w_global = (1/3) × 0.80  +  (1/2) × 0.60  +  (1/6) × 1.20
         = 0.2667  +  0.3000  +  0.2000
         = 0.7667

Result: w_global ≈ 0.767

Hospital B has the most patients (300), so its local weight of 0.60 receives the largest coefficient, 1/2. Hospital C has only 100 patients, so its outlier weight of 1.20 is scaled by 1/6 and does not dominate, although it still lifts the average: the result of 0.767 ends up nearer Hospital A's 0.80 than Hospital B's 0.60. The weighting handles this automatically.

No hospital shared a single patient record. The only thing exchanged was one number per client — their locally trained weight.
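The arithmetic above can be reproduced in a few lines of NumPy, using only the numbers from the table. The variable names are illustrative.

```python
import numpy as np

samples = np.array([200, 300, 100])        # n_k for Hospitals A, B, C
local_w = np.array([0.80, 0.60, 1.20])     # locally trained weights w_k

fractions = samples / samples.sum()        # n_k / n
w_global = float(np.sum(fractions * local_w))

print(fractions)
print(round(w_global, 4))                  # 0.7667
```

np.average(local_w, weights=samples) gives the same value in one call.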

───────────────────────────────────────────────────

Python Implementation

Built from scratch using only NumPy. Four functions: data generation, local training, aggregation, and the main training loop.

import numpy as np

# ── 1. Generate synthetic data for each client ───────────────────────
def generate_data(n, noise=0.1):
    X = np.random.randn(n, 1)
    y = 3 * X.squeeze() + 1 + noise * np.random.randn(n)
    return X, y

# ── 2. Client: local SGD training ────────────────────────────────────
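def client_update(w, X, y, lr=0.01, epochs=3):
    # w has shape (1, 1). y is reshaped to a column so the residual is (n, 1);
    # subtracting a flat (n,) array would broadcast to (n, n) and break fedavg.
    n     = len(y)
    y_col = y.reshape(-1, 1)
    for _ in range(epochs):
        y_hat = X @ w
        grad  = X.T @ (y_hat - y_col) / n   # one full-batch gradient step per epoch
        w     = w - lr * grad
    return w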

# ── 3. Server: FedAvg weighted aggregation ───────────────────────────
def fedavg(weights, sizes):
    total      = sum(sizes)
    aggregated = np.zeros_like(weights[0])
    for w, n in zip(weights, sizes):
        aggregated += (n / total) * w
    return aggregated

# ── 4. Main federated training loop ──────────────────────────────────
def federated_train(K=5, C=0.6, E=3, rounds=10, lr=0.01):
    clients  = [generate_data(np.random.randint(50, 200)) for _ in range(K)]
    w_global = np.zeros((1, 1))

    for t in range(rounds):
        m        = max(int(C * K), 1)
        selected = np.random.choice(K, m, replace=False)

        local_weights, sizes = [], []
        for k in selected:
            X_k, y_k = clients[k]
            w_k = client_update(w_global.copy(), X_k, y_k, lr, E)
            local_weights.append(w_k)
            sizes.append(len(y_k))

        w_global = fedavg(local_weights, sizes)
        print(f"Round {t+1:2d}:  w = {w_global[0,0]:.4f}")

    print(f"\nFinal weight: {w_global[0,0]:.4f}   (true value = 3.0)")

federated_train(K=5, rounds=10)

Understanding the Output

Round  1:  w = 1.2341
Round  2:  w = 2.0187
Round  3:  w = 2.4932
Round  4:  w = 2.7456
Round  5:  w = 2.8834
Round  6:  w = 2.9423
Round  7:  w = 2.9712
Round  8:  w = 2.9867
Round  9:  w = 2.9941
Round 10:  w = 2.9978   ← converges toward true weight 3.0

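The synthetic data was generated from y = 3x + 1, so the true slope is 3.0. The model ŷ = w · x has no intercept term, but because x is drawn with zero mean, the least-squares slope is still approximately 3. Treat the trace above as illustrative: exact values depend on the random seed and the learning rate. With lr=0.01, each gradient step closes only about 1% of the remaining gap, so ten rounds of three epochs move w roughly a quarter of the way to 3.0. A larger learning rate, on the order of 0.2, is needed for a trace that approaches 3.0 this quickly.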

The convergence here is smooth because all clients have data from the same distribution. In real-world FL, the path to convergence is messier — which brings us to the main challenge.

───────────────────────────────────────────────────

The Non-IID Problem

IID stands for independently and identically distributed. In a classroom example, all clients have data drawn from the same distribution. In the real world, they almost never do.

A hospital in a rural area sees different patients than one in a city. A phone used by a 65-year-old has different typing patterns than one used by a student. When client data distributions differ significantly, local models pull the global model in different directions during each round. The aggregated result becomes a compromise that works poorly for everyone. This is called client drift.
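A toy illustration of the effect, under strong simplifying assumptions: two hypothetical clients whose data follow different slopes, each solving its one-parameter least-squares problem exactly, followed by a single size-weighted average. It shows the end state that local training pulls toward rather than the round-by-round dynamics. All names and numbers are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_client(slope, n):
    """Hypothetical client whose data follows y = slope * x plus small noise."""
    x = rng.normal(size=n)
    y = slope * x + 0.05 * rng.normal(size=n)
    return x, y

def best_local_w(x, y):
    """Closed-form least-squares weight for the one-parameter model y_hat = w * x."""
    return float(x @ y / (x @ x))

def mse(w, x, y):
    """Mean squared error of y_hat = w * x on one client's data."""
    return float(np.mean((w * x - y) ** 2))

clients = [make_client(2.0, 300), make_client(4.0, 300)]   # two different distributions
local_ws = [best_local_w(x, y) for x, y in clients]
sizes = [len(y) for _, y in clients]
w_avg = float(np.average(local_ws, weights=sizes))         # FedAvg-style combination

for (x, y), w_k in zip(clients, local_ws):
    print(f"local w = {w_k:.2f}  local-model MSE = {mse(w_k, x, y):.3f}  "
          f"averaged-model MSE = {mse(w_avg, x, y):.3f}")
```

Each local model fits its own data almost perfectly, while the averaged weight lands near 3 and fits neither client well.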

───────────────────────────────────────────────────

FedProx — The Standard Fix

FedProx modifies each client's local training objective by adding a proximal term:

min_w   F_k(w)  +  (μ/2) · ‖w − w_t‖²

The second term penalizes the client's model for drifting too far from the current global model w_t. μ controls how strictly this is enforced. Higher μ means clients stay closer to the global model, which stabilizes training but may slow convergence on clients with genuinely different data.
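In code, the proximal term only adds μ · (w − w_t) to the local gradient. The sketch below assumes a linear model with half-MSE loss and plain full-batch gradient descent. The function name fedprox_client_update and the values of μ are illustrative, not taken from the FedProx paper.

```python
import numpy as np

def fedprox_client_update(w_global, X, y, lr=0.05, epochs=20, mu=0.0):
    """Local gradient descent on F_k(w) + (mu / 2) * ||w - w_global||^2."""
    w = w_global.copy()
    n = len(y)
    for _ in range(epochs):
        grad_loss = X.T @ (X @ w - y) / n      # gradient of the local loss (0.5 * MSE)
        grad_prox = mu * (w - w_global)        # gradient of the proximal term
        w = w - lr * (grad_loss + grad_prox)
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y = 4.0 * X[:, 0] + 0.1 * rng.normal(size=200)   # this client's data prefers w close to 4

w_t = np.array([3.0])                             # current global model
distances = []
for mu in (0.0, 1.0, 10.0):
    w_k = fedprox_client_update(w_t, X, y, mu=mu)
    distances.append(abs(w_k[0] - w_t[0]))
    print(f"mu = {mu:4.1f}  local w = {w_k[0]:.3f}  distance from w_t = {distances[-1]:.3f}")
```

Setting mu=0 recovers the plain local update. Larger values keep the returned weight closer to w_t, which is the stabilizing effect described above.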

───────────────────────────────────────────────────

Privacy: What Actually Gets Shared?

Sending weights instead of raw data is already a major improvement. But weights alone are not perfectly private — under certain adversarial conditions it is possible to partially reconstruct training data from gradients. Three techniques address this:

Differential Privacy — Before sending updates, the client clips gradients and adds calibrated noise.

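# clip_bound is the clipping threshold (often written C; unrelated to the client fraction C used earlier)
gradient = np.clip(gradient, -clip_bound, clip_bound)
gradient = gradient + np.random.laplace(0.0, clip_bound / epsilon, gradient.shape)

epsilon is the privacy budget. Smaller epsilon means more noise and stronger privacy, but also more interference with the learning signal. The snippet is schematic: a formal guarantee requires the noise scale to be calibrated to the sensitivity of the whole clipped update, not just to the per-coordinate bound.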
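A common alternative to per-coordinate clipping with Laplace noise is to clip the L2 norm of the whole update and add Gaussian noise proportional to the clipping bound. The sketch below shows only that mechanical step. It does not compute the resulting epsilon, which needs a separate privacy accounting step. The names privatize_update, clip_norm, and noise_multiplier are assumptions made for illustration.

```python
import numpy as np

def privatize_update(update, clip_norm=1.0, noise_multiplier=1.0, rng=None):
    """Clip an update to L2 norm <= clip_norm, then add Gaussian noise scaled to that bound."""
    rng = np.random.default_rng() if rng is None else rng
    norm = np.linalg.norm(update)
    scale = min(1.0, clip_norm / (norm + 1e-12))     # shrink only if the norm exceeds the bound
    clipped = update * scale
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=update.shape)
    return clipped + noise

rng = np.random.default_rng(0)
update = np.array([3.0, -4.0])                       # L2 norm 5, above the bound
clipped_only = privatize_update(update, clip_norm=1.0, noise_multiplier=0.0, rng=rng)
noisy = privatize_update(update, clip_norm=1.0, noise_multiplier=1.0, rng=rng)

print(np.linalg.norm(clipped_only))                  # about 1.0 after clipping
print(noisy)                                         # clipped update plus noise
```

Clipping bounds how much any single update can move the aggregate. The noise is then sized relative to that bound, so a larger noise_multiplier trades accuracy for privacy.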

Secure Aggregation (SecAgg) — A cryptographic protocol where the server can only decrypt the sum of all client updates, not any individual one. Even if the server is compromised, it cannot isolate what a single client sent.

Homomorphic Encryption — Clients encrypt their weights before sending. The server aggregates directly on ciphertext. The decrypted result is the correct global model. No unencrypted weights ever exist on the server. This gives the strongest privacy guarantee but is computationally expensive at scale.
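The idea behind secure aggregation can be illustrated with pairwise masks that cancel in the sum. This is a toy version under heavy assumptions: the masks are drawn from one shared random generator, whereas a real protocol derives them through key agreement between clients and also has to cope with clients dropping out. None of that is modeled here.

```python
import numpy as np

rng = np.random.default_rng(0)
num_clients, dim = 3, 4
updates = [rng.normal(size=dim) for _ in range(num_clients)]   # private client updates

# Each pair (i, j) with i < j shares one random mask; i adds it, j subtracts it.
pair_masks = {(i, j): rng.normal(size=dim)
              for i in range(num_clients) for j in range(i + 1, num_clients)}

def masked_update(k):
    """What client k would send: its update plus or minus the masks shared with its peers."""
    out = updates[k].copy()
    for (i, j), mask in pair_masks.items():
        if k == i:
            out += mask
        elif k == j:
            out -= mask
    return out

sent = [masked_update(k) for k in range(num_clients)]
server_sum = np.sum(sent, axis=0)          # masks cancel pairwise in the sum
true_sum = np.sum(updates, axis=0)

print(np.allclose(server_sum, true_sum))   # True
```

Each vector in sent looks like noise on its own, yet the server still recovers the exact sum it needs for the weighted average.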

───────────────────────────────────────────────────

Where It Is Being Used Today

Google Gboard — The Android keyboard learns your typing patterns entirely on-device. Model updates are aggregated across millions of phones overnight while they charge. Your messages stay on your phone.

Healthcare (NVIDIA FLARE) — Hospitals in different countries collaborate to train cancer detection models on medical imaging data. No patient scan crosses hospital boundaries. Each site trains locally and only contributes weight updates.

Fraud Detection — Banks train shared fraud detection models without exposing customer transaction histories to each other or a central authority. The model improves from collective signal without pooling sensitive records.

Autonomous Vehicles — Vehicles share learned representations of driving scenarios — pedestrians, road conditions, edge cases — without broadcasting their GPS routes or locations.

───────────────────────────────────────────────────

Limitations

Communication overhead — FL requires many rounds of communication between server and clients. Each round transmits full model weights, which at production scale becomes a significant bandwidth cost. Gradient compression and quantization help but add their own approximation errors.

Non-IID data is still an open problem — FedProx helps but does not fully solve client drift. Active research areas include SCAFFOLD, FedNova, and personalized FL, where each client retains a locally adapted model rather than fully adopting the global one.

No ground truth for privacy guarantees — Differential Privacy gives a mathematical bound on leakage, but the epsilon parameter is hard to interpret in practice. What is an acceptable privacy budget for medical data is not a settled question.

Straggler clients — In cross-device FL across millions of phones, devices go offline mid-round or have slow connections. The server must decide whether to wait or proceed without them. Both choices introduce bias into the aggregated model.

Verification is hard — In centralized ML you can inspect training data for quality issues. In FL, the server has no visibility into what any client's local dataset actually looks like. Malicious clients can send poisoned updates and the server has limited tools to detect it.
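To make the communication-overhead point concrete, here is a minimal sketch of uniform 8-bit quantization of a weight vector before transmission. It shows the bandwidth saving and the approximation error the text mentions. The function names and the random stand-in weights are illustrative assumptions, and production systems typically use more elaborate schemes.

```python
import numpy as np

def quantize_uint8(w):
    """Map float weights to 8-bit codes plus the offset and step needed to decode them."""
    w_min, w_max = float(w.min()), float(w.max())
    step = (w_max - w_min) / 255 or 1.0            # fall back to 1.0 for constant arrays
    codes = np.round((w - w_min) / step).astype(np.uint8)
    return codes, w_min, step

def dequantize_uint8(codes, w_min, step):
    """Reconstruct approximate float weights on the receiving side."""
    return codes.astype(np.float32) * step + w_min

rng = np.random.default_rng(0)
weights = rng.normal(size=10_000).astype(np.float32)   # stand-in for a model update

codes, w_min, step = quantize_uint8(weights)
restored = dequantize_uint8(codes, w_min, step)
max_error = float(np.max(np.abs(weights - restored)))

print(weights.nbytes, codes.nbytes)                    # 40000 10000
print(max_error, step / 2)                             # error stays within half a step
```

The payload shrinks by a factor of four relative to float32. The price is a per-weight error of up to half a quantization step, which then gets averaged into the global model.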

───────────────────────────────────────────────────

Conclusion

Federated Learning solves a real constraint: you cannot always centralize data, but you still need to learn from it collectively. The FedAvg algorithm provides a clean solution — train locally, average globally, repeat.

The core formula is:

w_{t+1} = Σ_{k=1}^{K}  (n_k / n)  ·  w_k^{t+1}

A weighted average of local models where data volume determines influence. That single equation underlies some of the most privacy-conscious ML systems running in production today.

The challenges — Non-IID data, communication cost, Byzantine robustness — are real and still active research problems. But the foundation is solid, the tooling is maturing (Flower, NVIDIA FLARE, PySyft), and the deployment track record is growing.

References

McMahan, B. et al. (2017). Communication-Efficient Learning of Deep Networks from Decentralized Data. AISTATS. arXiv:1602.05629

Li, T. et al. (2020). Federated Optimization in Heterogeneous Networks (FedProx). MLSys.

Libraries used: NumPy · Python 3.10
Parameters: C=0.6 · E=3 · rounds=10 · lr=0.01
