8 posts tagged with "federated learning"

View All Tags

Reducing Training Time in Cross-Silo Federated Learning using Multigraph Topology (Part 4)

In previous post, we have mentioned multigraph parsing proccess, how to train a multigraph under decentralized federated learning, and experimental setups for Multigraph. In the post, we will mention the effectiveness and efficiency of multigraph topology design under different configurations.

Our code can be found at: https://github.com/aioz-ai/MultigraphFL

1. Cycle Time Comparison

Table 1 shows the cycle time of our method in comparison with other recent approaches. This table illustrates that our proposed method significantly reduces the cycle time in all setups with different networks and datasets. In particular, compared to the state-of-the-art RING, our method reduces the cycle time by 2.182.18, 1.51.5, 1.741.74 times in average in the FEMNIST, iNaturalist, Sentiment140 dataset, respectively. Our method also clearly outperforms MACHA, MACHA(+), and MST by a large margin. The results confirm that our multigraph with isolated nodes helps reduce the cycle time in federated learning.

From Table1, our multigraph achieves the minimum improvement under the Amazon network in all three datasets. This can be explained that, under the Amazon network, our proposed topology does not generate many isolated nodes. Hence, the improvement is limited. Intuitively, when there are no isolated nodes, our multigraph will become the overlay, and the cycle time of our multigraph will be equal to the cycle time of the overlay in RING.


2. Isolated Node Analysis

Isolated Nodes vs. Network Configuration. The numbers of isolated nodes vary based on the network configuration (Amazon, Gaia, Exodus, etc.). The parameter tt (maximum number of edges between two nodes), and the delay time which is identified by many factors (geometry distance, model size, computational cost based on tasks, bandwidth, etc also affect the process of generating isolated nodes. Table 2 illustrates the effectiveness of isolated nodes in different network configurations. Specifically, we conduct experiments on the FEMNIST dataset using five network configurations (Gaia, Amazon, Geant, Exodus, Ebone). We can see that our cycle time compared with RING is reduced significantly when more communication rounds or graph states have isolated nodes. Tab-2

Table 2: The effectiveness of isolated nodes under different network configurations. All experiments are trained with 6400 communication rounds on FEMNIST dataset.We then record the number of states and rounds that have the appearance of isolated nodes and compare our cycle time with RING.

Isolated Nodes vs. RING vs. Random Strategy. Isolated nodes play an important role in our method as we can skip the model aggregation step in the isolated nodes. In practice, we can have a trivial solution to create isolated nodes by randomly removing some nodes from the overlay of RING. Table 3 shows the experiment results in two scenarios on FEMNIST dataset and Exodus Network: i} Randomly remove some silos in the overlay of RING, and ii} Remove the most inefficient silos (i.e., silos with the longest delay) in the overlay of RING. From Table 3, the cycle time reduces significantly when two aforementioned scenarios are applied. However, the accuracy of the model also drops significantly. This experiment shows that although randomly removing some nodes from the overlay of RING is a trivial solution, it can not maintain model accuracy. On the other hand, our multigraph not only reduces the cycle time of the model but also preserves the accuracy. This is because our multigraph can skip the aggregation step of the isolated nodes in a communication round. However, in the next round, the delay time of these isolated nodes will be updated, and they can become normal nodes and contribute to the final model.


Table 3: The cycle time and accuracy of our multigraph vs. RING with different criteria.

Isolated Nodes Illustration. Figure belows shows a detailed illustration of our algorithm with the isolated nodes in a real-world training scenario. The experiment is conducted on Gaia network geometry and their corresponding hardware for supporting link latency computation. The image classification task is chosen for this benchmarking by using FEMNIST dataset and CNN backbone provided by Marfod \etal. Hence, we keep the model transmitted size at 4.624.62 Mb, all access links have 1010 Gbps traffic capacity, the number of local updates is set to 11, and the maximum number of edges tt is set to 33. As shown in this Figure, although there are no isolated nodes in the initialized state, the number of isolated nodes increases in the next consequence states with a vast number (4 nodes). This circumstance leads to a 4\sim 4 times reduction in cycle time compared to the initialized state. The appearance of isolated nodes also greatly reduces the connection between silos by 3.6\sim 3.6 times, from 1111 down to 33 connections, and also discarded ones all have high latency.


3. Multigraph Ablation Study

Accuracy Analysis. In federated learning, improving the model accuracy is not the main focus of topology designing methods. However, preserving the accuracy is also important to ensure model convergence. Table 4 shows the accuracy of different topologies after 6,4006,400 communication training rounds on the FEMNIST dataset. This table illustrates that our proposed method achieves competitive accuracy with other topology designs. This confirms that our topology can maintain the accuracy of the model, while significantly reducing the training time.


Table 4: Accuracy comparison between different topologies. The experiment is conducted using the FEMNIST dataset. The accuracy is reported after 6,4006,400 communication rounds in all methods.

Convergence Analysis. Figure belows shows the training loss versus the number of communication rounds and the wall-clock time under Exodus network using the FEMNIST dataset. This figure illustrates that our proposed topology converges faster than other methods while maintaining the model accuracy. We observe the same results in other datasets and network setups.

Cycle Time and Accuracy Trade-off. In our method, the maximum number of edges between two nodes tt mainly affects the number of isolated nodes. This leads to a trade-off between the model accuracy and cycle time. Table 5 illustrates the effectiveness of this parameter. When t=1t = 1, we technically consider there are no weak connections and isolated nodes. Therefore, our method uses the original overlay from RING. When tt is set higher, we can increase the number of isolated nodes, hence decreasing the cycle time. In practice, too many isolated nodes will limit the model weights to be exchanged between silos. Therefore, models at isolated nodes are biased to their local data and consequently affect the final accuracy.


Table 5: Cycle time and accuracy trade-off with different value of tt, i.e., the maximum number of edges between two nodes.

4. Conclusion

We proposed a new multigraph topology for cross-silo federated learning. Our method first constructs the multigraph using the overlay. Different graph states are then parsed from the multigraph and used in each communication round. Our method significantly reduces the cycle time by allowing the isolated nodes in the multigraph to do model aggregation without waiting for other nodes. The intensive experiments on three datasets show that our proposed topology achieves new state-of-the-art results in all network and dataset setups.

Reducing Training Time in Cross-Silo Federated Learning using Multigraph Topology (Part 3)

In previous part, we have investigated that how delay time and cycle time is affected by the modification of multigraph in the design of the topology. Also, we will explore how multigraph can be constructed. In this post, we will mention multigraph parsing proccess, how to train a multigraph under decentralized federated learning, and experimental setups for Multigraph.

Our code can be found at: https://github.com/aioz-ai/MultigraphFL

1. Multigraph Parsing

In Algorithm~\ref{alg:state_form}, we parse multigraph Gm\mathcal{G}_m into multiple graph states Gms\mathcal{G}_m^s. Graph states are essential to identify the connection status of silos in a specific communication round to perform model aggregation. In each graph state, our goal is to identify the isolated nodes. During the training, isolated nodes update their weights internally and ignore all weakly-connected edges that connect to them.

To parse the multigraph into graph states, we first identify the maximum of states in a multigraph smaxs_{\max} by using the least common multiple (LCM). We then parse the multigraph into smaxs_{\max} states. The first state is always the overlay since we want to make sure all silos have a reliable topology at the beginning to ease the training. The reminding states are parsed so there is only one connection between two nodes. Using our algorithm, some states will contain isolated nodes. During the training process, only one graph state is used in a communication round. Figure below illustrates the training process in each communication round using multiple graph states.

2. Multigraph Training

In each communication round, a state graph Gms\mathcal{G}_m^s is selected in a sequence that identifies the topology design used for training. We then collect all strongly-connected edges in the graph state Gms\mathcal{G}_m^s in such a way that nodes with strongly-connected edges need to wait for neighbors, while the isolated ones can update their models. We train our multigraph with DPASGD algorithm:

wi(k+1)={jNi++{i}Ai,jwj(kh),k0&Ni++>1,wi(k)αk1bh=1bLi(wi(k),ξi(h)(k)),otherwise.w_{i}\left(k + 1\right) = \begin{cases} \sum_{j \in \mathcal{N}_{i}^{++} \cup{\{i\}}}A_{i,j}w_{j}\left(k - h\right), \forall k \equiv 0 \& \left|\mathcal{N}_{i}^{++}\right| > 1 ,\\ w_{i}\left(k\right)-\alpha_{k}\frac{1}{b}\sum^b_{h=1}\nabla L_i\left(w_{i}\left(k\right),\xi_i^{\left(h\right)}\left(k\right)\right), otherwise. \end{cases}

where (kh)(k- h) is the index of the considered weights; hh is initialized to 00 and h=h+1ekh(i,j)=0h = h + 1 \forall e_{k-h}(i,j) = 0. Through Equation above, at each state, if a silo is not an isolated node, it must wait for the model from its neighbor to update its weight. If a silo is an isolated node, it can use the model in its neighbor from the (kh)(k-h) round to update its weight immediately. The training procedure is described as below:

3. Algorithm Complexity

It is trivial to see that the complexity of training procedure is O(n2)\mathcal{O}(n^2). In practice, since the cross-silo federated learning setting has only a few hundred silos (n<500n<500), the time to execute our algorithms is just a tiny fraction of training time. Therefore, our proposed topology still can significantly reduce the overall wall-clock training time.

4. Experimental Setups

Datasets. We use three datasets in our experiments: Sentiment140, iNaturalist, and FEMNIST. All datasets and the pre-processing process are conducted by following recent works. Table below shows the dataset setups in detail.

Network. We consider five distributed networks in our experiments: Exodus, Ebone, Géant, Amazon and Gaia. The Exodus, Ebone, and Géant are from the Internet Topology Zoo. The Amazon and Gaia network are synthetic and are constructed using the geographical locations of the data centers.

Baselines. We compare our multigraph topology with recent state-of-the-art topology designs for federated learning: STAR, MATCHA, MATCHA(+), MST, and RING.

Hardware Setup. Since measuring the cycle time is crucial to compare the effectiveness of all topologies in practice, we test and report the cycle time of all baselines and our method on the same NVIDIA Tesla P100 16Gb GPUs. No overclocking is used.

Time Simulator. We adapted PyTorch with the MPI backend to run DPASGD and DPASGD++ on a GPU cluster. We take advantage of the network simulator, the Time Simulator, which uses an arbitrary topology and computation times of silos as input to calculate the time instants at which local models are computed. The wall-clock time is reconstructed by this time simulator needs thorough understanding of the topology, including all factors mentioned in Delay Equations in each network configuration. The related configuration information is already provided in GAIA Network, and the simulator is created by Marfod \etal.


In the next post, we will mention the effectiveness and efficiency of multigraph topology design under different configurations.

Reducing Training Time in Cross-Silo Federated Learning using Multigraph Topology (Part 1)

Federated learning is an active research topic since it enables several participants to jointly train a model without sharing local data. Currently, cross-silo federated learning is a popular training setting that utilizes a few hundred reliable data silos with high-speed access links to training a model. While this approach has been widely applied in real-world scenarios, designing a robust topology to reduce the training time remains an open problem. In this paper, we present a new multigraph topology for cross-silo federated learning. We first construct the multigraph using the overlay graph. We then parse this multigraph into different simple graphs with isolated nodes. The existence of isolated nodes allows us to perform model aggregation without waiting for other nodes, hence effectively reducing the training time. Intensive experiments on three public datasets show that our proposed method significantly reduces the training time compared with recent state-of-the-art topologies while maintaining the accuracy of the learned model.

Our code can be found at: https://github.com/aioz-ai/MultigraphFL

1. Introduction

Federated learning involves training models using remote devices or isolated data centers while keeping the data localized to respect user privacy policies. According to available literature, there are two prominent training scenarios: the "cross-device" scenario, which includes numerous unreliable edge devices with limited computational capacity and slow connection speeds, and the "cross-silo" scenario, which features a smaller number of reliable data silos with powerful computing resources and high-speed access links. Recently, the cross-silo scenario has gained traction in various federated learning applications.

In practical terms, federated learning represents a promising research avenue that allows us to harness the capabilities of machine learning techniques while upholding user privacy. Key obstacles in federated learning encompass issues like model convergence, communication bottlenecks, and disparities in data distributions across different silos. A commonly employed federated training approach involves establishing a central node responsible for overseeing the training process and aggregating contributions from all clients. However, a drawback of this client-server approach is the potential for communication bottlenecks, especially when dealing with a large number of clients. To mitigate this limitation, recent research has explored the concept of decentralized or peer-to-peer federated learning, where communication occurs via a peer-to-peer network topology, eliminating the need for a central node. Nevertheless, a major challenge in decentralized federated learning remains achieving rapid training while ensuring model convergence and preserving model accuracy.

In federated learning, the structure of communication networks holds significant importance. Specifically, an efficient network design contributes to quicker convergence, resulting in reduced training duration and energy consumption, as measured by worst-case convergence bounds within the topology's framework. Additionally, the topology's design has direct implications for other training-related challenges, including network congestion, overall model accuracy, and energy efficiency. The development of a resilient network structure capable of minimizing training time while preserving model accuracy remains an ongoing challenge in federated learning. Our paper is dedicated to devising a novel network design tailored for cross-silo federated learning, a prevalent scenario in practical applications.

Figure 1. We conducted a comparative analysis of various network structures using the FEMNIST dataset and the Exodus network. After completing 6,400 communication rounds, we measured and reported both the accuracy and the total wall-clock training time (or overhead time). Notably, our approach resulted in a substantial reduction in training duration while upholding model accuracy.

Lately, various network configurations have emerged for cross-silo federated learning. For instance, the STAR topology involves an orchestrator averaging all models during each communication round. Another approach, known as MATCHA, divides potential communications into pairs of clients, with random selection for model transmission in each round. Additionally, the RING topology employs max-plus linear systems. Despite progress in this field, challenges persist, including access link congestion, straggler effects, and the establishment of diverse topologies across communication rounds.

In this paper, we introduce a novel multigraph topology inspired by recent advancements in federated learning. Our aim is to enhance the efficiency of cross-silo federated learning. Our approach involves constructing a multigraph based on the overlay of existing network topologies. Subsequently, we decompose this multigraph into simpler graphs, each featuring only a single edge connecting two nodes. These individual graphs are referred to as "states" within the multigraph. Importantly, each state can involve isolated nodes that perform model aggregation independently, reducing the cycle time in each communication round significantly. Our intensive experiments demonstrate that our proposed topology outperforms existing state-of-the-art methods by a wide margin in terms of training time for cross-silo federated learning, as illustrated in Figure.1.

2. Overview

Federated Learning is recognized for its capacity to safeguard data privacy. In its modern incarnation, federated learning adopts a centralized network design, where a central node collects gradients from client nodes to update a global model. Early contributions in federated learning research include pioneering work and seminal papers by various researchers. Subsequent extensions and developments in federated learning and related distributed optimization algorithms have been proposed. Federated Averaging (FedAvg), initially introduced by one group, has inspired variations and other recent state-of-the-art model aggregation techniques, addressing convergence and the non-IID (non-identically and independently distributed) data challenge. Despite its simplicity, the client-server approach faces communication and computational bottlenecks at the central node, particularly when dealing with a large number of clients.

Decentralized Federated Learning flips the traditional federated learning model, enabling direct interactions between siloed data nodes, eliminating the necessity for a central coordinating node. While this approach mitigates communication congestion at a central point, optimizing a fully peer-to-peer network presents substantial challenges. The decentralized periodic averaging stochastic gradient descent method has demonstrated convergence rates comparable to centralized algorithms, making large-scale model training feasible. Furthermore, previous research has conducted systematic analyses of decentralized federated learning. A recent advancement involves leveraging a knowledge distillation mechanism to facilitate collaboration among silos in decentralized federated scenarios while preserving privacy among neighboring nodes.

Communication Topology plays a fundamental role in influencing the complexity and convergence behavior of federated learning. Numerous efforts have been dedicated to improving the efficiency of communication topologies, including star-shaped topologies and optimized-shaped topologies. In particular, a spanning tree topology has been introduced to reduce training time.

The STAR topology is designed for orchestrating the averaging of model updates in each communication round. Meanwhile, the MATCHA approach focuses on accelerating the training process through decomposition sampling. Recognizing the impact of straggler effects on communication round duration, methods for selecting the degree of a regular topology have been explored.

The RING topology is tailored for cross-silo federated learning and leverages the principles of max-plus linear systems. A sample-induced topology has been introduced, capable of effectively recovering the performance of existing SGD-based algorithms and their corresponding convergence rates. In a recent comprehensive survey, various models, frameworks, and algorithms related to network topologies in federated learning have been explored.

Multigraph is a concept that originates from traditional mathematics. In conventional terms, a "graph" typically denotes a simple graph without loops or multiple edges between two nodes. In contrast, a multigraph allows for the presence of multiple edges between two nodes. In the realm of deep learning, multigraphs have found utility across various domains, including clustering, medical image processing, traffic flow prediction, activity recognition, recommendation systems, and cross-domain adaptation. In this research, we employ a multigraph construction to facilitate isolated nodes and expedite training in cross-silo federated learning.

3. Preliminaries

3.1 Federated Learning

In federated learning, silos do not share their local data, but still periodically transmit model updates between them. Given NN siloed data centers, the objective function for federated learning is:

minwRdi=1NpiEξi[Li(w,ξi)],\min_{\textbf{w} \in \mathbb R^d} \sum^{N}_{i=1}p_i E_{\xi_i}\left[ L_{i}\left(\textbf{w}, \xi_i\right)\right],

where Li(w,ξi)L_{i}(\textbf{w}, \xi_i) is the loss of model parameterized by the weight wRd\textbf{w} \in \mathbb R^d, ξi\xi_i is an input sample drawn from data at silo ii, and the coefficient pi>0p_i>0 specifies the relative importance of each silo. Recently, different distributed algorithms have been proposed to optimize the equation. In this work, DPASGD is used to update the weight of silo ii in each training round as follows:

wi(k+1)={jNi+{i}Ai,jwj(k),if k0(mod u+1),wi(k)αk1bh=1bLi(wi(k),ξi(h)(k)),otherwise.\textbf{w}_{i}\left(k + 1\right) = \\ \begin{cases} \sum_{j \in \mathcal{N}_i^{+} \cup{\{i\}}}\textbf{A}_{i,j}\textbf{w}_{j}\left(k\right), \\\qquad\qquad\qquad\qquad\qquad \text{if k} \equiv 0 \left(\text{mod }u + 1\right),\\ \textbf{w}_{i}\left(k\right)-\alpha_{k}\frac{1}{b}\sum^b_{h=1}\nabla L_i\left(\textbf{w}_{i}\left(k\right),\xi_i^{\left(h\right)}\left(k\right)\right), \\\qquad\qquad\qquad\qquad\qquad\qquad\qquad\quad\text{otherwise.} \end{cases}

where bb is the batch size, i,ji,j denote the silo, uu is the number of local updates, αk>0\alpha_k > 0 is a potentially varying learning rate at kk-th round, ARN×N\textbf{A} \in R^{N \times N} is a consensus matrix with non-negative weights, and Ni+\mathcal{N}_i^{+} is the in-neighbors set that silo ii has the connection to.

3.2 Multigraph for Federated Learning

Connectivity and Overlay. We consider the \textit{connectivity} Gc=(V,Ec)\mathcal{G}_c = (\mathcal{V}, \mathcal{E}_c) as a graph that captures possible direct communications among silos. Based on its definition, the connectivity is often a fully connected graph and is also a directed graph. % whenever the upload and download are set during learning. The \textit{overlay} Go\mathcal{G}_o is a connected subgraph of the connectivity graph, i.e., Go=(V,Eo)\mathcal{G}_o = (\mathcal{V}, \mathcal{E}_o), where EoEc\mathcal E_o \subset \mathcal E_c. Only nodes directly connected in the overlay graph Go\mathcal{G}_o will exchange the messages during training.

Multigraph. While the connectivity and overlay graph can represent different topologies for federated learning, one of their drawbacks is that there is only one connection between two nodes. In our work, we construct a \textit{multigraph} Gm=(V,Em)\mathcal{G}_m = (\mathcal{V}, \mathcal{E}_m) from the overlay Go\mathcal{G}_o. The multigraph can contain multiple edges between two nodes. In practice, we parse this multigraph to different graph states, each state is a simple graph with only one edge between two nodes.

In the multigraph Gm\mathcal{G}_m, the connection edge between two nodes has two types: \textit{strongly-connected} edge and \textit{weakly-connected} edge. Under both strong and weak connections, the participated nodes can transmit their trained models to their out-neighbours Ni\mathcal{N}_i^{-} or download models from their in-neighbours Ni+\mathcal{N}_i^{+}. However, in a strongly-connected edge, two nodes in the graph must wait until all upload and download processes between them are finished to do model aggregation. On the other hand, in a weakly-connected edge, the model aggregation process in each node can be established whenever the previous training process is finished by leveraging up-to-date models which have not been used before from the in-neighbours of that node.

State of Multigraph Given a multigraph Gm\mathcal{G}_m, we can parse this multigraph into different simple graphs with only one connection between two nodes (either strongly-connected or weakly-connected). We denote each simple graph as a state Gms\mathcal{G}_m^s of the multigraph.

Isolated Node. A node is called isolated when all of its connections to other nodes are weakly-connected edges. Figure.2 shows the graph concepts and isolated nodes.

Figure 2. Example of connectivity, overlay, multigraph, and a state of our multigraph. Blue node is an isolated node. Dotted line denotes a weakly-connected edge.


In the next post, we will mention the delay time and cylce time also how multigraph can be constructed.

Reducing Training Time in Cross-Silo Federated Learning using Multigraph Topology (Part 2)

In previous paart, we how explore decentralized federated learning, how and why multigraph is proposed to improve training process. In this part, we will investigate that how delay time and cycle time is affected by the modification of multigraph in the design of the topology. Also, we will explore how multigraph can be constructed.

Our code can be found at: https://github.com/aioz-ai/MultigraphFL

1. Delay time in multigraph

A delay to an edge e(i,j)e(i, j) is the time interval when node jj receives the weight sending by node ii, which can be defined by:

d(i,j)=u×Tc(i)+l(i,j)+MO(i,j),d(i,j) = u \times T_c(i) + l(i,j) + \frac{M}{O(i,j)},

where Tc(i)T_{c}(i) denotes the time to compute one local update of the model; uu is the number of local updates; l(i,j)l(i,j) is the link latency; MM is the model size; O(i,j)O(i, j) is the total network traffic capacity. However, unlike other communication infrastructures, the multigraph only contains connections between silos without other nodes such as routers or amplifiers. Thus, the total network traffic capacity O(i,j)=min(CUP(i)Ni,CDN(j)Ni+)O(i,j) = \text{min}\left(\frac{C_{\rm{UP}}(i)}{\left|{\mathcal{N}_{i}^{-}}\right|}, \frac{C_{\rm{DN}}(j)}{\left|\mathcal{N}_{i}^{+}\right|}\right) where CUPC_{\rm{UP}} and CDNC_{\rm{DN}} denote the upload and download link capacity. Note that the upload and download processes can happen in parallel.

Since multigraph can contain multiple edges between two nodes, we extend the definition of the delay in the previous equation to dk(i,j)d_k(i,j), with kk is the kk-th communication round during the training process, as:

dk+1(i,j)={dk(i,j),if ek+1(i,j)=1 and ek(i,j)=1max(u×Tc(j),dk(i,j)dk1(i,j)),if ek+1(i,j)=1 and ek(i,j)=0τk(Gm)+dk1(i,j)),if ek+1(i,j)=0 and ek(i,j)=1τk(Gm),otherwised_{k+1}(i,j) = \begin{cases} d_k(i,j), \\\qquad \qquad \text{if } e_{k+1}(i,j) = 1\text{ and }e_{k}(i,j) = 1\\ \text{max}( u \times T_c(j),d_{k}(i,j) - d_{k-1}(i,j)), \\\qquad\qquad\text{if }e_{k+1}(i,j) = 1\text{ and }e_{k}(i,j) = 0\\ \tau_k(\mathcal{G}_m) + d_{k-1}(i,j)), \\\qquad\qquad\text{if } e_{k+1}(i,j) = 0\text{ and }e_{k}(i,j) = 1\\ \tau_k(\mathcal{G}_m), \\\qquad\qquad\text{otherwise} \end{cases}

where e(i,j)e(i,j)==0\mathbb{0} is weakly-connected edge, e(i,j)e(i,j)==1\mathbb{1} is strongly-connected edge; $ \tau_k(\mathcal{G}_m)$ is the cycle time at the kk-th computation round during the training process.

In general, using introduced equation, \textit{the delay of the next communication round dk+1d_{k+1} is updated based on the delay of the previous rounds} and other factors, depending on the edge type connection.

2. Cycle time in multigraph

The cycle time per round is the time required to complete a communication round. In this work, we define the cycle time per round is the maximum delay between all silo pairs with strongly-connected edges. Therefore, the average cycle time of the entire training is:

τ(Gm)=1kk=0k1(maxjNi++{i},iV(dk(j,i))),\tau(\mathcal{G}_m) =\frac{1}{k }\sum^{k-1}_{k=0} \left(\underset{j \in \mathcal{N}^{++}_{i} \cup\{i\}, \forall i \in \mathcal{V}}{\text{max}} \left(d_k\left(j,i\right)\right)\right),

where Ni++\mathcal{N}_{i}^{++} is an in-neighbors silo set of ii whose edges are strongly-connected.

3. Multigraph Construction

Algorithm 1 describes our methods to generate the multigraph Gm\mathcal{G}_m with multiple edges between silos. The algorithm takes the overlay Go\mathcal{G}_o as input. Similar to~\cite{marfoq2020throughput}, we use the Christofides algorithm to obtain the overlay. In Algorithm 1, we establish multiple edges that indicate different statuses (strongly-connected or weakly-connected). To identify the total edges between a silo pair, we divide the delay d(i,j)d(i,j) by the smallest delay dmind_{\min} overall silo pairs, and compare it with the maximum number of edges parameter tt (t=5t=5 in our experiments). \textit{We assume that the silo pairs with longer delay will have more weakly-connected edges, hence potentially becoming isolated nodes}. Overall, we aim to increase the number of weakly-connected edges, which generate more isolated nodes to speed up the training process. Note that, from Algorithm 1, each silo pair in the multigraph should have one strongly-connected edge and multiple weakly-connected edges. The role of the strongly-connected edge is to make sure that two silos have a good connection in at least one communication round.


In the next post, we will mention multigraph parsing proccess and how to train a multigraph under decentralized federated learning.

Deep Federated Learning for Autonomous Driving (Part 2)

In previous part, we have discussed about Autonomous driving FADNetwork. In this post, we will verify the effectiveness and efficiency of it.

Our source code can be found at: https://github.com/aioz-ai/FADNet

1. Experimental Setup

Udacity. We use the popular Udacity dataset to evaluate our results. We only use front-forwarded images in this dataset in our experiment. We use 55 sequences for training and 11 for testing. The training sequences are assigned randomly to different silos depending on the federated topology (i.e., Gaia or NWS).

Carla. Since the Udacity dataset is collected in the real-world environment, changing the weather or lighting conditions is not easy. To this end, we collect more simulated data in the Carla simulator. We have applied different lighting (morning, noon, night, sunrise, sunset) and weather conditions (cloudy, rain, heavy rain, wet streets, windy, snowy) when collecting the data. We have generated 73,23573,235 samples distributed over 1111 sequences of scenes.

Gazebo. Since both the Udacity and Carla datasets are collected in outdoor environments, we also employ Gazebo to collect data for autonomous navigation in indoor scenes. We use a simulated mobile robot and the built-in scenes to collect data. Table.1 shows the statistics of three datasets. We use 80%80\% of the collected data in Gazebo and Carla data for training, and the rest 20%20\% for testing.

Figure 1. Visualization of sample images in three datasets: Udacity (first row), Gazebo (second row), and Carla (third row).

DatasetTotal samplesAverage samples in each silo (Gaia)Average samples in each silo (NWS)

Table 1. The Statistic of Datasets in Our Experiments.

Network Topology. We conduct experiments on two topologies: the Internet Topology Zoo (Gaia), and the North America data centers (NWS). We use Gaia topology in our main experiment and provide the comparison of two topologies in our ablation study.

Training. The model in a silo is trained with a batch size of 3232 and a learning rate of 0.0010.001 using Adam optimizer. We follow the training process to obtain a global weight of all silos. The training process is conducted with 30003000 communication rounds and each silo has one NVIDIA 1080 11 GB GPU for training. Note that, one communication round is counted each time all silos have finished updating their model weights.

Baselines. We compare our results with various recent methods, including Random baseline and Constant baseline, Inception-V3, MobileNet-V2, VGG-16, and Dronet. All these methods use the Centralized Local Learning (CLL) strategy (i.e., the data are collected and trained in one local machine.) For distributed learning, we compare our Deep Federated Learning (DFL) approach with the Server-based Federated Learning (SFL) strategy. As the standard practice, we use the root-mean-square error (RMSE) metric to evaluate the results.

2. Results

Table 2 summarises the performance of our method and recent state-of-the-art approaches. We notice that our FADNet is trained using the proposed peer-to-peer DFL using the Gaia topology with 11 silos. This table clearly shows our FDANet + DFL outperforms other methods by a fair margin. In particular, our FDANet + DFL significantly reduces the RMSE in Gazebo and Carla datasets, while slightly outperforms DroNet in the Udacity dataset. These results validate the robustness of our FADNet while is being trained in a fully decentralized setting. Table 3 also shows that with a proper deep architecture such as our FADNet, we can achieve state-of-the-art accuracy when training the deep model in FL. Fig. 2 illustrates the spatial support regions when our FADNet making the prediction. Particularly, we can see that FADNet focuses on the ``line-like" patterns in the input frame, which guides the driving direction.

ArchitectureLearning MethodUdacityGazeboCarla#Params
FADNet (ours)DFL0.1070.0690.203317,729

Table 2. Performance comparison of different architectures on the Udacity, Gazebo, and Carla datasets. The number of parameters (#Params) is also provided.

Figure 2. Spatial support regions for predicting steering angle in three datasets. In most cases, we can observe that our FADNet focuses on ``line-like” patterns to predict the driving direction.

3. Ablation Studies

Effectiveness of our DFL.

Table 3 summarises the accuracy of DroNet and our FADNet when we train them using different learning methods: CLL, SFL, and our peer-to-peer DFL. From this table, we can see that training both DroNet and FADNet with our peer-to-peer DFL clearly improves the accuracy compared with the SFL approach. This confirms the robustness of our fully decentralized approach and removes a need of a central server when we train a deep network with FL. Compared with the traditional CLL approach, our DFL also shows a competitive performance. However, we note that training a deep architecture using CLL is less complicated than with SFL or DFL. Furthermore, CLL is not a federated learning approach and does not take into account the privacy of the user data.

ArchitectureLearning MethodUdacityGazeboCarla
DFL (ours)0.1520.0730.244
FADNet (ours)CLL0.1420.0810.303
DFL (ours)0.1070.0690.203

Table 3. Performance comparison of different methods.

Effectiveness of our FADNet.

Table 3 shows that apart from the learning method, the deep architectures also affect the final results. This table illustrates that our FADNet combined with DFL outperforms DroNet in all configurations. We notice that DroNet achieves competitive results when being trained with CLL. However DroNet is not designed for federated training, hence it does not achieve good accuracy when being trained with SFL or DFL. On the other hand, our introduced FADNet is particularly designed with dedicated layers to handle the data imbalance and model convergence problem in federated training. Therefore, FADNet achieves new state-of-the-art results in all three datasets.

Network Topology Analysis.

Table 4 illustrates the performance of DroNet and our FADNet when we train them using DFL under two distributed network topologies: Gaia and NWS. This table shows that the results of DroNet and FADNet under DFL are stable in both Gaia and NWS distributed networks. We note that the NWS topology has 22-silos while the Gaia topology has only 11 silos. This result validates that our FADNet and DFL do not depend on the distributed network topology. Therefore, we can potentially use them in practice with more silo data.

Network TopologyArchitectureUdacityGazeboCarla
Gaia (11 silos)DroNet0.1520.0730.244
FADNet (ours)0.1070.0690.203
NWS (22 silos)DroNet0.1570.0750.239
FADNet (ours)0.1090.0700.200

Table 4. Performance comparison of different network topologies.

Convergence Analysis.

The effectiveness of federated learning algorithms is identified through the convergence ability, including accuracy and training speed, especially when dealing with the increasing number of silos in practice. Fig.3 shows the convergence ability of our FADNet with DFL using two topologies: Gaia with 11 silos, and NWS with 22 silos. This figure shows that our proposed DFL achieves the best results in Gaia and NWS topology and converges faster than the SFL approach in both Gazebo and Carla datasets. We also notice that the performance of our DFL is stable when there is an increase in the number of silos. Specifically, training our FADNet with DFL reaches the converged point after approximately 150150s, 180180s on the NWS and Gaia topology, respectively. Fig.3 validates the convergence ability of our FADNet and DFL, especially when dealing with the increasing number of silos.

In practice, compared with the traditional CLL approach, federated learning methods such as SFL or DFL can leverage more GPUs remotely. Therefore, we can reduce the total training time significantly. However, the drawback of federated learning is we would need more GPUs in total (ideally one for each silo), and deep architecture also should be carefully designed to ensure model convergence.

Figure 3. The convergence ability of our FADNet and DFL under Gaia and NWS topology. Wall-clock time or elapsed real-time is the actual time taken from the start of the whole training process to the end, including the synchronization time of the weight aggregation process. All experiments are conducted with 3,0003,000 communication rounds.


To verify the effectiveness of our FADNet in practice, we deploy the model trained on the Gazebo dataset on a mobile robot. The robot is equipped with a RealSense camera to capture the front RGB images. Our FADNet is deployed on a Qualcomm RB5 board to make the prediction of the steering angle for the robot. The processing time of our FADNet on the Qualcomm RB5 board is approximately 1212 frames per second. Overall, we observe that the robot can navigate smoothly in an indoor environment without colliding with obstacles. More qualitative results can be found in our supplementary material.


We propose a new approach to learn an autonomous driving policy from sensory data without violating the user's privacy. We introduce a peer-to-peer deep federated learning (DFL) method that effectively utilizes the user data in a fully distributed manner. Furthermore, we develop a new deep architecture - FADNet that is well suitable for distributed training. The intensive experimental results on three datasets show that our FADNet with DFL outperforms recent state-of-the-art methods by a fair margin. Currently, our deployment experiment is limited to a mobile robot in an indoor environment. In the future, we would like to test our approach with more silos and deploy the trained model using an autonomous car on man-made roads.

Deep Federated Learning for Autonomous Driving (Part 1)

Autonomous driving is an active research topic in both academia and industry. However, most of the existing solutions focus on improving the accuracy by training learnable models with centralized large-scale data. Therefore, these methods do not take into account the user's privacy. In this paper, we present a new approach to learn autonomous driving policy while respecting privacy concerns. We propose a peer-to-peer Deep Federated Learning (DFL) approach to train deep architectures in a fully decentralized manner and remove the need for central orchestration. We design a new Federated Autonomous Driving network (FADNet) that can improve the model stability, ensure convergence, and handle imbalanced data distribution problems while is being trained with federated learning methods. Intensively experimental results on three datasets show that our approach with FADNet and DFL achieves superior accuracy compared with other recent methods. Furthermore, our approach can maintain privacy by not collecting user data to a central server.

Our source code can be found at: https://github.com/aioz-ai/FADNet

1. Introduction

In this paper, our goal is to develop an end-to-end driving policy from sensory data while maintaining the user's privacy by utilizing FL. We address the key challenges in FL to make sure our deep network can achieve competitive performance when being trained in a fully decentralized manner. Fig.1 shows an overview of different learning approaches for autonomous driving. In Centralized Local Learning (CLL), the data are collected and trained in one local machine. Hence, the CLL approach does not take into account the user's privacy. The Server-based Federated Learning (SFL) strategy requires a central server to orchestrate the training process and receive the contributions of all clients. The main limitation of SFL is communication congestion when the number of clients is large. Therefore, we follow the peer-to-peer federated learningto set up the training. Our peer-to-peer Deep Federated Learning (DFL) is fully decentralized and can reduce communication congestion during training. We also propose a new Federated Autonomous Driving network (FADNet) to address the problem of model convergence and imbalanced data distribution. By training our FADNet using DFL, our approach outperforms recent state-of-the-art methods by a fair margin while maintaining user data privacy.

Figure 1. An overview of different learning methods for autonomous driving. (a) Centralized Local Learning, (b) Server-based Federated Learning, and (c) our peer-to-peer Deep Federated Learning. Red arrows denote the aggregation process between silos. Yellow lines with a red cross indicate the non-sharing data between silos.

Our contributions can be summarized as follows:

  • We propose a fully decentralized, peer-to-peer Deep Federated Learning framework for training autonomous driving solutions.
  • We introduce a Federated Autonomous Driving network that is well suitable for federated training.
  • We introduce two new datasets and conduct intensive experiments to validate our results.

2. Problem Formulation

We consider a federated network with NN siloed data centers (e.g., autonomous cars) Di\mathcal{D}_{i}, with i[1,N]i \in [1,N]. Our goal is to collaboratively train a global driving policy θ\theta by aggregating all local learnable weights θi\theta_i of each silo. Note that, unlike the popular centralized local training setup, in FL training, each silo does not share its local data, but periodically transmits model updates to other silos.

In practice, each silo has the training loss Li(ξi,θi)\mathcal{L}_i(\xi_i, \theta_i). ξi\xi_i is the ground-truth in each silo ii. Li(ξi,θi)\mathcal{L}_i(\xi_i, \theta_i) is calculated as the regression loss. This regression loss is modeled by a deep network that takes RGB images as inputs and predicts the associated steering angles.

3. Deep Federated Learning for Autonomous Driving

A popular training method in FL is to set up a central server that orchestrates the training process and receives the contributions of all clients (Server-based Federated Learning - SFL). The limitation of SFL is the server potentially represents a single point of failure in the system. We also may have communication congestion between the server and clients when the number of clients is massive. Therefore, in this work, we utilize the peer-to-peer FL to set up the training scenario. In peer-to-peer FL, there is no centralized orchestration, and the communication is via peer-to-peer topology. However, the main challenge of peer-to-peer FL is to assure model convergence and maintain accuracy in a fully decentralized training setting.

Figure 2. An overview of our peer-to-peer Deep Federated Learning method. (a) A simplified version of an overlay graph. (b) The training methodology in the overlay graph. Note that blue arrows denote the local training process in each silo; red arrows denote the aggregation process between silos controlled by the overlay graph; yellow lines with a red cross indicate the non-sharing data between silos; the arrow indicates that the process is parallel.

Fig.2 illustrates our Deep Federated Learning (DFL) method. Our DFL follows the peer-to-peer FL setup with the goal to integrate a deep architecture into a fully decentralized setting that ensures convergence while achieving competitive results compared to the traditional Centralized Local Learning or SFL approach. In practice, we can consider a silo as an autonomous car. Each silo maintains a local learnable model and does not share its data with other silos. We represent the silos as vertices of a communication graph and the FL is performed on an overlay, which is a sub-graph of this communication graph.

Designing the Overlay

Let Gc=(V,Ec)\mathcal{G}_c = (\mathcal{V}, \mathcal{E}_c) is the connectivity graph that captures the possible direct communications among NN silos. V\mathcal{V} is the set of vertices (silos), while Ec\mathcal{E}_c is the set of communication links between vertices. Ni+\mathcal{N}_i^{+} and Ni\mathcal{N}_i^{-} are in-neighbors and out-neighbors of a silo ii, respectively. As in~\cite{marfoq2020throughput}, we note that it is unnecessary to use all the connections of the connectivity graph for FL. Indeed, a sub-graph called an overlay, Go=(V,Eo)\mathcal{G}_o = (\mathcal{V}, \mathcal{E}_o) can be generated from Gc\mathcal{G}_c. In our work, Go\mathcal{G}_o is the result of Christofides’ Algorithm~\cite{monnot2003approximation}, which yields a strong spanning sub-graph of Gc\mathcal{G}_c with minimal cycle time. One cycle time or time per communication round, in general, is the time that a vertex waits for messages from the other vertices to do a computational update.

In practice, one block cycle time of an overlay Go\mathcal{G}_o depends on the delay of each link (i,j)(i, j), denoted as do(i,j)d_o(i, j), which is the time interval between the beginning of a local computation at node ii, and the receiving of ii's messages by jj. Furthermore, without concerns about access links delays between vertices, our graph is treated as an edge-capacitated network with:

do(i,j)=s×Tc(i)+l(i,j)+MB(i,j)d_o(i,j) = s \times T_c(i) + l(i,j) + \frac{M}{B(i,j)}

where Tc(i)T_c(i) is the time to compute one local update of the model; ss is the number of local computational steps; l(i,j)l(i,j) is the link latency; MM is the model size; B(i,j)B(i,j) is available bandwidth of the path (i,j)(i,j). As in~\cite{marfoq2020throughput}, we set s=1s=1.

Training Algorithm

At each silo ii, the optimization problem to be solved is:

θi=argminθiEξDi[L(ξi,θi)]\theta_i^{*} = \underset{\theta_i}{\arg\min} \underset{\xi \sim \mathcal{D}_i}{\mathbb{E}}[\mathcal{L}(\xi_i, \theta_i)]

We apply the distributed federated learning algorithm, DPASGD, to solve the optimizations of all the silos. In fact, after waiting one cycle time, each silo ii will receive parameters θj\theta_j from its in-neighbor Ni+\mathcal{N}_i^{+} and accumulate these parameters multiplied with a non-negative coefficient from the consensus matrix A\mathbf{A}. It then performs ss mini-batch gradient updates before sending θi\theta_i to its out-neighbors Ni\mathcal{N}_i^{-}, and the algorithm keeps repeating. Formally, at each iteration kk, the updates are described as:

θi(k+1)={jNi+iAi,jθj(k), if k0(mods+1),θi(k)αk1mh=1mL(θi(k),ξi(h)(k)),otherwise.\theta_{i}\left(k + 1\right) = \begin{cases} \sum_{j \in \mathcal{N}_i^{+} \cup{i}}\textbf{A}_{i,j}{\theta}_{j}\left(k\right), \textit{ if k} \equiv 0 \pmod{s + 1},\\ {\theta}_{i}\left(k\right)-\alpha_{k}\frac{1}{m}\sum^m_{h=1}\nabla \mathcal{L}\left({\theta}_{i}\left(k\right),\xi_i^{\left(h\right)}\left(k\right)\right), \text{otherwise.} \end{cases}

where mm is the mini-batch size and αk>0\alpha_k > 0 is a potentially varying learning rate.

Federated Averaging

To compute the prediction of models in all silos, we compute the average model θ\theta using weight aggregation from all the local model θi\theta_i. The federated averaging process is conducted as follow:

θ=1i=0Nλii=0Nλiθi\theta = \frac{1}{\sum^N_{i=0}{\lambda_i}} \sum^N_{i=0}\lambda_{{i}} \theta_{{i}}

where NN is the number of silos; λi={0,1}\lambda_i = \{0,1\}. Note that λi=1\lambda_i = 1 indicates that silo ii joins the inference process and λi=0\lambda_i = 0 if not. The aggregated weight θ\theta is then used for evaluation on the testing set Dtest\mathcal{D}_{test}.

4. Network Architecture

One of the main challenges when training a deep network in FL is the imbalanced and non-IID (identically and independently distributed) problem in data partitioning across silos. To overcome this problem, the learning architecture should have an appropriate design to balance the trade-off between convergence ability and accuracy performance. In practice, the deep architecture has to deal with the high variance between silo weights when the accumulation process for all silos is conducted. To this end, we design a new Federated Autonomous Driving Network, which is based on ResNet8, as shown in Fig.3.

Figure 3. Human Tracking.

In particular, our proposed FADNet first comprises an input layer normalization to improve the stability of the abstract layer. This layer aims to handle different distributions of input images in each silo. Then, a convolution layer following by a max-pooling layer is added to encode the input. To handle the vanishing gradient problem, three residual blocks are appended with a following FC layer to extract ResBlock features. However, using residual blocks increases the variance of silo weights during the aggregation process and affects the convergence ability of the model. To address this problem, we add a Global Average Pooling layer (GAP) associated with each residual block. GAP is a non-weight pooling layer which sums out the spatial information from each residual block. Thus, it is not affected by the weighted variance problem. The output of each GAP layer is passed through an Accumulation layer to accrue the Support feature. The ResBlock feature and the Support feature from GAP layers are fed into the Aggregation layer to calculate the model loss in each silo.

In our design, the Accumulation and Aggregation layers aim to reduce the variance of the global model since we need to combine multiple model weights produced by different silos. In particular, the Accumulation layer is a variant of the fully connected (FC) layer. Instead of weighting the contribution of input nodes as in FC, the Accumulation layer weights the contribution of multiple features from input layers. The Accumulation layer has a learnable weight matrix wRnw \in \mathbb{R}^\text{n}. Its number of nodes is equal to the \text{n} number of input layers. Note that the support feature from the Accumulation layer has the same size as the input. Let F={f1,f2,...,fn},fhRdF = \{f_\text{1}, f_\text{2}, ..., f_\text{n}\}, \forall f_\text{h} \in \mathbb{R}^\text{d} be the collection of n\text{n} number of the features extracted from n\text{n} input GAP layers; d\text{d} is the unified dimension. The Accumulation outputs a feature fcRdf_\text{c} \in \mathbb{R}^\text{d} in each silo ii, and is computed as:

fc=Accumulation(F)i=h=1n(whfh)if_\text{c} = Accumulation(F)_i = \sum^{\text{n}}_{\text{h}=1}(w_\text{h}f_\text{h})_i

The Aggregation layer is a fusion between the ResBlock feature extracted from the backbone and the support feature from the Accumulation layer. For simplicity, we use the Hadamard product to compute the aggregated feature. This feature is then averaged to predict the steering angle. Let fsRdf_\text{s} \in \mathbb{R}^\text{d} be the ResBlock features extracted from the backbone. The output driving policy θi\theta_i of silo ii can be calculated as:

θi=Aggregation(fs,fc)i=(fsfc)ˉi\theta_i = Aggregation(f_\text{s}, f_\text{c})_i = \bar{(f_\text{s} \odot f_\text{c})}_i

where \odot denotes Hadamard product; ()ˉ\bar{(*)} denotes the mean and we set d=6,272\text{d} = 6,272.


In the next post, we will show the effectiveness and efficiency of FADNet during Federated Learning proccess.

Federated Learning, Challenges and Hot Trends for research

Federated learning and its remaining challenge

Federated learning is the process of training statistical models via a network of distant devices or siloed data centers, such as mobile phones or hospitals, while keeping data locally. In terms of federated learning, there are five major obstacles that have a direct impact on the paper publishing trend.

1. Expensive Communication

Due to the internet connection, huge number of users, and administrative costs, there is a bottleneck in communication between devices and server-devices.

2. Systems Heterogeneity

Because of differences in hardware (CPU, RAM), network connection (3G, 4G, 5G, wifi), and power, each device in federated networks may have different storage, computational, and communication capabilities (battery level).

3. Statistical Heterogeneity

Devices routinely create and collect data in non-identically dispersed ways across the network; for example, in the context of a next word prediction, mobile phone users employ a variety of languages. Furthermore, the quantity of data points on different devices may differ greatly, and an underlying structure may exist that describes the interaction between devices and their related distributions. This data generation paradigm violates frequently-used independent and identically distributed (I.I.D. problem) assumptions in distributed optimization, increases the likelihood of stragglers, and may add complexity in terms of modeling, analysis, and evaluation.

4. Privacy Concerns

In federated learning applications, privacy is a crucial problem. Federated learning takes a step toward data protection by sharing model changes, such as gradient information, rather than the raw data created on each device. Nonetheless, transmitting model updates during the training process may divulge sensitive information to a third party or the central server.

5. Domain transfer

Not any task can be applied to the federated learning paradigm to finish their training process due to the aforementioned four challenges.

Hot trends

Data distribution heterogeneity and label inadequacy.

  • Distributed Optimization
  • Non-IID and Model Personalization
  • Semi-Supervised Learning
  • Vertical Federated Learning
  • Decentralized FL
  • Hierarchical FL
  • Neural Architecture Search
  • Transfer Learning
  • Continual Learning
  • Domain Adaptation
  • Reinforcement Learning
  • Bayesian Learning

Security, privacy, fairness, and incentive mechanisms:

  • Adversarial-Attack-and-Defense
  • Privacy
  • Fairness
  • Interpretability
  • Incentive Mechanism

Communication and computational resource constraints, software and hardware heterogeneity, the FL system

  • Communication-Efficiency
  • Straggler Problem
  • Computation Efficiency
  • Wireless Communication and Cloud Computing
  • FL System Design

Models and Applications

  • Models
  • Natural language Processing
  • Computer Vision
  • Health Care
  • Transportation
  • Recommendation System
  • Speech
  • Finance
  • Smart City
  • Robotics
  • Networking
  • Blockchain
  • Other

Benchmark, Dataset, and Survey

  • Benchmark and Dataset
  • Survey

Introduction to Federated Learning

Federated Learning: machine learning over a distributed dataset, where user devices (e.g., desktop, mobile phones, etc.) are utilized to collaboratively learn a shared prediction model while keeping all training data locally on the device. This approach decouples the ability to do machine learning from storing the data in the cloud.

Conceptually, federated learning proposes a mechanism to train a high-quality centralized model. Simultaneously, training data remains distributed over many clients, each with unreliable and relatively slow network connections.

The idea behind federated learning is as conceptually simple as its technologically complex. Traditional machine learning programs relied on a centralized model for training in which a group of servers runs a specific model against training and validation datasets. That centralized training approach can work very efficiently in many scenarios. Still, it has also proven to be challenging in use cases involving a large number of endpoints using and improving the model. The prototypical example of the limitation of the centralized training model can be found in mobile or internet of things(IoT) scenarios. The quality of a model depends on the information processed across hundreds of thousands or millions of devices. Each endpoint can contribute to a machine learning model's training in its own autonomous way in those scenarios. In other words, knowledge is federated.

Blockchain: large, distributed dataset, where no-one can edit/delete an old entry, nor fake a new entry. The data is enforced by fundamental limits of computations (i.e., Proof of Work).

Smart Contract: dataset stored on the Blockchain, which includes: Data (i.e., ledgers, events, statistics), State (today's ledger, today's events), Code (rules for changing state).

Math Examples

Fundamental Theorem of Calculus Let f:[a,b]Rf:[a,b] \to \R be Riemann integrable. Let F:[a,b]RF:[a,b]\to\R be F(x)=axf(t)dtF(x)= \int_{a}^{x}f(t)dt. Then FF is continuous, and at all xx such that ff is continuous at xx, FF is differentiable at xx with F(x)=f(x)F'(x)=f(x).

Lift(LL) can be determined by Lift Coefficient (CLC_L) like the following equation.

L=12ρv2SCLL = \frac{1}{2} \rho v^2 S C_L


[1] J.Rodriguez. Whats New in Deep Learning Research: Understanding Federated Learning.