Reducing Training Time in Cross-Silo Federated Learning using Multigraph Topology (Part 2)
In previous paart, we how explore decentralized federated learning, how and why multigraph is proposed to improve training process. In this part, we will investigate that how delay time and cycle time is affected by the modification of multigraph in the design of the topology. Also, we will explore how multigraph can be constructed.
Our code can be found at: https://github.com/aioz-ai/MultigraphFL
1. Delay time in multigraph
A delay to an edge is the time interval when node receives the weight sending by node , which can be defined by:
where denotes the time to compute one local update of the model; is the number of local updates; is the link latency; is the model size; is the total network traffic capacity. However, unlike other communication infrastructures, the multigraph only contains connections between silos without other nodes such as routers or amplifiers. Thus, the total network traffic capacity where and denote the upload and download link capacity. Note that the upload and download processes can happen in parallel.
Since multigraph can contain multiple edges between two nodes, we extend the definition of the delay in the previous equation to , with is the -th communication round during the training process, as:
where is weakly-connected edge, is strongly-connected edge; $ \tau_k(\mathcal{G}_m)$ is the cycle time at the -th computation round during the training process.
In general, using introduced equation, \textit{the delay of the next communication round is updated based on the delay of the previous rounds} and other factors, depending on the edge type connection.
2. Cycle time in multigraph
The cycle time per round is the time required to complete a communication round. In this work, we define the cycle time per round is the maximum delay between all silo pairs with strongly-connected edges. Therefore, the average cycle time of the entire training is:
where is an in-neighbors silo set of whose edges are strongly-connected.
3. Multigraph Construction
Algorithm 1 describes our methods to generate the multigraph with multiple edges between silos. The algorithm takes the overlay as input. Similar to~\cite{marfoq2020throughput}, we use the Christofides algorithm to obtain the overlay. In Algorithm 1, we establish multiple edges that indicate different statuses (strongly-connected or weakly-connected). To identify the total edges between a silo pair, we divide the delay by the smallest delay overall silo pairs, and compare it with the maximum number of edges parameter ( in our experiments). \textit{We assume that the silo pairs with longer delay will have more weakly-connected edges, hence potentially becoming isolated nodes}. Overall, we aim to increase the number of weakly-connected edges, which generate more isolated nodes to speed up the training process. Note that, from Algorithm 1, each silo pair in the multigraph should have one strongly-connected edge and multiple weakly-connected edges. The role of the strongly-connected edge is to make sure that two silos have a good connection in at least one communication round.
Next
In the next post, we will mention multigraph parsing proccess and how to train a multigraph under decentralized federated learning.