Proposal for Nano Node Network Optimizations – Srayman
The primary mechanism for distributing blocks and votes across the Nano network utilizes a gossip protocol where each node sends a block or vote to a subset of other nodes. These nodes then also distribute the same message again until all nodes have received the message. This protocol is what enables blocks and votes to propagate extremely quickly to all nodes and to reach quorum in less than one second. The confirmation per second rate can be modeled as a function of the number of total nodes, number of principal representatives (PR), hardware and bandwidth as well as the node software.
After seeing test results in V19.0 and V20.0 Beta and also analyzing vote data on the live network I believe there are a few areas that could be optimized in the node software by differentiating nodes hosting a PR from other nodes. These optimizations could lead to a 80-90% reduction in vote traffic sent by a PR node, faster confirmation times and higher bandwidth requirements for nodes attempting to spam the network.
Utilizing the vote websocket it appears that votes are duplicated quite a few times as a consequence of using the gossip protocol. The same is true for blocks, during V20 testing when the nodes were not saturated during 200,000 block tests it was consistent that there were 1.2M publish in/out messages for blocks which matches the expected count based on the sqrt(peers) gossip fanout being 6. On the live network with ~325 nodes and 85–90 online principal representatives (as of September 2019) vote count distribution looks like the the graph below.
Note: This does not include replay votes, duplicate votes seen while the block confirmation is still active, which would increase the duplicate count even further. Stats on V20 live network show replay votes are about 22% of the total vote count, however in reality replay votes are closer to 88% of the total vote traffic, due to elections being removed immediately after confirmation. Votes received after the block is confirmed are not being counted properly as replay votes. This does not impact overall network traffic, but is an interesting observation.
On the beta network with fewer nodes the duplicates are less pronounced but shows a trend towards more duplication as the number of nodes increases. I tested vote counts by running additional nodes and repeating the test for 40 blocks in a single account send/receive. The following charts are the base level of 20 peers, then adding 10, 20 and 40 additional nodes to the network. You can see a trend with the duplicates drifting towards the example on the live network.
Proposal #1 — Increase Initial Fanout and Reduce Rebroadcasting Blocks
This recommendation is to increase the initial block fanout only from the node generating blocks (sending/receiving). On initial block broadcast the node should send to all PR nodes, or a random subset of PR nodes that is sufficiently high so that blocks are propagated to PR nodes quickly. At the same time the rebroadcasting of blocks should be reduced/optimized to reduce bandwidth requirements for other nodes that are rebroadcasting blocks.
Advantages:
- Gets the block to the voting nodes as fast as possible in a more ordered fashion than the current random sqrt(peers), when blocks are distributed to voting nodes in order they can process them more efficiently.
- Reduces overall network traffic since this is more optimized rather than creating as many duplicate block messages as described in the vote analysis above.
- Limits potential for Sybil nodes to censor blocks as they are sent to PR nodes directly.
- Places the burden of sending/receiving bandwidth on the node doing the transmission. While not the primary reason, I see this as an added benefit to limit or disincentivize spammers as they would now be responsible for a larger portion of the bandwidth requirements to broadcast the blocks across the network.
Disadvantages:
- Requires more bandwidth on the sending/receiving node.
- If rebroadcasting is not optimized, results in higher overall bandwidth on the network.
Proposal #2 — Nano Node Network Structure Changes
The current network structure for live network traffic is what I would refer to as a hybrid mesh network. In a fully mesh network all nodes would communicate with every other node, but this is optimized more in the Nano node software to be limited to sqrt(peers) to reduce network traffic while still maintaining redundancy.
Advantage: Redundancy and quick transmission to all nodes
Disadvantage: Unwanted duplicate network traffic
The PR nodes are the backbone of the network. Only their votes are rebroadcast by other nodes and non voting (Non-PR) nodes votes have very little impact on quorum. Therefore the PR nodes should be prioritized for both blocks and voting efficiency.
The most efficient form of network topology is a ring network where each node is connected to another node in a ring. However this is also the most prone to data loss if a node is disconnected in the ring. This proposal is to use a ring-mesh-star hybrid topology with some adjustments to how blocks and votes are initially broadcast and rebroadcast. This assumes the changes in Proposal #1 are implemented for initial block broadcast.
The mesh portion would be a combination of how PR’s communicate votes with each other and how blocks are initially broadcast from the sending/receiving node. For example, a publishing node would distribute blocks to all PR nodes. Each PR would broadcast it’s vote for a block to every other PR node and a subset of Non-PR nodes. PR nodes would no longer rebroadcast other PR votes, however Non-PR nodes would continue to republish votes but at a reduced fanout compared to what it is now that would only be targeted at other Non-PR nodes excluding all PR nodes since they should already have all votes.
The ring portion of the network would be comprised of the PR nodes and would be structured in a dual ring topology where network traffic would flow in both directions between the PR nodes. This would ensure redundancy if a link is broken as well as ensure that every PR node eventually receives a block. While the initial fanout from a node would attempt to hit all PR nodes, they may not all be in the nodes peer list or other network issues may arise that cause them not to get the initial broadcast. So each PR should broadcast a block to its neighbors in either direction on the ring. This could either be structured by ip address/port (which could in theory help with latency) or by node_id where a PR node sends to both its neighbors in either direction. There needs to be a mechanism to stop the loop so that blocks are not repeatedly broadcast around the loop, this should be handled by only republishing a block for one round when it’s first received and ignoring duplicates.
The star/hybrid portion would be each PR node connected to a set of Non-PR nodes with some overlap for redundancy. Since PR nodes would be prioritized with the above changes, network traffic between PR nodes could be reduced by not having the sqrt(peers) fanout be targeted at PR nodes but only Non-PR nodes. When Non-PR nodes receive a block or vote they could rebroadcast to other Non-PR nodes but exclude PR nodes to limit the number of duplicate packets sent to PR nodes.
Mathematical Models
Block Fanout: The current fanout for block and vote messages is sqrt(peers). Block publish in/out count can be modeled as the following where block count = b.
b(x) = b * sqrt(peers)
Vote Fanout: Let’s assume a situation where saturation has not occurred, which reduces overall votes via confirmation height etc. Each PR node produces votes for every block and they are packaged in batches, up to a max of 12 votes at a time. Therefore during peak transaction volume just before saturation, vote in/out count can be modeled as the following where block count = b, PR count = a.
f(x) = (b * a * sqrt(peers)) / 12
PR to all Nodes— No Fanout: Another scenario to consider is the ratio of PR nodes to Non-PR nodes. It’s apparent from the function above that increasing PR counts significantly increases total vote messages, so one noteworthy analysis point is to consider each PR node sending their votes to every node on the network as opposed to the current fanout methodology. This would eliminate any rebroadcasting of votes, but long term would put more strain on PR nodes. The votes generated by a single PR node and sent to all nodes on the network could then be modeled as the following.
p(x) = (b * peers) / 12
The graph below shows this convergence with 60 PR nodes. The dotted purple line is p(x) above and the solid green line is f(x) above. Before reaching 3600 total nodes it’s actually more efficient to have PR nodes only send to every other node and not do any rebroadcasting of votes. After 3600 nodes it becomes more efficient long term to use a gossip style fanout as described with f(x). This convergence can easily be computed as the following where PR count = a.
c(x) = a²
There are several considerations that need to be balanced. Obviously in the long term, PR’s sending their votes directly to other nodes is not scalable as the node count increases, however the current fanout is also not as scalable as it could be. There are also 2 types of nodes, PR and Non-PR, that serve different functions. PR nodes also get votes requested directly from other nodes bootstrapping, or a node that missed votes for some reason that needs to reach quorum that all add to the load on PR nodes. PR nodes also must perform extra processing to generate the votes in the first place, so priority #1 should be to reduce PR load and priority #2 should be to reduce Non-PR load.
Proposals: The proposals described earlier provide changes that could both reduce PR and Non-PR vote messages with PR vote messages reduced by 80-90% compared to the current fanout and Non-PR reduced by 50%+. The model for PR nodes becomes the following where block count = b and PR count = a.
d(x) = (b * (a + sqrt(peers-a))) / 12
This could also be modified to reduce load on other nodes, instead of Non-PR nodes performing a regular fanout they could use a reduced fanout if the PR distribution to Non-PR nodes is doubled or tripled. That would make the PR model become the following. A gossip protocol simulator shows this can achieve distribution in as many hops as regular fanout but reduces the load on Non-PR’s by 50%.
d(x) = (b * (a + 2*sqrt(peers-a))) / 12
For Non-PR nodes the fanout would be reduced in half instead of the vote fanout f(x) above it would be the following where block count = b and PR count = a.
d2(x) = (b * a * 0.5 * sqrt(peers-a)) / 12
Example Data Flow
Let’s assume the network has 300 nodes and 60 PR nodes, this example describes the block and vote propagation for 1 block. There are backup processes, for example bootstrapping, if a node misses a block somehow and requesting votes directly from PR nodes if a vote is missed, but these changes do not impact those processes at all and focus only on live network traffic.
Current V20.0 Data Flow
- Node A publishes a block to sqrt(peers) other nodes.
- All PR and Non-PR nodes that receive the block process it and republish to sqrt(peers) other nodes.
- All PR nodes generate a vote for the block and publish the vote to sqrt(peers) other nodes.
- All PR and Non-PR nodes that see a vote process it and republish the vote to sqrt(peers) other nodes. The PR factor is multiplicative in this case where each new PR is multiplied by the total messages rebroadcast, compared to being additive after the proposed changes for PR nodes.
Proposed Data Flow
- Node A publishes a block to all PR nodes. (eg 60)
- All PR nodes process the block and send to their PR neighbors in either direction and to sqrt(peers-PR) eg. sqrt(300–60) or 16 Non-PR nodes.
- All PR nodes generate their vote and publish it to every other PR node and to sqrt(peers-PR) eg. sqrt(300–60) or 16 Non-PR nodes. (Key takeaway, PR nodes do not republish other PR votes)
- Non-PR nodes receive the block, process it and republish to sqrt(peers-PR) or 16 Non-PR nodes.
- Non-PR nodes receive the votes, process them and republish to sqrt(peers-PR) or 16 Non-PR nodes.
Step 3 could be changed to have PR’s publish to more than 16 nodes, for example double (32) and have steps 4–5 be reduced to broadcast to half sqrt or 8 nodes instead. Or if a DHT is setup for Non-PR nodes and PR’s are given a subset of nodes to publish their messages too then the need for Non-PR nodes to republish as much could be significantly reduced. See the following gossip simulator to see the number of hops it takes to disseminate the messages to all nodes with increased original fanout and reduced republishing. https://nano-faucet.org/simulator/
The examples below assume 300 Nodes, 60 PR Nodes and 200,000 blocks published by a single node. Vote messages are calculated assuming the max 12 hashes per vote which is reached at about 60 confirmations per second. (eg 200,000 blocks * 60 PR’s * 18 fanout / 12 hashes).
Note: Non-PR messages could be reduced significantly further if a more optimized minimum spanning tree or Kademlia style DHT were implemented on top of the proposals here to optimize rebroadcasting of blocks and votes.
Published at Sat, 04 Jan 2020 16:34:26 +0000
{flickr|100|campaign}
