Extensible Telemetry Interface to Detect Unhealthy Vaults (Potential RFC Idea)

This post is intended to be sort of a ‘living document’ and should be modified as needed. Please provide any questions, thoughts, or suggestions you might have :slight_smile:

I had been browsing around earlier looking for info on profiling tools, and while some ideas are bouncing around, it seems there's not too much out there right now. I believe it would be beneficial to establish an interface for querying vault health, both to maximize overall network health and to provide the tooling needed for future in-depth debugging and profiling work.

As a possible solution, an extensible, telemetry-style interface for querying vaults could consolidate profiling efforts and offer an implementation for health checks, which have been proposed in the past.

This topic is for discussing an extensible telemetry implementation which allows vaults to remotely query another vault's health and/or supplementary information. In this discussion, the primary concern is the interface structure and an initial implementation allowing the network to accept/reject/evict vaults based on “health”. That said, the interface should be extensible to allow other supplementary information (e.g. performance info, structured-data logging, etc.) to be queried, which could pave the way for further optimizations and debugging tools down the line.

The existing solution is not integrated into the safe-vault library and is, by its own admission, mostly a stop-gap solution (see resource_proof below under Background and Related Information).

Goals

  • Introduce a unified, easily-extensible interface for querying vaults for information to determine their fitness level and potentially other supplementary performance or diagnostic information
  • Allow vaults to be compared to each other and to arbitrary standards based on health
  • Define one or several metrics that capture what it means for a network node to be “healthy” enough to participate in the network (separate from proof of resource; think CPU, bandwidth, etc.)
  • Define behavior for nodes in/attempting to join the network which are not “healthy.”

Non-Goals

  • Defining supplementary debug information to be supplied over the interface – these should be proposed separately if/when the proposed system is implemented, so that there is the appropriate focus to consider things like security/overhead of each additional query type
  • Defining further behaviors based on the health metric beyond vault rejection/acceptance/eviction. For the same reason as the above point, these should probably go in separate topics

Background and Related Information

  • resource_proof - A Rust crate which implements best-effort checking of node resources to provide “some indication that a machine has some capabilities”
  • previous discussion - A discussion that led to the creation of this topic and covers some possible use cases and background info
  • Github - Performance Checks Issue - A GitHub issue which alludes to a similar, unformalized version of the above, demonstrating interest as well as highlighting some advantages of an integrated solution over resource_proof

Current Ideas

Some of these are borrowed from existing documents, some I’ve come up with.

  • Define health as the “ability to securely store and deliver data” as per the Safe Network Health Metrics document, with the idea that this is not a binary value, but rather a continuously valued function (of bandwidth and CPU speed? see Open Questions below).
  • Telemetry would take the form of a new Rpc variant in safe-vault, handled as a routing message, with the response dispatched via an Action (see the sketch after this list). Alternatively, retooling the existing Request variant is possible, but that would require more sweeping changes, as the Request variant seems to refer specifically to GET requests on network resources/data like files
  • The RPC indicates what log page to fetch. The log page can be fetched from some existing memory or constructed on the fly.
  • Health checks are performed on entering the section and at random intervals to ensure continued health.
  • Only Elders can perform a health check of vaults in the same section under normal circumstances.
  • In internal testnets and the like (e.g. testnets where security is not paramount because Maidsafe owns all the vaults or the net is entirely local), the above can be relaxed so that any node can request a log page from any other node for the purposes of debugging and performance analysis.
  • Similarly to the above, in internal testnets we could expose conditionally-compiled log pages for insecure but useful debugging/profiling information. Naturally, vaults compiled in this testing mode aren't interoperable with normal vaults.
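
As a rough illustration of the RPC and log-page ideas above, here is a minimal sketch. Every name in it (LogPage, HealthReport, TelemetryRpc) is hypothetical and not part of the actual safe-vault or routing APIs; the real Rpc enum would be extended rather than duplicated like this.

```rust
/// Which telemetry "log page" is being requested (hypothetical type).
#[derive(Debug, Clone)]
pub enum LogPage {
    /// Basic fitness metrics used for accept/reject/evict decisions.
    Health,
    /// Room for future supplementary pages (perf counters, structured logs, ...).
    Supplementary(u32),
}

/// Self-reported metrics; whether and how these can be trusted is discussed
/// further down the thread.
#[derive(Debug, Clone)]
pub struct HealthReport {
    pub cpu_load_pct: f32,
    pub bandwidth_kbps: u64,
    pub spare_disk_mb: u64,
}

/// Sketch of the new variants, sitting alongside whatever already exists.
#[derive(Debug, Clone)]
pub enum TelemetryRpc {
    /// An elder asks a vault in its section for a particular log page.
    Request { page: LogPage },
    /// The vault's reply, built on the fly or from cached state.
    Response { page: LogPage, report: HealthReport },
}
```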

Potential Future Applications

Some of these aren’t necessarily secure, but are potentially useful in testnet scenarios and for debugging (implying conditional compilation)

  • Collecting structured-data logs in a circular buffer and delivering them on request
    • E.g. providing the last n messages this vault signed and the messages they were attached to
    • E.g. providing a list of the previous n sections this vault has been a member of
  • Collecting and aggregating plaintext logs from various vaults in a quick and easy fashion
  • Tracking the route of a message as it travels through the network

Open Questions

  • What are good indications of health? CPU and bandwidth come to mind (see the GitHub issue above), but I'm certain others exist. Potentially things like disk space (beyond the proof of resource required to be a vault in the first place)
  • How to measure health remotely and trust the response
    • Issue a challenge of some sort (e.g. hash these numbers for me) and give an arbitrary time limit to respond?
5 Likes

Hmm, I've been thinking about

How to measure health remotely and trust the response

which seems, at first brush, to be a bit challenging. I'm going to amend that above though, since I think it may be possible to remove the verifiable requirement entirely (I'll get into why in a moment). Or if you want to skip the wall of text/reasoning, just skip to the TL;DR section.

I don’t have a mathematically rigorous proof, but I think the logic is sound. Let me know what you think.

Reasoning

The reason is that either you have a test that is verifiable but not granular, or a test that is granular but not (easily) verifiable.

Example:

  • Case 1: Verifiable but not granular – Provide a vault with a million numbers and ask it to hash them
  • Case 2: Not verifiable but granular – Outright ask the vault for a resource measure. Let’s say the average CPU utilization, as a percentage, during the last hour of running the vault software. In real life, maybe some other auxiliary context would be needed like number of cores or clock speed, but this is enough for now.

Now let’s take a look at how each example case learns something about the vault and describe why it fits (or doesn’t fit) the criteria of verifiable and granular. Then we’ll draw conclusions about the pros and cons of each.

  • Case 1
    • Information Learned: We know something about the CPU’s ability to compute something in a given time. This means that the vault has some capabilities.
    • Verifiable? Yup, we just need to hash the numbers ourselves and compare
    • Granular? Not really… We cannot partition the time taken to process those hashes, especially in the case of failure. Was the node just busy doing other vault-related things (we can't expect a vault to drop everything and start hashing just because we want it to)? Was the vault's internet dodgy? Was the CPU too slow to compute the hashes?
  • Case 2
    • Information Learned: A quantitative and consistent estimate of how much strain the vault software puts on the CPU.
    • Verifiable? Not in the slightest. The vault could straight-up lie to us
    • Granular? Yes. Unlike in the previous case, we have a much better idea of what we're getting here

So with that out of the way, it seems we need to relax one of the requirements (either granularity or verifiability), because I can’t immediately conceive of a simple (read: “doesn’t take ‘much’ time or require consensus algorithms like PARSEC and friends”) way to achieve both simultaneously.

To make that decision, I looked at the other goals and how relaxing either of the two requirements could still allow us to achieve those. In particular, I looked at the following:

Allow vaults to be compared to each other and to arbitrary standards based on health

This point seems to be coupled tightly with the concept of granularity. In the existing resource_proof crate, we have a similar situation to case 1: we are verifying that the vault has some capabilities, but non-granular results are not easily compared or quantified. It seems that in order to maintain the promise of comparability, we would have to maintain granularity. Thus, the goal shifts to how to relax the verifiable requirement.

With the sort of luck that’s rarely present in engineering, I think that we can actually relax the requirement entirely without changing anything other than the promises that we make when querying a vault.

Let's imagine a scenario like case 2 above, and let's just assume vaults can lie.

I'll quickly mention the trivial case of a testnet. In the testnet scenario, where Maidsafe owns the vaults or the vaults are strictly local, this isn't an issue, and we can still reliably collect data for debugging and profiling because we are guaranteed there are no bad actors.

In the live network scenario, we look at what happens when a vault lies: what negative impacts occur at the network level, and how we can account for them.

  • Case a: vault has sufficient resources and claims fewer resources than it has
    • In this case, a vault would be evicted wrongly
    • Effect: There's not much harm in this in general, as the network is redundant and generally tolerant of churn
  • Case b: vault has insufficient resources and claims to have more than it does
    • Then the node would be given more responsibility than it could handle
    • Eventually the node would cease to reply properly or store chunks, and a large number of failed requests would cause the node's rank to decrease anyway, so it would be evicted
    • This may skew later profiling efforts to assess average section health, but that would require a number of outliers in the section claiming more resources than they have, and those liars would need to not yet have been evicted from the network

We established at the beginning that the health check should only be used to evict unfit vaults. Claiming more resources than you have eventually leads to eviction, and claiming fewer than you have leads to eviction as well. There is little benefit to lying about vault resources when asked, so there isn't much incentive (if any, afaik) to lie.

With this framework in mind, we don't really have to “account” for anything, since lying is naturally disincentivized; even in the presence of liars, they would eventually be filtered out by virtue of the fact that lying can only be detrimental to a node's reputation. A node having good health is neutral, but a node having poor health is negative. Vaults are only compared to each other for the purpose of profiling, not for gaining any benefit in terms of their place or responsibilities in the network.

TLDR

It’s difficult to come up with rigorous tests for health which are granular and verifiable. We can probably get away by relaxing the “verifiable” requirement though, which simplifies things massively.

There is little-to-no benefit to lying because the only thing the data is used for on a network level, beyond profiling, is evicting nodes. At best, a vault which lies about its resources sees a net zero gain (no incentive to lie), and, at worst, it gets itself prematurely evicted (a disincentive to lying).

In this way, we can actually have health checks simply ask naively for the node’s estimate of the resource, and we have reasonable guarantee that the node won’t lie because there’s not much reason to.

1 Like

How about using the health check to know when to bring in more resources rather than (or as well as) when to evict?

1 Like

There’s so much to say in this topic, thanks for starting it! I’ll try my best to just take small bites because otherwise it becomes a bit of a mental quagmire.

What are good indications of health?

How much spare capacity is available. This must be presented in the context of existing workloads, both the baseload and expected peaks. Health is not just about how the vault is currently performing; it's about resilience under stress, which is an uncommon mode of operation that we are trying to predict and account for. So I feel health is very complex from this perspective.

I feel like measuring how well something is currently performing is only part of the story. Not to say it’s pointless but the bigger picture of resilience from spare capacity is maybe more important.

It's like measuring everyone's biometrics vs measuring nuclear proliferation: we need to account for both our periodic individual health and our broader existential risks. They both count towards health.

Case 1 [hash a million numbers] Verifiable? Yup, we just need to hash the numbers ourselves and compare

I know you are only presenting an example, but to address a more general point (which I think you are well aware of), I feel like we must aim for efficiency here. A 1:1 elder:node ratio of work for validity checks is going to put too much strain on the elders. Better for the elder to do almost zero work (e.g. generate a random seed) and have the node show how much work it can do with it, producing an almost-zero-work verifiable output (e.g. elders can check that the seed and response hash together to give a result over a certain threshold, which only needs a single hash of work to verify).
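
To make that shape concrete, here's a tiny sketch of a seed-plus-nonce check where the node does the grinding and the elder verifies with a single hash. It uses std's DefaultHasher purely so the example is self-contained; a real version would need a proper cryptographic hash, and the budget and threshold numbers are arbitrary assumptions.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

fn hash_pair(seed: u64, nonce: u64) -> u64 {
    let mut h = DefaultHasher::new();
    (seed, nonce).hash(&mut h);
    h.finish()
}

/// Node side: spend a bounded amount of effort and report the best nonce found.
fn best_effort(seed: u64, budget: u64) -> u64 {
    (0..budget)
        .max_by_key(|&nonce| hash_pair(seed, nonce))
        .expect("budget must be non-zero")
}

/// Elder side: a single hash checks whether the claimed effort clears the bar.
fn verify(seed: u64, nonce: u64, threshold: u64) -> bool {
    hash_pair(seed, nonce) >= threshold
}

fn main() {
    let seed = 42u64; // freshly generated at random by the elder in practice
    let nonce = best_effort(seed, 1 << 16);
    let threshold = u64::MAX - (u64::MAX >> 12); // arbitrary difficulty bar
    println!("clears the bar: {}", verify(seed, nonce, threshold));
}
```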

To maintain granularity, why not use a granular activity such as responding to GET? Just throwing out an idea here (maybe it's a bad one): every response from a node includes a short proof of work which returns the smallest result they find between receiving the request and sending the response. The balancing act here is to not take too long to respond with a great proof, but not respond too quickly with a bad proof either. The node aims to respond with the smallest (hash value * response time): take too long and a smaller hash becomes worthless; respond quickly with a large hash value and it negates the response time. This is too exposed to luck and latency effects, and I think it needs to be based on median efforts rather than best efforts, but hopefully this shows some way to achieve granularity through frequency. Mainly I worry that it's unnecessarily inefficient.
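
A rough sketch of that scoring rule, with made-up struct and field names; as noted above, a real version should probably aggregate medians of recent scores per node rather than single best efforts.

```rust
/// Proof a node attaches to a GET response (illustrative fields only).
struct GetProof {
    smallest_hash: u64,    // best (smallest) hash found while serving the request
    response_time_ms: u64, // time between receiving the request and replying
}

/// Lower is better: a weak proof and a slow response both inflate the score,
/// so rushing back a poor proof or stalling for a great one are both penalised.
fn score(proof: &GetProof) -> u128 {
    proof.smallest_hash as u128 * proof.response_time_ms as u128
}

/// Aggregate over recent responses to dampen luck and latency effects.
fn median_score(mut scores: Vec<u128>) -> Option<u128> {
    scores.sort_unstable();
    scores.get(scores.len() / 2).copied()
}
```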

I really like the idea of rewardable cache to achieve granular measurements. It has the side effect of vaults indicating to the network how much spare space they have. It’s very granular and zero cost.

Case a: vault has sufficient resources and claims fewer resources than it has … a vault would be evicted wrongly

we don’t have to really “account” for anything since the behavior of lying is naturally disincentivized

There is no little-to-no benefit to lying because the only thing that the data is used for on a network level, beyond profiling, is evicting nodes.

One missing effect here is that more new vaults would be joining the network which poses a threat to consensus. Mainly I’m trying to point out that eviction is not the only consideration. Being able to affect the new vault join rules is a potentially powerful lever to be able to pull. We don’t want anyone to be able to flood the network with new vaults. I am personally not convinced of this threat and am in favor of very open membership rules, but there’s a substantial history about it in analysing the google attack. Health depends on the distribution of nodes, who owns them, what are their motives and preferences, and what is the distribution of node age in each section.

Accurately knowing the health lets the network know how much reward is needed to stimulate participation, how much it should cost to store data, when nodes should be allowed to join the network, the correlation between increased load and increased failures. Lying vaults impact the accuracy of the control mechanism for these particular things. A control mechanism depends on the quality of inputs. It can survive messy inputs for sure, that’s the point of it, but where are the boundaries and when does it become no longer useful or move into dangerous actions from the control mechanism? eg we wouldn’t want the reward mechanism gamed because of vaults lying about their health. If health only affected eviction then sure, lying is fine, but eviction has secondary effects on these other aspects.

A node having good health is neutral, but a node having poor health is negative.

I feel like this misses the distinction about handling periods of high stress. A very healthy node is good (not neutral) because it allows a larger range of potential future workloads than a merely neutral or negative health node. Health is unfortunately tied to the uncertainty of future workloads, not just the past and current node metrics.

We can probably get away by relaxing the “verifiable” requirement though

Yes I think this is a good path to explore.

2 Likes

I appreciate the input. You bring up some very good points. It's a lot to think through, you're right, but I'll revisit some of my earlier statements (although maybe not all in one post, to keep things separate thematically and so I can further think over some of the info).

Perhaps this is the root of the issue with my original assumption. I think it was playing around the edge of my mind, and I tried to dodge it by stipulating that eviction should be the network's response to poor health (although even that has issues, as you pointed out, in that it allows nodes to influence the network's acceptance of new vaults).

In the end, allowing the network to take any action whatsoever in response to another node's health provides a lever for any individual node to control what the network does, at least to some degree. It reminds me of a post from @dirvine in your proposal for fast ephemeral routing.

It is about many different components that add privacy or security. No one thing provides all of either principle.

In some sense, this is another facet of that statement. In general, the problem is then what levers we can expose (and how we can protect those levers) so that the network isn't “easily” gameable (what constitutes “easily” is obviously a bit of a loaded term).

In terms of what levers are available if we open up the full range of actions, I think you put it rather succinctly here:

Accurately knowing the health lets the network know how much reward is needed to stimulate participation, how much it should cost to store data, when nodes should be allowed to join the network, the correlation between increased load and increased failures.

So the question then simplifies further to how we can protect these levers “enough” that we're satisfied they likely won't be exploited. At this point, I doubt the existence of a solution which can't be exploited to some degree. I haven't finished reading through the thread linked above, but my impression is that attacks like the Google attack are prohibitively difficult to guard against completely, and there is such a thing as trying to be too secure (in the sense that functionality shouldn't suffer too much for security).

Protecting the levers in this case is tied to protecting our measurement of health (which it may be beneficial to break down into long- and short-term, as you mentioned, the assumption being that we can derive some notion of future needs from both current and past needs).

I really like the rewardable caches concept you linked! I wasn't aware of it, but at first brush it seems to offer a lot of benefits, like a route to assess spare space on the network and compare between vaults.

I’ve got a few more thoughts to expand on this, but for now I’ll save that for a later post as I’m out of time.

1 Like

I tend to think by speaking/writing and then refine my ideas later through some combination of bouncing them off other people and revisiting them on my own, so I hope it's not too hectic. I'll try to use a “summary” section at the bottom of each post to pull my thoughts together.

I’ll also have to go up to my original post and edit it. I wonder if posts have a version history? If a mod could make it a wiki that might be helpful too (I’m not entirely familiar with it, but I think it means anybody can edit the OP?).

Long Term/Section Health

So on a section level, if I am to (hopefully accurately) summarize, it seems the major concern of knowing the health boils down to adjusting network resources to meet demand appropriately. This is primarily a function of current supply, current demand, and estimated future demand.

As a first attempt, maybe we could express this symbolically:

Symbol | Meaning | Comment
R | Resource Differential | Estimated resource adjustment for the section
Sc | Current Supply | Resources currently available
Dc | Current Demand | Current demand for resources
Df | Future Demand | (Estimated) demand for future resources
μ | Future Uncertainty Multiplier | 0 < μ < 1
f | Fail Rate | 0 < f < 1; encompasses how successful we have been at meeting the supply/demand gap

Attempt:
R = ((1 + f) * Dc - Sc) + μ * (Df - Sc)


This could be, in some sense, the gradient of resources, which we want to drive to zero.
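
Just to make the knobs explicit, here is a literal transcription of the attempt above; the struct and field names are illustrative only, and all inputs are assumed to be expressed in whatever common resource unit the section eventually settles on (see More Questions below).

```rust
/// Inputs to the resource-differential estimate (illustrative names).
struct SectionEstimate {
    supply_current: f64,     // Sc: resources currently available
    demand_current: f64,     // Dc: current demand for resources
    demand_future: f64,      // Df: estimated future demand
    future_uncertainty: f64, // mu, 0 < mu < 1
    fail_rate: f64,          // f, 0 < f < 1
}

/// R = ((1 + f) * Dc - Sc) + mu * (Df - Sc)
fn resource_differential(s: &SectionEstimate) -> f64 {
    ((1.0 + s.fail_rate) * s.demand_current - s.supply_current)
        + s.future_uncertainty * (s.demand_future - s.supply_current)
}
```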

  • Characteristics
    • Supply is inversely related to resource gradient (high supply means low/negative requirements)
    • Demand now is directly related to requirements (high demand means higher requirements)
    • As fail rate increases, resource needs strictly increase. This makes sense, as a higher fail rate means we may have less ability to meet needs, which effectively makes the resource requirement higher for a given demand level.
      • This might happen when nodes are perhaps not healthy enough to service our requests (e.g. not enough processing power to churn through requests at a reasonable rate). This incidentally serves as an indirect measure of section health, which can be seen as an emergent property of individual vault health.
    • High future demand influences resource requirement, but to a lesser extent (expressed via μ)

The knobs we have to turn then are what was mentioned above.

Tool | Effect
Reward Up | Supply up
Reward Down | Supply down
Evict vaults | Supply down (less drastic than above)
Storage Cost Up | Demand down
Storage Cost Down | Demand up

Perhaps it's worth noting that demand is a two-part metric, comprising demand for storage (which we can influence via storage cost) and demand in the form of client GET requests (which we can't influence directly, afaik).

This hinges on something like rewardable caching, which would give the system a good finger on the pulse of current supply in a section via how deeply vaults are caching on average. Current demand is a little simpler, because we can keep track of the number of requests serviced by the section. Similarly, fail rate is measurable via the fraction of those requests which were not successfully serviced (timed out, etc.). With these in hand, we can derive a future demand metric heuristically, so the first part (determining R) seems feasible.
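
For the demand and fail-rate inputs specifically, a section could get away with simple per-period request counters along these lines (names and units are illustrative only):

```rust
/// Per-section counters for requests handled over the current period.
#[derive(Default)]
struct RequestStats {
    serviced: u64, // requests answered successfully and in time
    failed: u64,   // requests that timed out, returned bad data, etc.
}

impl RequestStats {
    /// Dc: total requests seen this period, in "request units".
    fn demand_current(&self) -> f64 {
        (self.serviced + self.failed) as f64
    }

    /// f: fraction of requests the section failed to service properly.
    fn fail_rate(&self) -> f64 {
        let total = self.serviced + self.failed;
        if total == 0 {
            0.0
        } else {
            self.failed as f64 / total as f64
        }
    }
}
```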

Driving that resource gradient to zero is a little more tricky, and probably would require constant tuning/tweaking, especially as we figure out better ways to estimate future demand for resources.

Figuring out which knobs to tune is a bit of an open question since all of these knobs can be turned alternatively or in combination. There is an RFC (section on establishing store cost) right now, but it seems to have a less granular approach to turning these knobs as it considers good nodes/full nodes as opposed to overall demand/supply for resources.

Short-term Health

As a short aside regarding short-term health, it's entirely possible to continue to evict/replace/supplement nodes which can't pass a periodic health check, as described in my initial thoughts about using CPU, bandwidth, etc. This would take a supplementary role to the fail rate, which expresses a section's overall ability to service requests. It would be better suited for detecting individual points of failure and bottlenecks: nodes which are so slow that their continued existence is more detrimental than simply pulling in resources to supplement them (which would be done as part of section-level health).

Summary

Perhaps I’ll posit that the primary reason that health is interesting on a section/long-term level is so a section can manage its resources (incentivize node joining/leaving or consolidating them manually via ejection). The main reason health is important on an individual node level is to make sure that a node can meet some bare minimum requirement (and be ejected if not) and to pull in resources if slow nodes prohibit network function (e.g. unnecessary slowdowns & failed/slow GET requests).

We would use principles of supply and demand, in combination with the current estimate of node health, to adjust section-level resources, instead of the current naïve scheme of using full and not-full nodes to turn knobs like store cost and farming reward in order to balance network resources.

More Questions (as if there weren’t enough already :slight_smile: )

If this sounds reasonable, the next step would be standardizing the units of supply and demand (i.e. transforming them into a common unit, because something like average cache depth indicates supply but is not directly comparable, out of the gate, with a measure of demand based on the number of requests).

Then we'd figure out exactly how much turning each of our tunable knobs contributes to driving resource requirements to zero, and in what combination, depending on the current level. Maybe the contribution changes based on some thresholding model; e.g. maybe farming reward is a large driver of supply (and thus resource requirements), so we use it heavily when resource requirements are high and need larger swings. Alternatively, some sort of coordinate ascent could just turn each knob sequentially while holding the others constant, and maybe we could be happy enough with that (a rough sketch of that follows below).
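
A toy sketch of that coordinate-style approach, nudging one knob per control period and re-measuring before touching the next; the knob set, sign conventions, and step size are all assumptions for illustration, not a worked-out control scheme.

```rust
/// The two knobs from the table above that we can turn directly (illustrative).
enum Knob {
    FarmingReward,
    StoreCost,
}

/// Pick one knob per control period and nudge it in the direction that should
/// shrink |R|: positive R (resources needed) means raising the reward to attract
/// supply or raising the store cost to damp demand; negative R reverses both.
fn pick_adjustment(period: u64, resource_differential: f64, step: f64) -> (Knob, f64) {
    let knob = if period % 2 == 0 {
        Knob::FarmingReward
    } else {
        Knob::StoreCost
    };
    let delta = if resource_differential > 0.0 { step } else { -step };
    (knob, delta)
}
```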

2 Likes