Just ran a quick test and found 16 nodes offline (none of them proxy nodes). I remember @StephenC confirming yesterday that all nodes were fine. Running an invariant check right now to confirm the network state from a routing pov.
I was able to log in to my account too. However, a section or so might still have lost data if enough nodes went offline together ofc (if only parsec was there for alpha-2). So we could still be having partial data loss even if the network recovered (that is, if the invariant check passes).
D 18-08-22 17:16:16.792881 [routing::states::node node.rs:3761] Node(395310..(0011)) Lost all routing connections.
D 18-08-22 17:16:16.792989 [routing::state_machine state_machine.rs:418] State::Node(395310..(0011)) Terminating state machine
Gonna take some time to get some details unfortunately. Will keep this thread updated ofc.
Had a poke about and there seems to have been some networking weirdness with the droplets for sure (not confirmed what yet though). That seems to have triggered some node loss, leading to large-scale data replication, which triggered further throttle control from DO itself … leading to more node loss.
Could get the routing network invariant restored, but not sure if all data is back available yet (some options exist to try and recover that too). Would wanna be sure about what caused this on the DO side in the first place, so that it doesn’t just repeat itself. Hopefully should have some more updates tomorrow.
@Viv I’m getting the same error again with code that was working on Sunday:
Core error: Routing client error -> Requested data not found
I suspect this is my code because whm seems to work fine, but I am a bit baffled as to how I could be messing up such a simple thing (even after reverting to code that was working), so I just want to check that there are no known problems with the network again?
After seeing this on my test account I created another (with no data uploaded) but see the same error consistently on both accounts. I’ve also rebooted my machine in between and I get this error every time in my app, but never when I list the folder in whm.
@happybeing, have you tried rolling back your changes to the point where it was working before? I see you say they were unrelated changes, so this would just be to confirm that. I see this is all part of a larger flow in your lib, so can you provide some simplified sample code which can be used to reproduce the issue?
Yes I’ve rolled back the changes to Sunday in both repos (see tag wip-getcontainerbug-goodcode) even though none of them seem able to affect this area. This wasn’t happening as late as yesterday evening, so I’m as sure as I can be that it isn’t due to my changes to the code. This is why I wondered if it was problems with the network again - especially as the symptom was identical to the first time.
Also, I haven’t touched the auth code at all in months. I was making superficial changes to safe-containers.js in safenetworkjs.
I put that test getContainer('_public') call into bin.js to check before it gets to any of the code I’ve been working on.
Don’t you think the error itself is infeasible? How can a request to access _public result in that error? I really don’t know how to debug this.
The error looks suspicious to me which is why I’ve asked for help. I could create a simplified example, but I expect it will work - how could it not? So I think whatever the issue is that I’m triggering, it is worth finding out what is actually happening to cause that particular error message, because I think it is either a bug in the API, or reporting the wrong error, regardless of how I’m causing it to happen (if indeed I am - as I say, reverting to the earlier code hasn’t changed this).
I don’t know how to debug that so if you can’t help it will delay me while I learn how. If you can’t help debug, can you maybe try reproducing it?
Thanks. FYI I just tried cloning both repos and checking out the development branch (latest commit) to make sure I hadn’t accidentally corrupted anything (e.g. in the node_modules) but the problem is still there.
@happybeing, I was able to reproduce it. You are getting an authorisation URI from Peruse against the live (alpha2) network, since you are running it with --live, but you are then actually trying to connect to the mock network, since you are running safenetwork-fuse with NODE_ENV=dev in this command: DEBUG=safe-fuse*,safenetworkjs* NODE_ENV=dev node --inspect-brk bin.js
I just tried removing NODE_ENV=dev from that command and the error is gone. After that I sometimes get the Utf8("invalid utf-8: corrupt contents") error, but that’s been solved in safe-app-nodejs v0.9.1 (you’d need to upgrade). After retrying I don’t see any other error, but I’m not sure what happens then, as it seems the debugger loses the connection.
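The mismatch described above could in principle be caught at startup with a small guard. This is only a sketch: `checkNetworkMismatch` is a hypothetical helper, not part of safe-app-nodejs or safenetwork-fuse, and it assumes that NODE_ENV=dev is what selects the mock network (as in this thread) and that the app knows which network its auth URI was obtained from.

```javascript
// Hypothetical helper, not an existing API in any SAFE library.
// Assumption: NODE_ENV=dev selects the mock network, anything else means live.
// `env` is a process.env-like object; `authSource` is 'live' or 'mock',
// i.e. the network the authorisation URI was obtained from.
function checkNetworkMismatch(env, authSource) {
  const clientNetwork = env.NODE_ENV === 'dev' ? 'mock' : 'live';
  if (clientNetwork !== authSource) {
    return `Network mismatch: auth URI came from the ${authSource} network ` +
      `but NODE_ENV=${env.NODE_ENV || ''} selects the ${clientNetwork} network`;
  }
  return null; // no mismatch detected
}

// Example: auth URI obtained from Peruse started with --live,
// while the app itself is launched with NODE_ENV=dev.
const warning = checkNetworkMismatch({ NODE_ENV: 'dev' }, 'live');
if (warning) console.warn(warning);
```

Calling something like this near the top of bin.js, before the first getContainer('_public') call, would flag the problem before the misleading "Requested data not found" error appears.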
Ah you are a life saver Gabriel, thanks so much. I’ve been wracking my brain trying to think what could have changed. I can see how that happened: usually I grab that command from the shell history but after a reboot I will have copied and pasted it from my notes.
Phew and yipeee! I can get back to work. Thanks again.
Sometimes you need a second pair of eyes - it might have taken me days to spot this. Which made me think.
I could certainly do this again, so will make notes to help me spot it, and others developing standalone (desktop) apps will likely do this too. So I wonder if it would be worth detecting and flagging it - could the auth URI include a flag (mock) that can be checked in the client library when it is used?
I’m guessing we haven’t seen this before because NODE_ENV was the only way to select mock/live, but with CLI also an option it will be easy for desktop app devs to do this.
What do you think - worth me creating a feature request?
It could be done, as you said, if the safe_authenticator lib can encode a bit/flag in the URI when it detects it’s working with mock. Then the safe_app lib can decode it and throw a specific error if the flag doesn’t match what it expects (mock when live was expected, or the other way round). This would need to be done in the safe_client_libs, so we’d need @nbaksalyar’s point of view?
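To make the idea concrete, here is a toy sketch of encoding and checking such a flag. Everything in it is an assumption for illustration: the URI layout, the one-character flag and the function names `encodeAuthUri` / `decodeAuthUri` are made up, and the real auth URI format and any eventual implementation in safe_client_libs would look different.

```javascript
// Toy illustration only: not the real SAFE auth URI format.
const MOCK_FLAG = 'm';
const LIVE_FLAG = 'l';

// Authenticator side (hypothetical): prefix the payload with a network flag.
function encodeAuthUri(appId, payload, isMock) {
  return `safe-${appId}:${isMock ? MOCK_FLAG : LIVE_FLAG}${payload}`;
}

// App side (hypothetical): decode the flag and fail fast on a mismatch.
// `expectMock` says which network this client is configured to use.
function decodeAuthUri(uri, expectMock) {
  const sep = uri.indexOf(':');
  if (sep === -1) throw new Error('Malformed auth URI');
  const flag = uri[sep + 1];
  if (flag !== MOCK_FLAG && flag !== LIVE_FLAG) {
    throw new Error('Missing network flag');
  }
  const isMock = flag === MOCK_FLAG;
  if (isMock !== expectMock) {
    throw new Error(
      `Auth URI is for the ${isMock ? 'mock' : 'live'} network, ` +
      `but this app is configured for ${expectMock ? 'mock' : 'live'}`
    );
  }
  return uri.slice(sep + 2); // payload without the flag
}

// Example: URI issued against live, app configured for mock.
try {
  decodeAuthUri(encodeAuthUri('myapp', 'abc123', false), true);
} catch (err) {
  console.error(err.message);
}
```

The point of failing at decode time is that the developer gets an error naming the actual cause, rather than a data-not-found error later on.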