The "red banner" day (post-mortem)
I'm sure many of you noticed the red banner on November 13th, which said that we were having a really, really bad day with the Bitcoin Cash REST endpoint at rest.bitcoin.com.
Don't get us wrong - we're very grateful to bitcoin.com and Roger Ver for providing this node for free - it's a great service to the Bitcoin Cash community! We understand that you can't demand 99.999% reliability from something you use for free.
But the prolonged downtime did happen, and because payments were failing, we spent the previous two days frantically exploring options for a backup in case it happens again.
If you had an account - that's great - that means that in the worst case your payment didn't go through and stayed in your wallet (bad news for the author, though).
The situation was much worse for the "One time QR code" payments. Some people lost some money forever. As far as we know it's about $2 in total, but still it's their money.
In the one case that was reported to us, we sent the author our own BCH to replace the $1.50 that the upvoter had sent and that was lost forever.
Here's the problem.
When you upvote something on read.cash - this needs to happen:
Often it's simpler:
But still it's not something that we can do with a single QR code. We could do it with BitPay invoices, I think, but not all devices and software support those.
In order to achieve this, we create an ephemeral (one-time) Bitcoin Cash wallet right in your browser.
Your browser controls it.
We don't know its private key, its mnemonic, or anything else (that's why there's now at least $1.50 in BCH that is "lost forever" on the blockchain).
So, usually this happens:
You send a transaction to the wallet in your browser. Your browser then crafts a really cool "fan out" transaction to pay everyone involved (mostly the author, though).
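To make the "fan out" idea concrete, here is a minimal sketch of how the output amounts of such a transaction could be computed. It is not the read.cash code: the function name, the recipient labels, the percentage shares and the flat fee are all assumptions made for illustration.

```python
def fan_out_outputs(received_sats, shares, fee_sats):
    """Split the received amount (minus the fee) between recipients.

    `shares` maps a recipient label to a relative weight. The labels and
    weights used below are hypothetical, not read.cash's real split.
    """
    spendable = received_sats - fee_sats
    if spendable <= 0:
        raise ValueError("received amount does not cover the fee")

    total_weight = sum(shares.values())
    # Integer division keeps every output a whole number of satoshis.
    outputs = {who: spendable * weight // total_weight for who, weight in shares.items()}

    # Whatever is lost to rounding goes to the first recipient (assumed to be the author).
    remainder = spendable - sum(outputs.values())
    first_recipient = next(iter(shares))
    outputs[first_recipient] += remainder
    return outputs


outputs = fan_out_outputs(100_000, {"author": 90, "platform": 10}, fee_sats=500)
```

Working in integer satoshis rather than floating-point BCH avoids rounding drift, and sending the remainder to one recipient guarantees that the outputs plus the fee add up exactly to what was received.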
Now, imagine a situation like yesterday's. The Upvoter has a connection to a working BCH node, but their browser doesn't, because it tries to access the node that is failing.
This happens:
The problem is that this first arrow is really one-way.
Your browser can't use the same node that you used to send the transaction to it.
Your browser can't find out where the money came from in order to send it back.
Even if it could - it would still not be able to send it back.
What we tried
The first and simplest thing that we did was to ensure that the node is working before showing the QR code.
That didn't help much, since the node was "flashing" - it would seem like it was working, then it was down again the next second, when it needed to send "Transaction 2".
Our guess is that there is a load balancer and some of the nodes behind it were down, so the connectivity appeared randomly affected.
The private key for the one-time transaction was generated in the browser, so once the browser was closed after the failed attempt, the key was gone forever with the browser's garbage collector (a real thing).
Let's run our own full node!
First of all, we know that running a node is not easy. It's not like dusting off a Raspberry Pi and running a few commands.
It requires constant attention, especially during upgrades (and one is coming today!). But it seemed like the only way. We needed a working copy of rest.bitcoin.com... and fast...
Regular full nodes would take weeks (days?) to synchronize the whole blockchain. We didn't have weeks or days, so we decided to install Flowee The Hub - the fastest Bitcoin Cash node alive.
Let's just say - the documentation greatly oversimplifies what you need to do. It took us about 24 hours of stumbling across the docs and source code to get everything working ("The Hub" + "Indexer" + PostgreSQL + "Bitcore-Proxy" + "rest.bitcoin.com" copy).
Here comes the bad news. It didn't work in the end.
It turns out that the Bitcore Proxy doesn't implement all the functions that rest.bitcoin.com needs and that we need for read.cash.
So, that was a...
...dead end.
One of the many as it would turn out.
Bitcoin ABC, Insight, rest
Next on the list was installing Bitcoin ABC (which runs close to 100% of the Bitcoin Cash network), then Bitcore, the Insight API and finally the rest.bitcoin.com code. There should be a MongoDB node somewhere in between (which is already scary).
Turns out it's not that easy. You can't even use the regular Bitcoin ABC; you need the modified one from BitPay, and it only exists as version 0.18 while Bitcoin ABC is at 0.20 now, which is also scary to try. We would probably have been out of sync after the upgrade on November 15th. Besides, the documentation for all of this is practically non-existent. Do we need Bitcore or just Insight? Which of the commands do we need to run? There's a lot of info, but we haven't found a clear description of how to do it. And we needed something quickly.
Seriously, try to read this through. It's a year-old description and it's still "Open".
After that, we took a look at our Electron Cash wallet. It was running pretty well. How?
Electron Cash and ElectrumX
Turns out there's this thing called "ElectrumX" - an implementation of the "Electrum protocol" - it's a pretty simple protocol that you can use to connect to BCH full nodes run by volunteers. (Kudos!)
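To give an idea of how simple the protocol is: a request is one line of JSON-RPC sent over a plain TCP or SSL socket, and addresses are identified by a "script hash" (the SHA-256 of the output script, byte-reversed). The sketch below only builds such a request and opens no connection. The helper names are ours and the script is just sample input; blockchain.scripthash.listunspent is a method from the Electrum protocol.

```python
import hashlib
import json


def electrum_scripthash(script_pubkey_hex):
    """Return the script hash ElectrumX expects: SHA-256 of the script, byte-reversed, as hex."""
    digest = hashlib.sha256(bytes.fromhex(script_pubkey_hex)).digest()
    return digest[::-1].hex()


def electrum_request(request_id, method, params):
    """Serialize one JSON-RPC request as a single newline-terminated line."""
    payload = {"jsonrpc": "2.0", "id": request_id, "method": method, "params": params}
    return (json.dumps(payload) + "\n").encode("utf-8")


# A P2PKH output script, used purely as sample input.
script = "76a91462e907b15cbf27d5425399ebf6f0fb50ebb88f1888ac"
line = electrum_request(1, "blockchain.scripthash.listunspent", [electrum_scripthash(script)])
```

The resulting bytes would be written to the socket as-is, and the server answers with one JSON line carrying the same id.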
So, we decided to write a simple proxy that would implement the few functions that we needed from the rest.bitcoin.com node.
NodeJS
The first attempt was in NodeJS (actually, we hoped we could use Electrum's servers right from the browser).
It failed when we were almost done. Turns out you need a direct (non-HTTP) connection to ElectrumX. There is a WebSocket protocol, but none of the nodes that we tried supported it.
...dead end.
Ok, let's then build a server proxy in NodeJS. That kind of worked, but we couldn't really catch all the random disconnections and internet problems that we simulated by turning the Wi-Fi off and on.
...dead end.
It wasn't NodeJS's fault, but ours as developers (probably).
We just couldn't figure out fast enough how to catch all those problems.
Python
Next was a prototype in Python. We had to use the new async Python stuff and stumbled into mostly the same problems. When we turned the Wi-Fi off and on again, async Python would occasionally just sit there and do nothing at all, blocking everything, without a hint of an error or a timeout.
...dead end.
Ok, again, not Python's fault; it's us, the shitty developers, who couldn't do it.
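For what it's worth, the usual defence against an awaited call that never returns is to put an explicit deadline on it. The sketch below uses asyncio.wait_for from the standard library. We are not claiming it would have fixed our prototype; the helper name, the retry count and the fake flaky call are assumptions made for illustration.

```python
import asyncio


async def call_with_timeout(make_call, timeout_s, retries):
    """Await make_call() with a deadline, retrying when it hangs or the network fails."""
    last_error = None
    for _ in range(retries):
        try:
            # wait_for cancels the inner call and raises once the deadline passes.
            return await asyncio.wait_for(make_call(), timeout=timeout_s)
        except (asyncio.TimeoutError, OSError) as exc:
            last_error = exc
    raise ConnectionError(f"gave up after {retries} attempts") from last_error


calls = {"count": 0}


async def flaky_call():
    """Hypothetical stand-in for a node request: hangs on the first attempt only."""
    calls["count"] += 1
    if calls["count"] == 1:
        await asyncio.sleep(10)  # simulates a connection that went silent
    return "ok"


result = asyncio.run(call_with_timeout(flaky_call, timeout_s=0.05, retries=3))
```

The point is that every await that touches the network gets a deadline, so a silent connection turns into an error that can be handled instead of a hang.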
Go
Finally, the last prototype was in Go. It uses a ton of hacks (it has been a very long 48 hours and we've had very little sleep), but it kind of... sort of... works. Pretty resiliently.
Even though it's very hacky. Dear Go developers - close your ears. Seriously. There's JavaScript inside of our Go code. We bow our heads in shame. It was just much easier to write JS code to transform one JSON (from ElectrumX) to another JSON (rest.bitcoin.com) format. We should fix it, but it does the job.
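To show what kind of JSON-to-JSON transformation this is, here is a small sketch written in Python rather than in the JavaScript we actually embedded. The input fields (tx_hash, tx_pos, height, value) are what ElectrumX returns for unspent outputs. The output field names are our assumption of a rest.bitcoin.com-style UTXO list, and the function name is made up.

```python
def electrum_utxos_to_rest(unspent, tip_height):
    """Convert ElectrumX listunspent entries into a REST-style UTXO list.

    The output shape is an assumed approximation of what a
    rest.bitcoin.com-style client expects, not a verified schema.
    """
    utxos = []
    for item in unspent:
        height = item["height"]
        # ElectrumX reports height 0 for transactions that are still in the mempool.
        confirmations = tip_height - height + 1 if height > 0 else 0
        utxos.append({
            "txid": item["tx_hash"],
            "vout": item["tx_pos"],
            "satoshis": item["value"],
            "amount": item["value"] / 100_000_000,  # value in BCH
            "height": height,
            "confirmations": confirmations,
        })
    return {"utxos": utxos}


sample = [
    {"tx_hash": "aa" * 32, "tx_pos": 0, "height": 608000, "value": 150000},
    {"tx_hash": "bb" * 32, "tx_pos": 1, "height": 0, "value": 2000},
]
converted = electrum_utxos_to_rest(sample, tip_height=608002)
```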
We've even added the famous "Chaos Monkey" to the code, which randomly disconnects nodes so we can see how it would behave. So far, quite good.
But we're realists. Our code is only good enough as a "backup" node, when this happens to rest.bitcoin.com:
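The idea behind such a "Chaos Monkey" fits in a few lines: wrap a call so that it fails at random and watch how the surrounding code copes. The sketch below is in Python and is not our Go implementation; the wrapper name and the failure rate are assumptions.

```python
import random


def with_chaos(fn, failure_rate, rng=None):
    """Wrap fn so that a fraction of calls raise a simulated disconnect."""
    rng = rng or random.Random()

    def wrapped(*args, **kwargs):
        # rng.random() is in [0, 1), so a rate of 0.0 never fails and 1.0 always fails.
        if rng.random() < failure_rate:
            raise ConnectionError("chaos monkey: simulated disconnect")
        return fn(*args, **kwargs)

    return wrapped


def get_balance(address):
    """Hypothetical node call used only for the demonstration."""
    return {"address": address, "satoshis": 1000}


unreliable_get_balance = with_chaos(get_balance, failure_rate=0.3, rng=random.Random(42))
```

Passing a seeded random.Random makes a chaotic test run reproducible, which helps when a failure needs to be replayed.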
So, we'll still keep using rest.bitcoin.com and in the meantime keep exploring the options to run our own node.
The proxy is up and ready to take care of the users if rest.bitcoin.com starts having problems again.
It doesn't implement everything that rest.bitcoin.com does, but just the bare minimum required to make sure that read.cash works.
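The "primary first, backup second" behaviour can be sketched as follows. This is an illustration in Python, not the production code; the function name and both fake backends are hypothetical.

```python
def call_with_fallback(backends, *args):
    """Try each (name, function) backend in order and return the first success."""
    errors = []
    for name, fn in backends:
        try:
            return name, fn(*args)
        except Exception as exc:  # any failure means "try the next backend"
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all backends failed: " + "; ".join(errors))


def primary_down(txid):
    """Hypothetical primary node that is currently unreachable."""
    raise ConnectionError("primary node unreachable")


def backup_proxy(txid):
    """Hypothetical backup proxy that answers normally."""
    return {"txid": txid, "confirmations": 1}


used, answer = call_with_fallback(
    [("primary", primary_down), ("backup", backup_proxy)],
    "abc123",
)
```

Unlike a health check done before showing the QR code, this makes the decision per request, so a node that "flashes" between up and down is handled at the moment the call actually fails.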
To eat our own dog food - all the development is now happening using this node.
Hopefully you'll never see this:
...but we'll sleep better knowing that this exists.
One-time QR code precautions
The last thing that we needed to take care of after this disaster was to do something about the QR code in case all the precautions, checks and backup plans fail.
In this case The Anonymous Upvoter would see this:
It's a pretty bad user experience, but it's our last-resort attempt to make sure that the user at least doesn't lose the money.
Conclusion
Every time something like this happens, I never cease to wonder at how amazing it is that Bitcoin Cash "just works". I mean, sure, you have local failures like ours, but "the network" overall "just works".
Huge thanks go to everyone making this possible!
Hopefully tomorrow we'll be developing something more interesting for read.cash, rather than dealing with the infrastructure again.
Not sure if this helps but https://twitter.com/zquestz/status/1195419982479400960?s=21