The "red banner" day (post-mortem)

Read.Cash
5 years ago

I'm sure many of you noticed a red banner on November 13th saying that we were having a really, really bad day with the Bitcoin Cash REST endpoint at rest.bitcoin.com.

Don't get us wrong - we're very grateful to bitcoin.com and Roger Ver for providing this node for free - it's a great service to the Bitcoin Cash community! We understand that you can't demand 99.999% reliability from something you use for free.

But the prolonged downtime did happen, and we spent the previous 2 days frantically exploring options for a backup in case this happens again, because payments were failing.

If you had an account - that's great - that means that in the worst case your payment didn't go through and stayed in your wallet (bad news for the author, though).

The situation was much worse for the "One time QR code" payments. Some people lost some money forever. As far as we know it's about $2 in total, but still it's their money.

In one case that was reported to us, we sent the author our own BCH to replace the forever-lost $1.50 that the upvoter had sent.

Here's the problem.

When you upvote something on read.cash - this needs to happen:

Often it's simpler:

But still it's not something that we can do with a single QR code. We could do it with BitPay invoices, I think, but not all devices and software support those.

In order to achieve this, we create an ephemeral (one-time) Bitcoin Cash wallet right in your browser.

Your browser controls it.

We don't know its private key or mnemonic, or anything (that's why there's now at least $1.50 in BCH that is "lost forever" on the blockchain).

So, usually this happens:

You send a transaction to your browser. Your browser crafts a really cool "fan out" transaction to pay everyone involved (mostly the author, though).
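The fan-out step can be sketched roughly like this. It is a minimal illustration, not read.cash's actual code: the recipient names, the weights, the fee and the 546-satoshi dust threshold are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List

DUST_LIMIT_SATS = 546  # assumed dust threshold; smaller outputs are skipped


@dataclass(frozen=True)
class Output:
    address: str
    satoshis: int


def fan_out(received_sats: int, shares: Dict[str, int], fee_sats: int) -> List[Output]:
    """Split the received amount (minus the fee) across recipients by integer weight."""
    spendable = received_sats - fee_sats
    if spendable <= 0:
        raise ValueError("received amount does not cover the fee")
    total_weight = sum(shares.values())
    outputs = []
    for address, weight in shares.items():
        # Integer division: any rounding remainder is left to the miner fee.
        amount = spendable * weight // total_weight
        if amount >= DUST_LIMIT_SATS:
            outputs.append(Output(address, amount))
    return outputs
```

Calling `fan_out(100_000, {"author-address": 90, "commenter-address": 10}, fee_sats=500)` yields one output per recipient. The ephemeral wallet would then sign a single transaction with those outputs, which is exactly the step that needs a working node.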

Now, imagine a situation like yesterday's. The Upvoter has a connection to a working BCH node. But their browser doesn't, because it tries to access the node that is failing.

This happens:

The problem is that this first arrow is really one-way.

  • Your browser can't use the same node that you used to send the transaction to it.

  • Your browser can't find out where the money came from to send it back.

  • Even if it could - it would still not be able to send it back.

What we tried

The first and simplest thing that we did was to ensure that the node is working before showing the QR code.

That didn't help much, since the node was "flashing" - it would seem like it was working, then it was down again the next second, when it needed to send "Transaction 2".

Our guess is that there is a load balancer and some of the nodes behind it were down, so the connectivity appeared randomly affected.
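Why a single up-front check is not enough can be shown with a toy simulation. This is a sketch under stated assumptions: the endpoint is modelled as succeeding independently on each request with some probability, which is only a rough stand-in for a load balancer with some dead backends, and all names are hypothetical.

```python
import random


class FlappingEndpoint:
    """Toy endpoint where each request succeeds with probability `healthy_ratio`."""

    def __init__(self, healthy_ratio: float, seed: int = 0) -> None:
        self._healthy_ratio = healthy_ratio
        self._rng = random.Random(seed)

    def request_ok(self) -> bool:
        return self._rng.random() < self._healthy_ratio


def one_time_payment(endpoint: FlappingEndpoint) -> str:
    """Simulate one QR-code payment that is guarded by a health check."""
    if not endpoint.request_ok():
        return "qr_hidden"  # check failed: no QR code is shown, no money moves
    # The check passed, so the upvoter pays the ephemeral wallet (Transaction 1).
    # The browser now needs the same endpoint again to send "Transaction 2".
    if endpoint.request_ok():
        return "paid"
    return "funds_stuck"  # the node looked fine, then failed a moment later


def simulate(healthy_ratio: float, payments: int, seed: int = 0) -> dict:
    """Count the outcomes of `payments` simulated one-time payments."""
    endpoint = FlappingEndpoint(healthy_ratio, seed)
    counts = {"qr_hidden": 0, "paid": 0, "funds_stuck": 0}
    for _ in range(payments):
        counts[one_time_payment(endpoint)] += 1
    return counts
```

With a fully healthy or fully dead endpoint nothing gets stuck. It is the in-between case, such as `simulate(0.5, 1000)`, that produces `funds_stuck` outcomes, because the check and the later broadcast are two separate requests.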

Since the private key was generated in the browser for the one-time transaction - after the failed attempt - the browser was closed and the private key was gone forever with the browser's garbage collector (a real thing).

Let's run our own full node!

First of all, we know that running a node is not easy. It's not like dusting off a Raspberry Pi and running a few commands.

It requires constant attention, especially during upgrades (and one is coming today!). But it seemed like the only way. We needed a working copy of rest.bitcoin.com... and fast...

Regular full nodes would take weeks (days?) to synchronize the whole blockchain. We didn't have weeks or days, so we decided to install Flowee The Hub - the fastest Bitcoin Cash node alive.

Let's just say - the documentation greatly oversimplifies what you need to do. It took us about 24 hours of stumbling across the docs and source code to get everything working ("The Hub" + "Indexer" + PostgreSQL + "Bitcore-Proxy" + "rest.bitcoin.com" copy).

Here comes the bad news. It didn't work in the end.

It turns out that the Bitcore Proxy doesn't implement all the functions that rest.bitcoin.com needs and that we need for read.cash.

So, that was a...

...dead end.

One of the many as it would turn out.

Bitcoin ABC, Insight, rest

Next on the list was installing Bitcoin ABC (which runs close to 100% of the Bitcoin Cash network) and then Bitcore, Insight API and then the rest.bitcoin.com code. There should be a MongoDB node somewhere in between (which is already scary).

Turns out it's not that easy. You can't even use regular Bitcoin ABC: you need the modified one from BitPay, and it's only available as version 0.18 while Bitcoin ABC is at 0.20 now, which is also scary to try. We would probably have been out of sync after the upgrade on November 15th. Besides, the documentation for all of this is practically non-existent. Do we need Bitcore or just Insight? Which of the commands do we need to run? There's a lot of info, but we couldn't find a clear description of how to do it. And we needed something quickly.

Seriously, try to read this through. It's a one-year-old description and it's still "Open".

After that, we took a look at our Electron Cash wallet. It was running pretty well. How?

Electron Cash and ElectrumX

Turns out there's this thing called "ElectrumX" - an implementation of the "Electrum protocol" - a pretty simple protocol that you can use to connect to BCH full nodes run by volunteers. (Kudos!)

So, we decided to write a simple proxy that would implement the few functions that we needed from the rest.bitcoin.com node.
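To give a feel for how simple the protocol is, here is a minimal sketch of one request. It assumes the newline-delimited JSON-RPC framing and the `blockchain.scripthash.listunspent` method described in the ElectrumX protocol documentation; the helper names are ours, and `call` assumes a plain TCP port and a single-line reply with no interleaved notifications.

```python
import hashlib
import json
import socket


def script_hash(script_pubkey_hex: str) -> str:
    """Electrum-style script hash: SHA-256 of the output script, byte-reversed, as hex."""
    digest = hashlib.sha256(bytes.fromhex(script_pubkey_hex)).digest()
    return digest[::-1].hex()


def encode_request(request_id: int, method: str, params: list) -> bytes:
    """Serialize one JSON-RPC 2.0 request as a newline-terminated line."""
    message = {"jsonrpc": "2.0", "id": request_id, "method": method, "params": params}
    return json.dumps(message).encode("utf-8") + b"\n"


def decode_response(line: bytes):
    """Return the `result` of a JSON-RPC response line, or raise on an error reply."""
    message = json.loads(line)
    if message.get("error") is not None:
        raise RuntimeError(f"server error: {message['error']}")
    return message["result"]


def call(host: str, port: int, method: str, params: list, timeout: float = 5.0):
    """Send one request over a plain TCP socket and wait for a one-line reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(encode_request(1, method, params))
        buffer = b""
        while not buffer.endswith(b"\n"):
            chunk = sock.recv(4096)
            if not chunk:
                raise ConnectionError("server closed the connection")
            buffer += chunk
    return decode_response(buffer)
```

A proxy would accept the REST call, compute the script hash for the address, run something like `call(host, port, "blockchain.scripthash.listunspent", [script_hash(script)])` and reshape the answer. The raw socket is also the reason this cannot run in a browser.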

NodeJS

The first attempt was in NodeJS (actually, we hoped we could use Electrum's servers right from the browser).

It failed when we were almost done. Turns out you need a direct (non-HTTP) connection to ElectrumX. There is a WebSocket protocol, but none of the nodes that we tried supported it.

...dead end.

Ok, let's then build a server proxy in NodeJS. That kind of worked, but we couldn't really catch all the random disconnections and internet problems that we simulated by turning the Wi-Fi off and on.

...dead end.

It wasn't NodeJS's fault, but ours as developers (probably).

We just couldn't figure out fast enough how to catch all those problems.

Python

Next was a prototype in Python. We had to use the new async Python stuff and stumbled into mostly the same problems. When we turned Wi-Fi off and on again, async Python would occasionally just sit there and do nothing at all, blocking everything. Without a hint of an error or timeout.
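A common guard against this kind of silent stall is to put an explicit deadline on every network await. The sketch below shows the idea with `asyncio.wait_for`; it is not the prototype's code, and `stalled_read` and `healthy_read` are hypothetical stand-ins for real socket reads. A deadline only helps while the event loop itself keeps running, so a blocking call inside a coroutine would still freeze everything.

```python
import asyncio


async def with_deadline(awaitable, seconds: float, default=None):
    """Await `awaitable`, but give up and return `default` after `seconds`."""
    try:
        return await asyncio.wait_for(awaitable, timeout=seconds)
    except asyncio.TimeoutError:
        return default


async def stalled_read() -> bytes:
    """Hypothetical stand-in for a socket read that never returns."""
    await asyncio.sleep(3600)
    return b"never reached"


async def healthy_read() -> bytes:
    """Hypothetical stand-in for a read that completes promptly."""
    await asyncio.sleep(0)
    return b"pong"
```

With this wrapper, `asyncio.run(with_deadline(stalled_read(), 0.05))` comes back with `None` instead of hanging, and the caller can treat that as a dead connection and reconnect.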

...dead end.

Ok, again, not Python's fault, it's us, the shitty developers, who couldn't do it.

Go

Finally, the last prototype was in Go. It uses a ton of hacks (it has been a very long 48 hours and we've had very little sleep during those), but it kind of... sort of... works. Pretty resiliently.

Even though it's very hacky. Dear Go developers - close your ears. Seriously. There's JavaScript inside of our Go code. We bow our heads in shame. It was just much easier to write JS code to transform one JSON (from ElectrumX) to another JSON (rest.bitcoin.com) format. We should fix it, but it does the job.
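The JSON-to-JSON step looks roughly like the sketch below, written here in Python rather than the JavaScript-inside-Go we actually used. The input field names (`tx_hash`, `tx_pos`, `height`, `value`) follow the ElectrumX `listunspent` result; the output field names are an assumption about what a rest.bitcoin.com-style client expects and should be checked against the real API.

```python
SATS_PER_BCH = 100_000_000


def to_rest_utxos(electrum_utxos: list, tip_height: int) -> dict:
    """Convert an ElectrumX `listunspent` result into a REST-style UTXO payload."""
    utxos = []
    for item in electrum_utxos:
        height = item["height"]
        # ElectrumX reports height 0 for transactions that are still in the mempool.
        confirmations = tip_height - height + 1 if height > 0 else 0
        utxos.append({
            "txid": item["tx_hash"],
            "vout": item["tx_pos"],
            "satoshis": item["value"],
            "amount": item["value"] / SATS_PER_BCH,
            "height": height,
            "confirmations": confirmations,
        })
    return {"utxos": utxos}
```

Note that the confirmation count needs the current chain tip, which is a second piece of information the proxy has to fetch from the ElectrumX server.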

We've even added the famous "Chaos Monkey" to the code that randomly disconnects nodes to see how it would behave. So far, quite good.
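The idea behind the Chaos Monkey can be sketched with a toy connection pool. This is an illustration only, not the Go code: the class names are hypothetical, the five backends mirror the number of ElectrumX nodes we connect to, and reconnecting always succeeds in this model, which real networks do not guarantee.

```python
import random


class Backend:
    """Toy stand-in for one upstream ElectrumX connection."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.connected = True

    def query(self) -> str:
        if not self.connected:
            raise ConnectionError(f"{self.name} is disconnected")
        return f"answer from {self.name}"

    def reconnect(self) -> None:
        self.connected = True  # in this toy model reconnecting always succeeds


class Pool:
    """Answers queries from whichever backend is currently alive."""

    def __init__(self, backends: list, seed: int = 0) -> None:
        self.backends = backends
        self._rng = random.Random(seed)
        self.kills = 0

    def chaos_monkey(self, probability: float) -> None:
        """Disconnect each live backend with the given probability."""
        for backend in self.backends:
            if backend.connected and self._rng.random() < probability:
                backend.connected = False
                self.kills += 1

    def query(self) -> str:
        """Answer from the first live backend, reconnecting dead ones on the way."""
        for backend in self.backends:
            try:
                return backend.query()
            except ConnectionError:
                backend.reconnect()
        # Every backend was down at once; all were just reconnected, so retry once.
        return self.backends[0].query()
```

Running `chaos_monkey(0.5)` before every `query()` in a loop is the test: if the pool keeps answering while connections are being killed underneath it, the failover logic is doing its job.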

But we're realists. Our code is only good enough as a "backup" node, when this happens to rest.bitcoin.com:

So, we'll still keep using rest.bitcoin.com and in the meantime keep exploring the options to run our own node.

The proxy is up and ready to take care of the users if rest.bitcoin.com starts having problems again.

It doesn't implement everything that rest.bitcoin.com does, but just the bare minimum required to make sure that read.cash works.
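The switch-over logic can be sketched like this. It is a minimal illustration under assumptions: the 60-second check interval is the "within a minute" detection target, the backup URL is a placeholder, and `probe` is a hypothetical callback that returns True when the endpoint answers correctly.

```python
import time
from typing import Callable


class EndpointSelector:
    """Pick the primary endpoint while it is healthy, otherwise the backup."""

    def __init__(self, primary: str, backup: str, probe: Callable[[str], bool],
                 check_interval: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._primary = primary
        self._backup = backup
        self._probe = probe                  # returns True if the URL answers correctly
        self._check_interval = check_interval
        self._clock = clock
        self._last_check = None              # time of the most recent probe
        self._primary_up = True

    def current(self) -> str:
        """Return the URL to use now, re-probing at most once per interval."""
        now = self._clock()
        if self._last_check is None or now - self._last_check >= self._check_interval:
            self._primary_up = self._probe(self._primary)
            self._last_check = now
        return self._primary if self._primary_up else self._backup
```

Because both nodes speak the same REST dialect, the rest of the code only needs to ask `current()` for a base URL. The trade-off of probing once per interval is that requests made between a failure and the next probe still hit the dead primary.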

To eat our own dog food - all the development is now happening using this node.

Hopefully you'll never see this:

...but we'll sleep better knowing that this exists.

One-time QR codes precautions

The last thing that we needed to take care of after this disaster was to do something about the QR code in case all the precautions, checks and backup plans fail.

In this case The Anonymous Upvoter would see this:

It's a pretty bad user experience, but it's our attempt of last resort to make sure that the user at least doesn't lose the money.
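The last-resort flow can be sketched as follows: try every available way to broadcast, and only if all of them fail, hand the one-time wallet's mnemonic to the user so the funds can be swept back. This is a simplified illustration with hypothetical names, not the browser code itself.

```python
from typing import Callable, Iterable, Tuple


def settle_or_reveal(broadcasters: Iterable[Callable[[], str]],
                     mnemonic: str) -> Tuple[str, str]:
    """Try each broadcaster in order; reveal the mnemonic only if all of them fail.

    Each broadcaster is a zero-argument callable that returns a transaction id
    or raises ConnectionError / TimeoutError.
    """
    for broadcast in broadcasters:
        try:
            return ("paid", broadcast())
        except (ConnectionError, TimeoutError):
            continue  # this node is unreachable, try the next one
    # Nothing worked: the money is still in the one-time wallet, so give the
    # user the only thing that can recover it before the tab is closed.
    return ("show_mnemonic", mnemonic)
```

The important property is that the mnemonic is shown while the page is still open, since the key exists nowhere else.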

Conclusion

Every time something like this happens I never cease to wonder at how amazing it is that Bitcoin Cash "just works". I mean, sure, you have local failures like ours, but "the network" overall "just works".

Huge thanks go to everyone making this possible!

Hopefully tomorrow we'll be developing something more interesting for read.cash, rather than dealing with the infrastructure again.


Comments

$ 0.10
5 years ago

Thanks for the advice! Though we can't use it :)

You can't call gRPC directly from the browser, so it's again the custom solution. Also the solution that we have now has like 5 backend nodes to get information from, if we were to use that one - we'd be back to "one backend".

Also their protocol is different and we wanted something that has the same protocol as rest.bitcoin.com, so that we only have to switch nodes, not support two different protocols. The problem is that if you have to support a second protocol too - that would mean one more "moving" part. I.e. it's possible that during the downtime of rest.bitcoin.com we'd discover that we have errors in the protocol implementation for BCHD and still be down :)

To be clear - we're reasonably sure that our final solution would work during the downtime.

But, again, thanks for the effort!

$ 0.00
5 years ago

Oh, yes you can do gRPC requests straight from the browser just fine ;)

https://gitlab.com/acid.sploit/go-playground/blob/master/grpc/grpc-web-docs/mempool-monitor/src/Live.js

https://gitlab.com/acid.sploit/go-playground/blob/master/grpc/grpc-web-docs/bchrpc-web.md

As to your other gripes with gRPC, I would consider completely moving over to gRPC and dropping the JSON REST API. There are probably a couple more public servers, or you could also spin one up yourself.

$ 0.50
5 years ago

Cool, thanks for the links! We'll research them later. For now it seems our solution should work just fine in any case, but if it doesn't - we'll research gRPC.

$ 0.00
5 years ago

I would advise you to consider implementing a simple database table that can hold transactions that should be broadcast, but have not been broadcast yet due to technical issues. In case Bitcoin.com or your own node is down, you just write the transactions temporarily into your database, so that you can broadcast them later once the downtime ended.

$ 0.10
5 years ago

The problem is that if we have no connectivity - then we can't get UTXOs and therefore can't build the transaction and sign it. We could overengineer a solution where our backend would try to reach one of the available nodes for UTXOs and give them to the user, then we'd proxy all transactions through our server, caching and retrying them.

But currently we should have like 99.99..% availability (if rest.bitcoin.com goes down - we should detect it within a minute and until it goes up - all requests go to the backup node, which is resiliently connected to 5 other ElectrumX nodes). I don't think that 0.01% is worth the engineering effort, which is better spent on better features (like paywalls, sponsorships and improving the editor).

For the unlucky 0.01% we just show the mnemonic to sweep the funds back. I really think that's enough :)

Though thank you for your thoughts! Appreciated!

$ 0.00
5 years ago

Try a Bitcoin Unlimited node. Almost 50% of Bitcoin Cash nodes are running it.

$ 0.05
5 years ago

If Bitcore requires a specially modified Bitcoin ABC, it's very probable that it requires a modified Bitcoin Unlimited also. The problem is not with the node - Flowee is excellent as a node, the problem is with rest.bitcoin.com compatibility.

Thanks for the advice though!

$ 0.00
5 years ago

Will there be a feed with new posts on read.cash?

$ 0.00
5 years ago

Do you mean like RSS feed?

$ 0.00
5 years ago

It doesn't have to be an RSS feed. A page with the newest posts is enough.

$ 0.00
5 years ago

I don't understand. Isn't the main page https://read.cash what you're looking for?

$ 0.00
5 years ago

The main page is not sorted by newest. Sometimes older posts appear before the newer ones.

$ 0.00
5 years ago

Yeah, it's sorted by the "priority", so upvoting kind of gives a boost. Ok, I understand what you mean - kind of like "hot", "new", "top" links on Reddit. Yeah, those are planned.

$ 0.06
5 years ago

Yes, that's what I mean.

$ 0.00
5 years ago

That is now done!

$ 0.10
5 years ago

Great, thanx.

$ 0.00
5 years ago