A whole day lost to an unexpected issue with Hive Engine witness nodes

Avatar for bala41288
2 years ago

When I woke up this morning, I was quite happy to see the price of Hive, which I check every morning. As part of the same routine, I also check the witness nodes and visit the Hive Engine witness channel on Discord to see whether anything interesting or urgent is going on. This morning I found that there were some issues. By the way, I run a Hive Engine witness node and also manage witnesses for other people as a SaaS: they pay me for the server and maintenance, and I run a witness node for them.

That is when I knew the whole day would be spent on this, and that is exactly what happened. I had to restore and fix things all day. In between, I did a lot of reading while waiting for the restoration to finish and the blocks to catch up. It was not just me; many fellow witnesses had to do the same thing. For some reason, there was a block at which the witnesses diverged.

Why does it take so long?

If something goes wrong with the witness servers, round verification stops. New blocks are still created, but they are not verified. The witnesses then have to work out what went wrong and apply a fix. A simple issue needs only a simple fix, but if witnesses get forked out, we have to restore the blockchain from a recent backup.
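To make "forked out" a little more concrete, here is a minimal sketch of how one could locate the first block at which a node's history differs from a reference node's. It is only an illustration: the function name and the idea of comparing two lists of block hashes are my assumptions, not how the Hive Engine node software actually detects divergence.

```python
from typing import Optional, Sequence


def first_divergence(mine: Sequence[str], reference: Sequence[str]) -> Optional[int]:
    """Return the index of the first block whose hash differs, or None.

    Both sequences are assumed (hypothetically) to hold block hashes in
    block order, starting from the same block number.
    """
    for index, (my_hash, ref_hash) in enumerate(zip(mine, reference)):
        if my_hash != ref_hash:
            return index
    return None


# Made-up hashes: the two histories agree on the first two blocks only.
my_hashes = ["aa11", "bb22", "cc33", "dd44"]
reference_hashes = ["aa11", "bb22", "ee55", "ff66"]
diverged_at = first_divergence(my_hashes, reference_hashes)
```

A backup taken before the diverging block would be a safe restore point; anything taken after it would carry the bad state along.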

This is where the major challenge lies. The backup file alone is 20 GB. We have to download it from a common server, which takes close to 30 minutes on average. After that, we have to restore the MongoDB database from it, which is the most painful part. The database restoration can take anywhere from 2 to 8 hours. On machines without NVMe storage, it can even take days. I recently upgraded my servers to NVMe, and things are better now.
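The numbers above can be turned into a rough back-of-the-envelope estimate. The sketch below only does arithmetic on the figures from this post (20 GB, 30 minutes, 2 to 8 hours); the function names are hypothetical, and I am assuming decimal units (1 GB = 1000 MB).

```python
from typing import Tuple


def average_throughput_mb_per_s(size_gb: float, minutes: float) -> float:
    """Average download speed in MB/s, assuming 1 GB = 1000 MB."""
    return size_gb * 1000 / (minutes * 60)


def recovery_window_hours(
    download_minutes: float, restore_hours_min: float, restore_hours_max: float
) -> Tuple[float, float]:
    """Best-case and worst-case hours for download plus database restore.

    Block catch-up time after the restore is not included.
    """
    download_hours = download_minutes / 60
    return (download_hours + restore_hours_min, download_hours + restore_hours_max)


# Figures taken from the post: 20 GB in about 30 minutes, restore in 2 to 8 hours.
speed = average_throughput_mb_per_s(20, 30)   # roughly 11 MB/s
window = recovery_window_hours(30, 2, 8)      # (2.5, 8.5) hours
```

So a single restore attempt costs somewhere between 2.5 and 8.5 hours before the node even starts catching up, which is why repeating it several times eats the whole day.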

After the database restoration is done, we have to wait for the blocks to catch up. Only then can we register the witness. Even after that, the blocks can diverge again, which happened a couple of times today, so I had to repeat the restoration again and again. At the time of writing this article, I'm doing the database restoration for the 4th time. I hope everything goes smoothly this time.
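How long the catch-up takes depends on how far behind the restored node is and how fast it replays blocks compared with how fast new ones are produced. Here is a small sketch of that estimate; the function name, parameters, and the block numbers and rates in the example are all made up for illustration and are not measurements from my nodes.

```python
from typing import Optional


def catch_up_eta_seconds(
    local_block: int,
    head_block: int,
    replay_rate: float,
    production_rate: float,
) -> Optional[float]:
    """Estimate the seconds until a restored node reaches the chain head.

    replay_rate and production_rate are in blocks per second. Returns 0.0
    if the node is already caught up, and None if the node replays no
    faster than new blocks are produced (it would never catch up).
    """
    gap = head_block - local_block
    if gap <= 0:
        return 0.0
    net_rate = replay_rate - production_rate
    if net_rate <= 0:
        return None
    return gap / net_rate


# Hypothetical example: 900 blocks behind, replaying 10 blocks/s while
# the chain adds 1 block/s, so the gap closes at 9 blocks/s.
eta = catch_up_eta_seconds(1000, 1900, 10.0, 1.0)
```

The older the backup, the bigger the gap, so a fresher backup shortens this step even though it does nothing for the restore time itself.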

A few more hours before we are fully up and running

I guess we should all be up and running in a few more hours. Some witnesses are currently restoring, and some have already started verifying blocks, but we still don't have enough witnesses. I'm waiting for this to complete so that I can go to sleep. The whole day went into this activity, and I don't want to sleep without completing the task.

It was indeed a good learning experience. I was able to explore a few ways to improve how we take backups and restore the witnesses from them. I wish we had something similar to the differential backups available in MSSQL. Mongo doesn't support such incremental backups efficiently yet.

Posted with STEMGeeks
