Analyzing the Combinatoric Math in CashFusion

7 3176

CashFusion is actually the fusion of two powerful ideas. One is a contribution from myself, which is (in a nutshell): enable trustless, private, multi-input coinjoins with blame capabilities by using a commitment-based multiparty computation scheme.

The other idea, the one everyone is so excited about, comes from Mark Lundeberg. To quote the CashFusion spec:

"In CashFusion, we have opted to abandon the equal-amount concept altogether. While this is at first glance no different than the old naive schemes, mathematical analysis shows it in fact becomes highly private by simply increasing the numbers of inputs and outputs. For example, with hundreds of inputs and outputs, it is not just computationally impractical to iterate through all partitions, but even with infinite computing power, one would find a large number of valid partitions."

Then Mark provides some "napkin math":

"Now, there are 100!/(10!)^10 = 10^92 ways to partition the inputs into a list of 10 sets of 10 inputs, but only a tiny fraction of these partitions will produce the precise output list. So, how many ways produce this exact output list? We can estimate with some napkin math. First, recognize that for each partitioning, each output will typically land in a range of 10^8 discrete possibilities (around 1 BCH wide, with a 0.00000001 BCH resolution). The first 9 outputs all have this range of possibilities, and the last will be constrained by the others. So, the 10^92 possibilities will land somewhere within a 9-dimensional grid that contains (10^8)^9=10^72 possible distinct sites, one site which is our actual output list. Since we are stuffing 10^92 possibilities into a grid that contains only 10^72 sites, then this means on average, each site will have 10^20 possibilities"

The purpose of this article is to examine Mark's math, try to make sense of what it means, and hopefully help show why it's generally correct.

Now, first of all, I'm not a mathematician. But like many people, Bitcoin has inspired me to learn more than I probably would have. It's also pretty amazing how much cool math stuff is out there on Wikipedia. So i'll link to a few interesting articles that are relevant here as I go along.

What Does the Spec Claim?

First of all, is it true that it is "computationally impractical to iterate through all partitions"? That question is straightforward and answerable, because we can compute how many partitions we are talking about for a typical transaction, and the numbers are huge. Longer-than-the-age-of-the-universe-to-compute huge.

The other part: "even with infinite computing power, one would find a large number of valid partitions" is unclear how it could be proven rigorously, but we can logically argue it is true, and also demonstrate on a mini scale.

Let's look at the first part first:


Stirling Numbers

Combinatorics is a major branch of math that deals with counting and combinations.

Fortunately, there is already a great tool for solving the first question. Stirling numbers of the second kind allow us to determine the number of ways to partition a set of n objects into k non-empty subsets.

So, in CashFusion, let's say we have 8 players in a fusion round, and the transaction is 50 inputs and 50 outputs. The inputs and outputs are separate and distinct sets here.

For the inputs, we want to know how many possible ways can we break down 50 objects into 8 non-empty subsets. This is written as StirlingS2[50,8].

And the answer is: 35041731132610098771332691525663865902850, or 3.5 x 10^40.

Then we can do the same for the outputs, and in this case is the same number.

We can then multiply the 2 sets' number of partitions because any of the partitions on the input set can go with any of the partitions of the output set as a possible configuration for the transaction, and we get approximately 10^81.

10^81 possible combinations! Which is about how many atoms there are in the observable universe. Of course this number can get crazier. With a hundred inputs and outputs, StirlingS2[100,8]^2 =10^171.

But back to the example with 50 inputs and outputs: Also this is with knowing a fixed amount of 8 players, which is not really knowable looking at the blockchain. The total amount of all possible combinations based on pure chainanalysis would require you to sum up StirlingS2[50,1]...StirlingS2[50,2]... etc...all the way up to 50.

In this case, since inputs and outputs are the same, we get a Bell number. Usually, either the inputs or outputs would be higher than the other, so you'd have to stop the summation at the lesser of the two for it to make sense, but you'd get a similar number.


How Many Valid Partitions Do You Get?

This is a difficult problem, or at least seems to be. We can at least state the problem as follows:

Given set A with n elements and set B with m elements, where the elements are integers in the range of 1-10^9, how many partitions (on average) exist where all the groups of A and B match in terms of the sums of their elements.

The problem also may assume an exponential distribution of random values for the elements (or something).

Here's a few related problems, and links you might find enlightening or interesting, but if you look into it, as I have done, you will probably find that none of these are quite the same problem as ours.

https://en.wikipedia.org/wiki/Partition_problem

https://en.wikipedia.org/wiki/Subset_sum_problem

https://en.wikipedia.org/wiki/Knapsack_problem

https://en.wikipedia.org/wiki/Arithmetic_combinatorics

https://en.wikipedia.org/wiki/Composition_(combinatorics)

https://en.wikipedia.org/wiki/Partition_function_(number_theory)

Mini Fusion

Even though the problem is difficult, we should still be able to give an approximate solution. But first, it is instructive to look at a mini example of how effective fusion is at its heart.

Mark wrote a simulation program that takes 3 players, each with 6 inputs and 1 output. Each run of the program is randomly different, but when I ran it, I got this:

Input values:

[[ 11 20 35 37 166 207]
[ 12 82 84 102 120 190]
[ 4 5 40 64 94 307]]

Output values: [476 590 514]

All Transaction inputs: [ 4 5 11 12 20 35 37 40 64 82 84 94 102 120 166 190 207 307]

67.0 ways exist at actual configuration
list of all candidate decompositions with sum [476 590 514]

[[ 20 40 64 84 102 166]
[ 5 12 82 94 190 207]
[ 4 11 35 37 120 307]]

[[ 12 64 84 94 102 120]
[ 5 35 37 40 166 307]
[ 4 11 20 82 190 207]]

[[ 12 64 84 94 102 120]
[ 5 11 37 40 190 307]
[ 4 20 35 82 166 207]]

...

Now I didn't list all 67 ways this can add up. I just listed 3 of them. Hard to believe there's 67 ways right? Go ahead, pull out your calculator. Check a few. Run the program if you want. It's amazing and counter-intuitive.

Now, unfortunately, we can't do the same thing for bigger combinations because it would take forever to run.

We can argue that "Even though actual bitcoin amounts will add more decimal precision making matches more rare, we'll more than make up for it because the total amount of partitions will just be insane.

But is it really true?

Estimation

Since it is unclear how to solve the problem perfectly, let us attempt to estimate it. The basic approach Mark is taking is looking at the "size of the space". What does that mean, and does it hold water?

I arrived at a similar understanding by looking at the problem like so:

Say we have 50 inputs, 50 outputs, and 8 players. We know there's 10^40 ways we could divide the inputs into 8 groups, and the same for the outputs.

Now here's the key to the insight: If we took 2 groups at random (one from the inputs and one from the outputs) and compared them, what's the probability they would match if it was completely 100% random?

(We know it's not 100% random, but we'll come back to this point.)

Assuming Randomness

For the purposes of illustration, say the maximum value of a player's inputs or outputs is 10 BCH. That means that the set of possible values is somewhere in the range of 1-1,000,000,000 satoshis.

So if you had a certain input sum value, the chances it would match an output set randomly would be 1 in a billion. So each player would have a 1 in a billion shot at matching up their inputs and output amounts if it was assigned randomly.

You could then multiply those odds with itself for each new player. So if you have 8 players, you'd have one in a billion times one in a billion...etc. Actually there is one small twist: you only have to multiply it by itself 7 times instead of 8; (by process of elimination the last player's sums would also be correct.)

(Side note: this is also what Mark was talking about with a "9-dimensional grid".)

So you would get (10^9)^7, or 10^63, which is far smaller than 10^81. In other words, even though you only have a 1 in 10^63 chance of randomly picking a valid partition, you have 10^81 chances to do so. So you should end up with 10^19 valid partitions on average.

So, I'm basically saying the same thing Mark is, even if I arrived at the conclusion by taking a slightly different path.

Why randomness is an acceptable model

We know that each possible partition is the result of combinatorics working in effect with the initial configuration of the transaction, rather than being some purely random number. However, the configuration itself is for all intents and purposes random: the outputs are chosen randomly by the Electron Cash wallet, and the inputs can be whatever the players bring to the table, including outputs from previous fusions.

More importantly, the huge number of partitions combined with the availability of a plentiful number of transaction components of various sizes (small, medium, large, etc) makes it so that the combinations in general would take on a wide range of values. A partition can contain the smallest value by itself, or all the values including the largest, or anything in between. Thus, intuitively, it has a "wide distribution" much like a random number.

In fact, it could be true that the discrete set of numbers would be MORE likely to make matches possible, as we can witness in the mini fusion example.

An interesting tidbit I found in researching this, is if you take a toy example of 3 numbers which are either a 1 or 2, then you get combinations like "111, 112, 121...222". And if you sum those you get 3,4,4,4....6. A lot of ways to add up to 4 but not a lot of ways to add up to 3 or 6. If you were to graph it, you get a "lump" in the middle. There's some kind of "overlap" effect and an additive property that seems to make lots of combinations that work.

Conclusion

So the bottom line is, the number of partitions can grow a lot faster than the "rareness" of matches due to the granularity of Bitcoin transactions. Again, you just take a typical range of values (say a billion) and multiply it by itself for however many players you have (minus 1)... and that number is going to be many orders of magnitude smaller than the total number of possible partitions.

Proving it more rigorously sounds like a very interesting math problem and I'm looking forward to more of Mark's thoughts on the topic.

7
$ 110.21
$ 55.00 from @RogerVer
$ 15.00 from @MarcDeMesel
$ 14.37 from @molecular
+ 21

Comments

Greate work. Can you also link the last 2 wiki articles?

$ 0.00
4 years ago

read.cash didnt let me hyperlink those with the paranthesis in the anchor text for some reason.

$ 0.00
4 years ago
$ 0.00
4 years ago

Great insights from Jonald as usual.

$ 0.15
4 years ago

thanks jonald, great writeup that makes the issue more graspable for me. combinatorics and probability theory stuff can be so counter-intuitive.

the issue reminds me of a quote:

"it's very likely something highly unlikely will happen today simply because there is such a multitude of highly unlikely things that could happen."

$ 0.10
4 years ago

That's deep!

$ 0.00
4 years ago

https://www.comsys.rwth-aachen.de/fileadmin/papers/2017/2017-maurer-trustcom-coinjoin.pdf

White with brute force it the number of partitions is the bottle neck, which would be iterating through them twice:

https://github.com/nopara73/Notes/blob/master/BellNumber.md

It's actually possible to optimize, but not much. The Knapsack paper gave a lower bound estimation for how fast the subsets can be established in the best case, which is O((2^n)∗m).

https://github.com/nopara73/Notes/blob/master/SubSetSum.md

$ 0.10
4 years ago