# Diagonal folding: Folding protocols with a large amount of rounds using 2-round Protostar.

*Thanks to DoHoon and Srinath Setty for useful discussions and suggestions.*

## Introduction

In the recent year, there has been a surge in new folding schemes. The one that is most dear to my heart is Protostar (I have described something really close in my previous post, and I’ll try to bridge this gap because the language I’ve used back then was imprecise, and now these ideas are much cleaner).

Protostar has a significant disadvantage, though - it functions poorly when the amount of rounds is large (even if each round’s communication is actually quite small). This is clearly an issue for protocols that are based on sumcheck, that have excellent communication complexity, but also significant round complexity.

I will describe a particular way of circumventing this issue.

I would like to point out that what I will be describing is NOT hypernova, but it remains to be seen whether these ideas can be combined in some way.

My original applications was using Lasso-styled lookup inside of a folding IVC, but there are also other applications.

### Fixing the terminology

Protostar is a large paper, including many different tricks. When I talk about the Protostar here, I mean the folding scheme which folds any (special sound public coin) protocol with verifier of algebraic degree d (which is equivalent to saying that we have an algebraic circuit represented as a system of equations of degree d, split into rounds, and with challenges sampled after each round).

This scheme is exactly what is described in particular cases in “folding endgame” post (though I prefer thinking about it in a “circuit” terms, and Protostar paper uses “protocol” terminology).

I will refer to the system of equations of degree d as “circuit in general arithmetization”. Protostar folding scheme doesn’t care about particular nature of these equations as long as they are of degree d.

Protostar paper also includes an ingenious trick which allows to remove most of the work devoted to computing MSM for cross-terms. I will refer to it as Protostar-transform, and will never explicitly invoke it; but it should be noted that everything that I’m saying is perfectly compatible with applying Protostar-transform as a last step.

The principal cost of Protostar-transform is 1 additional round.

### At the MSM barrier

Before we proceed, we need to familiarize with the landscape.

I suggest the intuitive notion of an “MSM barrier” - the fact that currently existing succinct proof systems require the prover to *at least* commit to the witness. Excluding FRI-based schemes (and focusing on homomorphic linear ones) we see that prover must expend at least time required to compute MSM of the size of the witness.

I will also make a case that the schemes I want to discuss are necessarily recursive or at least have linear prover. Say, Groth16 or PlonK technically are at the MSM barrier, but have n\log(n)-sized FFT computation which prevents them from scaling too much.

**Nova IVC** is not at MSM barrier, as its ec ops cost (excluding recursion overhead) is not only committing to witness, but also committing to cross-terms. These are rhs-sized, so each folding incurs additional MSM of size g, where g is the amount of equations.

**Sangria IVC**, a direct extension of Nova IVC to higher degrees, increases this cost dramatically, because it has (d-1) cross terms, each requiring an MSM of size g.

**Protostar-transform IVC** touches the MSM barrier. The reason for this is that transform converts the system of g degree d equations into 2 \lceil \sqrt{g} \rceil equations of degree 2 and a single large equation of degree d+2. The cross-terms are then computed separately for gates of degree 2 (which involves a negligible O(\sqrt{g})-sized MSM), and there are additional d+1-cross terms communicated without commitment (and even if we would commit to them, it would be a constant cost).

The principal cost of Protostar-transform is an additional round, and rounds in Protostar are cheap (almost free) for the prover, but increase the communication with verifier (as Protostar prover must commit to each round separately). In the IVC context this translates back directly as a prover cost, blowing up the recursive overhead r times, where r is the amount of rounds.

**Hypernova** - afaik, it is on the MSM barrier, but I am not sure. I do not understand exactly how recursive overhead looks, too.

### Beyond the MSM barrier

Somewhat surprisingly (and, personally, counter-intuitively for me, as I’ve always followed the MSM blueprint), there are already protocols on the other side of the MSM barrier.

As a sidequest, Protostar-transform almost cuts it - if we choose the gates of high degree, the witness size shrinks. However, not only this is increases the amount of cross-terms, it, crucially, takes more time to compute them. The advantage on witness size depends on the exact structure of the circuit, if the circuit is sequential, it will be logarithmic (to give a quick example, if we use 25-degree (2-round) gates in power=5 poseidon, we get 6x shrinking in witness, and 25x slowdown in computation of the witness. If we use 3-round gates, we get 9x shrink and 125x slowdown on the other side).

So it gives *some* advantage, but it is very limited. But mainly I want to talk about protocols based on sumcheck.

Lasso is a good example of what’s possible; the principal speedup stems from the “sparse commitment scheme” (actually <sparse, dense> inner product argument proof) called SPARK, which uses as the main tool the GKR protocol (I will describe it in the next sections, but here is the reference).

One could wonder if it is possible to fold GKR. Technically, it has a linear verifier, but a lot of things can be batched. So, while what I will describe is applicable to any sort of multi-round protocol, I will have GKR specifically as an example (and SPARK as specific instantiation).

## Multi-round protocols - folding diagonally.

### Notation

I have a witness vector W, which is a concatenation of “rounds” W = W_1 \# W_1 \# ... W_{r} . There are also challenges P_0 \# P_1 \# ... \# P_{r-1}, with P_0 denoting vector of public inputs, and other P_{i}-s - challenges occuring after round i.

There is a system of equations e_i(W, P) = 0, *homogeneous* of degree d. (If the system is not homogeneous, we can always reduce it by introducing an additional public input called relaxation factor u in Nova paper, setting it to 1 and multiplying monomials of degree i by powers u^{d-i}). Further, I will never refer to it explicitly, and just assume WLOG that the system is homogeneous.

A **committed instance** corresponding to such circuit is the following tuple:

Sequence of commitments [W_1], ..., [W_r], public input P_0 and sequence of challenges P_1, ..., P_{r-1} *sampled using Fiat-Shamir* from sequence of [W]-s.

There is also a *relaxed* version of this system, which has 2 changes:

- There is a right-hand side, called error vector E, satisfying e_i(W,P) = E
- Challenges are now not assumed to be generated via Fiat-Shamir, and are treated as public inputs.

A **committed relaxed instance** is the following data:

[W_1], ..., [W_r], public inputs (and “challenges” now treated as such) P_0, ..., P_{r-1}, commitment to the error vector [E].

**Satisfying witness** to the committed instance is a vector W such that [W_i]-s are commitment to W_i-s, and the system is satisfied. For the relaxed version it is, predictably, a pair [W], [E] with analogous properties.

In terms of Hypernova paper, what I will describe next is a (1,1)-multifolding scheme - i.e., it takes 1 committed instance and 1 committed relaxed instance, and returns 1 committed relaxed instance. I think this is a convenient terminology.

### Protostar folding

A reminder on how Protostar (non transformed) works - it is very similar to Nova, and the main trick is just pretending that challenges are public inputs. References for particular early cases of this trick are Origami / Folding Endgame aka moon-moon.

Let’s denote relaxed instance Z_1 = (...[W_i]^1..., ...P_i^1..., [E]^1), and instance Z_2 = (...[W_i]^2..., ...P_i^2...).

Prover1: Send Z_1, Z_2, and commitments to cross-terms [T_i] defined as follows:

e(W^1 + xW^2, P^1 + xP^2) = e(W^1, P^1) + x T_1 + x^2 T_2 + ... + x^d e(W^2, P^2)

Verifier1: check that challenges in Z_2 generated correctly. Outputs random challenge t.

The folded relaxed instance is now (... [W_i]^1 + x[W_i]^2 ..., ...P_i^1 + xP_i^2..., [E]^1 + \underset{k=1}{\overset{d-1}\sum} t^k [T_k]).

We can also fold 2 relaxed or 2 non-relaxed entries in analogous fashion. It is important that challenges for non-relaxed entries are checked to be obtained via Fiat-Shamir.

**Message passing trick**

I do not want to go into moon-moon-ish constructions here in details, but there is a particular useful trick which is useful if we need to copy some data from one instance to another. This is relatively important, as my folding protocol will need to pass a lot of messages between different steps of the folding sequence, and it is uneconomic to do it through public inputs.

The trick is as follows: let’s introduce an additional “dummy round” without challenges, and copy data that we need to pass there (via our system of equations). Then, in a different (also non-relaxed) instance require that the commitment to this dummy round is the same, thus, ensuring that data also coincides.

If we need to pass some messages forward during sequential computation, we actually need to introduce 2 “dummy rounds”, one to receive the data from the previous step, and another one to pass forward.

This optimization is optional (and I will pretend that data is passed through public inputs), but I think is generally desirable.

### Folding for multi-round protocols.

From now, I will assume that there is an underlying algebraic protocol \Gamma (and corresponding circuit), with two additional requirements - all equations only involve two neighboring rounds, and there is the same amount of challenges in each round. We assume that the amount of rounds is r.

A lot of protocols already have this form or can be easily coerced into such - for example, sumcheck almost has such form, with only exception that we need to somehow pass all the challenges downwards for the final check. The challenges will be passed with the different construction, anyways.

Then, we will use circuit C with two rounds, such that each round hosts, in parallel, r different runs of a protocol \Gamma. Second round passes the state to the 1st round of the next step (via public output API or message-passing trick).

Now, there are two approaches, with various tradeoffs:

**Direct parallelization approach**

Folding sequential computation must be arranged in such a way, that r protocols are started at the same time. After r folding steps, they have all ended.

This has a bit of a problem with a load balancing for sumchecks, because all rounds but the last are actually quite small. The additional cost of passing challenges is r witness size in each step.

**Delayed start approach**

Initialize the 1st protocol in 1st step, then 2nd in the 2nd step, et cetera. This has no load balancing problem at all (as in the step k somewhere in the middle we are doing 1st round of k-th protocol, 2nd round k-1-th protocol, …, r-th round of k-r+1-st protocol).

This system requires k+r folding steps to process k protocols with r rounds.

### Cost analysis

The principal overhead on the witness size is 2x (counting the send-receive circuitry), except of the first and last round. Practically, last round in sumcheck is by far the largest, so we do not really care about this part. It also incurs static O(r) increase in witness in each step.

The choice between direct parallelization and delayed start approaches depends on the exact nature of a protocol. Delayed start approach has an undesirable property that 1st run gets challenges \gamma_1, ..., \gamma_r, second run challenges \gamma_2, ..., \gamma_{r+1}, et cetera. This can make batching of the final openings harder.

## Applications

TBD - this section is incomplete

I believe the main applications are Lasso and plain GKR, with intent of breaching MSM barrier. The principal cost of GKR is

- Committing to the inputs O(w) to the layered circuit of depth d and width w.
- Running O(d*\log(w))-round sumcheck.
- Opening d “state transition” polynomials, each multilinear in 2 \log(d) variables.

The only thing bothering me here is 3, but it seems that in direct method we batch for free r times because of opening in the same sequence of challenges, and then can batch even more using normal folding (these polynomials are multilinear extensions, so, being multilinear, they have a total degree at most 2 \log(d) - which means that such polynomial can be treated as a single Protostar gate, and normal folding applies).

Delayed start method is likely worse, but is also manageable (and, once again, load balancing seems to be an issue). Though maybe we could get best of both worlds using some sort of Supernova / Protostar-0-witness-is-free tricks.