Chapter 10
Clifford+T

In the previous chapters, we have often seen a dichotomy between two classes of quantum computations. On the one hand, we have Clifford computations, which have a great deal of structure we can exploit to efficiently solve problems like circuit synthesis, deciding equality of computations, and strong classical simulation. On the other hand, we have looked at universal quantum computation, typically arising from adding Z[α] gates for arbitrary angles α [0,2π]. While Clifford angles, i.e. integer multiples of π 2 , satisfy many identities, we have so far treated non-Clifford angles as “black boxes”, which don’t seem to satisfy any extra rules, except for trivial ones coming from the spider fusion rule like

and variations thereof. It turns out that for generic angles, this is pretty much all we can do. In fact, as we’ll explain in the References, there is a way to make this statement precise. However, for dyadic angles, i.e. angles of the form π 2k for some integer k, there is a great deal more structure at play. As we’ll see in this chapter, we can take advantage of this dyadic structure in a variety of ways. First, we will see that it simplifies the problem of synthesising generic unitary maps using just Clifford gates and the first non-Clifford dyadic phase gate T := Z[π 4 ]. We can characterise the set of unitary matrices that we can synthesise exactly using Clifford+T gates as those whose entries are all within a certain subset, called 𝔻[ω], of the complex numbers. Using some special properties of this set, and a little (light) ring theory, we can figure out precisely how many gates we need to synthesise such a matrix exactly and do this synthesis efficiently. Note, here “efficient” means efficient in the size of the matrix, not the number of qubits. For generic unitaries, this is the best we can hope for. We’ll also see that for any tolerance 𝜖 > 0, we can approximate any unitary matrix within 𝜖 with a 𝔻[ω]-matrix, which we can in turn synthesise exactly using Clifford+T gates. Second, we will see that for dyadic angles, new rules start to hold that wouldn’t for generic angles. In particular, certain complex configurations of phase gadgets can all cancel each other out in ways that are not implied by the gadget-fusion law we met back in Chapter 7. These so-called spider nest identities come from a particular interaction between the parity (i.e. mod-2) structure of the phase gadget itself and the mod-2k structure of its angle. Similar to the representation from Chapter 7 of CNOT+phase circuits, we can represent phase gadgets as collections of binary vectors representing parities of qubits where phases get applied, and as before we can stick these vectors together into a matrix. We introduce a new Scalable ZX notation so that we can directly represent these matrices in an efficient way in our diagrams. Using these matrices representing collections of phase gadgets we can also recognise which configurations of π 4 phase gadgets will cancel out. Namely, it will be those whose “gadget matrix” satisfies a condition called strongly 3-even. In this chapter, we will work out some properties of strongly 3-even matrices and a closely related 𝔽2-linear structure called Reed-Muller codes. Finally, we will put the pieces together by classifying all the spider nest identities and explaining how they can be used for T count optimisation. Clifford+T circuits are especially important for fault-tolerant quantum computation, so in this chapter we will also be (secretly) laying the groundwork for Chapter 11, where we explain techniques for implementing universal quantum computation within a quantum error correcting code. In particular, in Section 10.5 we will see how we can relate circuits over different types of gate sets together using a technique called catalysis that allows you to perform certain ‘hard’ operations in an easy way as long as you have a suitable catalyst state lying around. Together with the classification of spider nest identities this will prove very handy for when we want to implement non-Clifford gates in fault-tolerant architectures.

10.1 Universality of Clifford+T circuits

Back in Chapter 2, we noted that CNOT gates plus arbitrary single-qubit unitaries are universal for quantum computation, in the sense that they can be used to build arbitrary unitaries over n qubits. In this section, we will show that, in fact Clifford+T circuits are approximately universal, in the sense that they can get arbitrarily close to any n-qubit unitary. One way to show this, is to show how for any single-qubit unitary U there exists some U arbitrary close to U expressed totally in terms of H and T gates. There are several ways to prove this. The “classic” way is to show that V := HT corresponds to a rotation about some axis of the Bloch sphere by an irrational multiple of π. Then, by raising V to larger and larger powers, we will eventually land close to any possible rotation around that axis. We can do the same around some other axis, e.g. with V = TH, to obtain a pair of rotation gates that suffice to build any single-qubit unitary. It takes quite some work to spell out the details of this argument, and this has been done in several standard textbooks on quantum computing. There are also many variations one can use to obtain more or less efficient decompositions of single-qubit unitaries. We’ll give some pointers to where you can find all the gory details of this approach and variations at end of this chapter. However, in this section, we will start with a totally different approach to synthesis of unitaries in Clifford+T, which is based on number theory. This approach starts from the realisation that the numbers appearing in a unitary matrix built from CNOT, H, and T are not just any old complex numbers, but are actually quite special. First, lets have a look at the matrices again:

CNOT = ( 1 0 0 0 0 1 0 0 0 0 0 1 0 0 1 0 )H = 1 2 ( 1 1 1 1 )T = ( 1 0 0 e4 )

Clearly any U that we can construct from these gates will have as its entries sums and products of integers, 1 2, and the complex phase ω := e4. If we think about where ω lies on the complex plane:

(10.1)

we see that ω = 1+i 2 and ω¯ = ω3 = 1i 2 . Using these facts, it’s not hard to see that we can already build 1 2 using just integers, 1 2, and ω:

1 2(ω ω3) = 1 2

If we look at just the numbers we can build with integers and 1 2, this consists of precisely the rational numbers whose denominator is a power of two. We give this set of numbers a special name.

Definition 10.1.1. The dyadic rational numbers 𝔻 consist of all rational numbers of the form z 2k for z,k .

Clearly 0,1 𝔻, and for q,r 𝔻, q,q + r and qr are also in 𝔻. Hence, 𝔻 forms a ring. However, unlike the full set of rational numbers, not every q 𝔻 has a multiplicative inverse in 𝔻, so it is not a field. It turns out that, for the purposes of circuit synthesis, this is not a bug, but a feature, since rings can have very interesting properties owing to the fact that we cannot arbitrarily divide numbers by each other. To discuss the single-qubit synthesis algorithm, there are two relevant rings based on 𝔻 we need to study. The first is 𝔻[ω], the ring obtained by allowing arbitrary sums and products of dyadic rationals with the complex number ω. The second is [ω] 𝔻[ω], which is restricted just to integer multiples of powers of ω. Such rings are called ring extensions, i.e. rings obtained by adding one or more elements outside of the original ring and closing under sums and products. Since ω4 = e = 1, we can concretely represent elements of [ω] and elements of 𝔻[ω] using quadruples of numbers taken respectively from or 𝔻:

[ω] := {a + + cω2 + dω3|a,b,c,d } 𝔻[ω] := {a + + cω2 + dω3|a,b,c,d 𝔻}

Exercise 10.1. Define ring addition and multiplication for [ω] and 𝔻[ω] as operations acting on quadruples (a,b,c,d) of numbers respectively taken from and 𝔻.

Clearly any unitary matrix built out of CNOT, H, and T will consist of elements from 𝔻[ω]. Perhaps surprisingly, the converse is also true: any unitary matrix whose elements are in 𝔻[ω] can be constructed exactly as a composition of CNOT, H, and T gates. We will show this by giving a concrete algorithm for synthesising unitaries over 𝔻[ω] in the following sections, then conclude with some remarks on how this lifts to an approximate synthesis algorithm for unitaries over all complex numbers.

10.1.1 Exact synthesis of one-qubit gates

We’ll begin by considering the simplest non-trivial synthesis problem we can have: namely preparation of an arbitrary single-qubit state |ψ. That is, we want to find a sequence of H and T to transform |0 into |ψ. Equivalently, we can find a sequence of H and T transforming |ψ into |0, then take the adjoint. One way to do this is starting with a state whose entries are in 𝔻[ω], and then applying gates to it until all of the entries are in [ω]. The following lemma should make clear why this helps us.

Lemma 10.1.2. If |ψ is normalised and all of its entries are in [ω], then it must be a computational basis vector, up to a global phase.

Proof. For any element z = a + + cω2 + dω3 in 𝔻[ω], we can get an explicit form for |z|2 = z¯z in terms of 2 = ω ω3:

|z|2 = z¯z = (a2 + b2 + c2 + d2) + (ab + bc + cd da)2.
(10.2)

If we suppose that all the a,b,c,d are integers, then the only way to have 0 |z|2 1 is when the 2 part here is negative. Let |ψ = iψi|i be a normalised state with entries in [ω]. Writing |ψi|2 = ni + mi2 where ni,mi and ni 0 and mi 0, we have 1 = ψ|ψ = i|ψi|2 = i(ni + mi2). Since we know that the total sum is 1, the 2 components should cancel: imi = 0. But all the mi are negative, so the only way this can happen is if all mi = 0. Similarly, since all the ni are positive integers and they sum to 1, they must all be zero except for one nj which is equal to 1. But then exactly one of aj,bj,cj and dj is ± 1, and the rest are zero. Hence |ψ = ωk|j for some i,k. □

If we get to ωj|0, mission accomplished, up to a global phase. If we get to ωj|1, we simply apply an X gate (i.e. HT4H) and again we are done. So, the name of the game is turning the coefficients of |ψ into elements of [ω]. One potential strategy is to factor |ψ as:

|ψ = 1 2k (x|0 + y|1) for x,y [ω]

then apply gates to |ψ to try and make the coefficients x and y into even numbers. Then, we can factor out a 2 from x and y and the leading scalar becomes 1 2k1 and we’ve made some progress toward getting a state whose coefficients are in [ω]. This does work, but it is a bit tricky to find the gates we need to apply to |ψ to “make progress”. We can make life a bit easier if we express |ψ differently, as:

|ψ = 1 δk (x|0 + y|1) for x,y [ω]
(10.3)

where δ := 1 + ω is a “special” number that has some nice number-theoretic properties that will help with our synthesis algorithm (for more details on the number theory side of things, check out the advanced section 10.7.1). First, we should be able to check that |ψ is indeed a vector over 𝔻[ω]. For this, we should check that 1 δ 𝔻[ω], which is not immediate because not every element in 𝔻[ω] has a multiplicative inverse. The elements of a ring that do have multiplicative inverses are called units. We can see that δ is a unit in 𝔻[ω] by letting δ1 := 1 2(1 ω + ω2 ω3) and calculating δδ1 = 1. Conversely, we would like to know that any |ψ with coefficients in 𝔻[ω] can be written in the form of (10.3). This is true if any q 𝔻[ω] can be written as x δk for some x [ω] and a high enough power k of δ. Put another way, we need to prove the following property of δ.

Exercise 10.2. Show that, for any q 𝔻[ω], there exists k such that δkq [ω]. Hint: Using the fact that 2 = ω ω3, first show that δ2 = λ2, where λ := 1 + ω + ω2 is in [ω].

The smallest k needed to express |ψ in the form of equation (10.3) is called the least denominator exponent (lde). If the lde is 0, then |ψ is already has coefficients in [ω], so by Lemma 10.1.2, it must be a basis state, up to a phase. It is easy to see that k is minimal precisely when x and y are not both divisible by δ, i.e. when there exists no pair a,b [ω] such that = x and = y. We can easily check divisibility by δ using δ1. Since δ1[ω], then it is not necessarily the case that zδ1 [ω] for some x [ω]. In fact, it will be in [ω] precisely when δ divides x. The final piece of the puzzle is the following lemma, which lets us decrease the lde by applying H and T gates.

Lemma 10.1.3. Let |ψ = 1 δk(x|0 + y|1) be a state where x,y 𝔻[ω] are non-zero. Then there exists some l {0,,7} such that HTl|ψ = 1 δk(x|0 + y|1) for x,y divisible by δ.

Since x,y become divisible by δ, we can factor a δ out and reduce the least denominator exponent by 1. If we repeat this process over and over, eventually the lde will be 0, so by Lemma 10.1.2 it will be a basis vector. The reason why Lemma 10.1.3 works has to do with the special behaviour of δ within the ring [ω]. Namely, if we start computing numbers modulo δ (or powers of δ) in [ω], we will see that x and y can always be broken down into a particular form that tells us how many T gates to apply to make progress. To understand this, we will need to define the notion of a residue class, which lets us formalise what it means to work “modulo” some element of an arbitrary ring. We will do this and provide a proof for Lemma 10.1.3 in Section 10.7.1. However, if we believe the lemma, we now have a synthesis algorithm for qubit states. In fact, this also already gives a synthesis algorithm for single-qubit unitaries. This is because the columns of a unitary matrix must be orthogonal, so if we send the first column to a basis vector, the second column will also get sent to a basis vector, up to a phase. So after transforming the basis vector to the correct positions and phases, we will have synthesised the entire single-qubit unitary. We summarise this procedure in Algorithm 4.

_____________________________________________________________________ Algorithm 4: Exact synthesis for qubit Clifford+T gates___________________ Input: A unitary U with elements in 𝔻[ω] Output: A list of H and T gates implementing U
1..

Let |ψ and |ϕ be the columns of U. Since U is unitary, |ψ and |ϕ must be orthogonal.

2..

Express |ψ as:

|ψ = 1 δk (x|0 + y|1) for x,y [ω]

where k is the least denominator exponent.

3..

Try to apply HTl for all l {0,,7} to x|0 + y|1 to obtain x|0 + y|1 where δ divides x and y.

4..

Factor out a δ to obtain:

|ψ = HTl|ψ = 1 δk1 (x|0 + y|1)
5..

Repeat until the lde is 0 to obtain a sequence of gates sending |ψ to ωj|i. Optionally, apply a final X gate to send |ψ to ωj|0.

6..

Since unitaries preserve orthogonality, the same sequence of gates will send |ϕ to ωj |1. Perform a final Tjj to remove the relative phase between |0 and |1. Then GsG1U = ωjI.

7..

Return G1Gs, which implements U up to a global phase.

______________________________________________________________________________________

Example 10.1.4. Consider the following unitary over 𝔻[ω]:

U = 1 2 ( 1 ω3 1 + ω ω2 + ω3 1 + ω3 )

We need to first express U in terms of a [ω] matrix multiplied by 1 δk, where k is the lde of U. This can be accomplished by multiplying U by δ repeatedly until we get a matrix U whose entries are in [ω]. For our chosen U, we needed to multiply by δ three times to get a [ω] matrix, giving us the following expression:

U = 1 δ3 ( 2 + 3ω + 2ω2 1 ω + ω3 ω ω2 ω3 2 3ω 2ω2 )

So, we have an initial lde of 3. We then try to apply Tl for some l {0,,7} followed by an H gate to reduce the lde. Picking l = 0 (i.e. just applying an H gate) does not reduce the lde, but l = 1 reduces the lde from 3 to 2:

The lde is not zero yet, so we do another iteration, this time finding we can reduce the lde at l = 3. Namely, if we apply T3 followed by H, we reduce the lde from 2 to 0:

Our matrix now contains only elements of [ω], so we know the first column must be a basis vector, up to a phase. We find it is ω3|1. We can change it to ω3|0 by applying an X gate:

Finally, we finish by correcting the relative phase with a final application of T3, obtaining identity, up to a global phase:

Moving everything but U to the RHS, we conclude that:

10.1.2 Approximating arbitrary single-qubit gates

Suppose we want to use the exact synthesis algorithm from the previous section to approximate arbitrary, single-qubit unitaries. We now know that we can build any 2 × 2 unitary matrix over the ring 𝔻[ω], and it is not hard to see that any complex number can be approximated to arbitrarily high precision by some element of 𝔻[ω]. Indeed, if we take a 2j + b 2kω2 = a 2j + b 2ki for a,b and some suitably high values of j,k , we can get as close as we like to an arbitrary complex number. We seem to be most of the way there on coming up with an approximate synthesis algorithm. The problem is, if we go through a complex-valued unitary matrix U element-by-element and approximate each complex number with an element of 𝔻[ω], odds are we won’t get something unitary but just very nearly unitary. So we need a better idea. Like in the exact synthesis case, inspiration comes from looking at the least denominator exponenent. The idea is, for a target unitary matrix U and a fixed error bound 𝜖 > 0, we progressively raise the denominator exponent k until we can find some V = 1 δkV where

1..

V is a matrix over [ω],

2..

V is unitary, and

3..

for some global phase α, V eU 𝜖.

It turns out condition 2 is already quite restrictive. Since V is a unitary, its columns need to be normalised. Hence, the norm-squared of the columns of V must be |δ|2k. Using this fact, we can prove that for fixed k, there are only finitely many V to choose from.

Exercise 10.3. Show that, for a fixed k , there are only finitely many x,y [ω] such that |x|2 + |y|2 = |K|2 for any constant K [ω]. Show that this implies there are only finitely many matrices V over [ω] such that V = 1 δkV is unitary for any fixed k. Hint: Use (10.2) to get an explicit form for |x|2 and |y|2 in terms of their integer coefficients.

Since we already know that there are finitely many V satisfying conditions 1 and 2 above for fixed k, we know that there must also be finitely many V satisfying all 3 conditions. Hence, we can enumerate them for a fixed k. If we find a solution, we accept it. Otherwise, we increase k and try again. It turns out that there is a way to do this enumeration of candidates satisfying conditions 1–3 efficiently, and that it terminates for k = O(log 1 𝜖). The way this works has to do with properties of certain discrete subsets of the complex plane, which are a bit advanced for our purposes. Hence, we’ll finish our story on synthesis here, and make a few more remarks about this in the advanced Section* 10.7.3.

10.2 Scalable ZX notation

So far in this book we have used ZX-diagrams where each wire represents a single qubit. As we have seen, this already allows us to do a lot of interesting stuff. However, sometimes there is repetitive structure in the diagram that we should really try to abstract away, so that we don’t miss the forest for the trees. The way we do that will be with scalable ZX notation. This notation allows us to compactly represent operations on registers of many qubits, while still maintaining much of the flavour of calculations with standard ZX-diagrams. We will represent a register of qubits as a single thick wire:

(10.4)

Sometimes we will need to split a register into two registers, peal one qubit off to do something with, or merge registers together again. For that we introduce a little bit of extra notation:

(10.5)

We call these operations divide and gather. They don’t actually do anything as linear maps (they are just the identity), but they help us compactly write down more complicated maps. They come with some simple rewrites:

(10.6)

Note that in general we will only label the thick wires with the number of qubits they represent when it is necessary for clarity, and leave it implicit otherwise. Just being able to represent a qubit register as a single wire is not that useful, so let’s introduce some things we can do with them. We will use bold spiders (with a thick border) to represent a product of (unconnected) copies of a Z or X spider as a bold spider:

(10.7)

Note that in this notation each thick wire connected to the same spider needs to represent the same number of qubits, otherwise it is not a valid diagram. These bolded spiders can be labelled by a phase, and this phase then gets applied to each qubit separately. As the bolded Z- and X-spiders represent non-interacting parallel spiders, all the standard ZX rewrites still apply to them, and hence we can still do spider fusion, strong complementarity, etc. We can push spiders through a dividers and this in fact gives us a type of bialgebra between the Z/X spider and the dividers and gatherers (cf. Section 3.2.4):

(10.8)

Note that this also works when the spider has no additional outputs:

(10.9)

The most important component of the scalable notation, and what makes all of this worthwhile, is a new piece of notation called the matrix arrow, or just arrow for short, which allows us to represent arbitrary connectivity from m Z-spiders to n X-spiders using an n × m biadjacency matrix A 𝔽2n×m:

(10.10)

Taking the convention that Aij represents the entry in the i-th column and j-th row of the matrix A, we have in Eq. (10.10) that Aij = 1 if and only if the i-th Z-spider on the left is connected to the j-th X-spider on the right. Just using a single matrix arrow we can hence write down an arbitrary phase-free ZX-diagram in parity normal form (Definition 4.2.2), by directly writing the biadjacency matrix A on the matrix arrow. Concretely, Eq. (10.10) corresponds, up to scalar factors, to a linear map that acts as A on computational basis vectors:

(10.11)

Note that we treat the bit string b as a column vector for the purposes of matrix multiplication. Let’s consider some special cases for the matrix arrow, an all zero matrix and the identity:

(10.12)

Using Eq. (10.11), we can see that when we compose matrix arrows, we just calculate the matrix product:

(10.13)

We can also prove this diagrammatically by unrolling the definitions and using strong complementarity on all the internal spiders (this is essentially doing the reduction to parity normal form from Chapter 4). Spiders and arrows interact in nice ways. First, we have two “copy” laws allowing us to push arrows through spiders:

(10.14)

These can be proven by unrolling the definitions and using strong complementarity and spider fusion.

Example 10.2.1. Let A = ( 1 0 1 1 ). Then:

Second, we can express block matrices in terms of spiders:

(10.15)

We can combine these operations to get a decomposition of a matrix into smaller matrices:

(10.16)

We could view this as the definition of the matrix arrow, because this allows us to define it inductively, starting from trivial components and combining these into larger matrices. The block matrix and composition rules also imply a graphical rule for the sum of two matrices.

Proposition 10.2.2. For any 𝔽2 matrices A,B, we have:

Proof. Starting from the LHS, we can decompose A = AI and B = BI and apply (10.13) and (10.15):

10.2.1 Scalable phase gadgets

Collections of phase gadgets turn out to look quite simple when using scalable ZX notation. First, let’s see how we can represent a single phase gadget:

(10.17)

Here 1T is the row vector (1 1), and hence the matrix arrow represents a collection of Z-spiders connected to a single X-spider, as needed. If instead the phase gadget is only connected to a subset of the wires, then we can replace 1 by a vector v where vi = 1 iff the gadget is connected to the ith qubit. We can then see how we can represent a composition of phase gadgets in scalable notation:

(10.18)

This construction generalises to any number of phase gadgets, for which we then get:

(10.19)

In this k × n matrix M each of the k rows describes a phase gadget, and each of the n columns corresponds to a qubit. Hence, Mij = 1 iff the jth phase gadget is connected to the ith qubit. Note that we can prove the gadget fusion rule of Section 7.1.2 in the scalable setting using what we have seen. This applies when we have two phase gadgets with the same connectivity, and hence the same row appears twice in the matrix:

(10.20)

Conversely, if we have gadgets with phases that are not π 4 , but multiples of that, we can also represent them like (10.19), by unfusing them into additional rows of the matrix (doing (10.20) in reverse).

10.3 Rewriting Clifford+T diagrams

In Chapter 9 in order to construct the Toffoli and CCZ gate using more low-level gates, in particular T gates, we used a Boolean Fourier transform to switch from the multilinear phase polynomial (1)xyz used in the CCZ gate to a phase polynomial built out of XOR terms that can be constructed using phase gadgets. In essence this all boiled down to the equation:

x y z = 1 4(x + y + z x y x z y z + x y z)
(10.21)

We used this equation to argue for the following diagrammatic equality:

(10.22)

But there is nothing special about the phase of the CCZ gate here, and in fact we can write a similar equation for a CCZ(α) gate:

(10.23)

Now when α = 0 both sides of this equation are obviously equal to the identity (just copy some spiders and cancel some identity spiders). But this should then also hold for α = 2π, and then this fact becomes less obvious. On the left-hand side we then have ei2π = 1 so that it is still the identity, but on the right-hand side we get a bunch of e±i2π4 = e±2 phases. In that case we can show this by using the Y eigenstate identity of Exercise 3.15, the Euler decomposition of the Hadamard and local complementation:

(10.24)

In this case we can hence still prove this identity with the tools we have already seen. But when we generalise Eqs. (10.21) and (10.23) to work with more than 3 wires we start getting new and very useful identities. As the number of XOR terms blows up exponentially it will be helpful to introduce a slightly more compact way to talk about them. Note that we can represent a parity like x1 x3 x4 with a bit string y = 1011 since y x = y1x1 y2x2 y3x3 y4x4 = x1 x3 x4. We can hence write Eq. (10.21) more compactly as

x1 x2 x3 = 1 4 y0(1)|y|y x.

Here x = x1x2x3 and the sum over the y goes over all the bit strings 𝔽23 except for 0 = 000. We can then easily write down the generalisation of this equation to n variables x1,,xn as follows:

x1 xn = 1 2n1 y0(1)|y|y x.
(10.25)

Now if we take n = 4, and we consider the CCCZ gate, which applies the phase polynomial ex1x2x3x4, then applying Eq. (10.25) would result in a bunch of phase gadgets with a phase of ±π 8 . But this is the Clifford+T chapter, so we want ±π 4 phases. We can get those by instead considering the trivial phase polynomial e2πix1x2x3x4. This then results in a constellation of 24 1 = 15 ±π 4 phase gadgets:

(10.26)

While this might look like some sort of confusing alien spacecraft, there is some order to the picture above: it contains all the possible phase gadgets on four qubits: all those with one leg (the π 4 phases directly on the qubit wires), two legs, three legs, and the single four-legged one. All the gadgets with an odd number of legs have a phase of π 4 , and all the gadgets with an even number of legs have a phase of π 4 . Eq. (10.26) is a genuinely new diagrammatic equation, a type of equation we call a spider-nest identity. As we will see in this chapter and the next, there are many uses of such equations. As a first application, note that if we bring the four-legged phase gadget to the other side of the equation that this says that whenever we have a four-legged gadget with a ±π 4 phase, we can replace it by a collection of 14 three-, two- and one-legged phase gadgets involving all of the four-legged phase gadget’s legs. In fact, we can decompose any n-legged phase gadget with a phase of ±π 4 into a collection of phase gadgets with a most 3 legs. For instance, when n = 5:

(10.27)

So while we would a priori think that we could need O(2n) different phase gadgets, one for each possible parity, we see that we actually only need O(n3), only those with at most three legs. A different use-case for Eq. (10.26) is that we can use it to optimise the number of T gates needed for a circuit. When we have a collection of phase gadgets with π 4 phases we can look for any subset of four qubits that has many phase gadgets and then use a version of Eq. (10.26) where we have those phase gadgets on the left and all the other parities on the right. Then by applying this equation we essentially ‘toggle’ which phase gadgets on these four qubits were present. As long as we started out with at least half of all the possible phase gadgets present, so at least 8, we end up with fewer phase gadgets. If we had 8 phase gadgets, then we get 15 8 = 7 phase gadgets at the end. If we had 10, then we end up with 15 10 = 5 of them. The exact phases, whether + π 4 or π 4 is not important for this, since if the signs don’t match, this just introduces a π 2 Clifford phase gadget.

Exercise 10.4. In Eq. (10.26) the two-legged and four-legged phase gadgets have a π 4 phase. Show that by unfusing a π 2 phase gadget from the four-legged one and applying a set of rewrites similar to those in Eq. (10.24) that we can rewrite it to a collection of phase gadgets all having a + π 4 phase.

10.3.1 Spider nests as strongly 3-even matrices

Using scalable ZX notation, we can represent a collection of phase gadgets using a parity matrix to describe the connectivity of the gadgets as in (10.19). The 4-qubit spider nest identity of Eq. (10.26) with 15 gadgets (more specifically, the modified version of Exercise 10.4 where all the phases are + π 4 so that we can get rid of the repeated rows) then corresponds to the following 15 × 4 matrix:

( 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 )T
(10.28)

Note that we write here the transpose of the matrix to use the space on the page a bit better. Any spider nest identity will be an identity of the form:

(10.29)

This means that when hunting for such identities, we are really looking for a particular type of boolean matrix. To find out what kind of matrices we need, we need to look at the pseudo-Boolean Fourier transform again. This time however, we want to translate phase gadgets back into controlled phase gates. The ‘inverse’ of the equation (10.25) which relates an AND to a sum of XORs is given by:

x1 xn = S[n](2)|S|1 iSxi.
(10.30)

Here [n] := {1,,n}, and |S| is the number of elements of S. Hence, each term iSxi = xi1 xik has a weight that is a power of 2 in the decomposition. Now, if we are considering this in a phase polynomial, where our phase gadget has a coefficient of π 4 , then in the decomposition the different terms will have a weight of ± 2|S|π 8 . Since these are phases, we are working modulo 2π so that when |S| > 3 each term is a multiple of 2π and disappears:

eiπ 4 x1xn = exp (i (π 4 jxj π 2 i<jxixj + π i<j<kxixjxk 2π)) = exp (i (π 4 jxj π 2 i<jxixj + π i<j<kxixjxk)) (10.31)

Hence, each phase gadget corresponds to a collection of T gates (the linear terms), CS gates (quadratic terms) and CCZ terms (cubic terms).

(10.32)

If instead of a single gadget we have a collection of phase gadgets, then we can add together the respective T, CS and CCZ gates of their decompositions. It turns out that such a circuit can only be equal to the identity if all the gates cancel in the trivial way. That is, each qubit should have a number of T gates that is multiple of 8 since T8 = I, each pair of qubits should have a multiple of 4 CS gates and each triple should have an even number of CCZ gates.

Exercise 10.5. Let C be a circuit consisting of T, CS and CCZ gates and suppose we have done all trivial cancellations as described above. Show that if C implements the identity, that then C must be the empty circuit. Hint: First consider input states |x with Hamming weight 1 to show using C|x = |x that there can’t be any T gates, then consider inputs with Hamming weight 2 to show there can’t be CS gates, and finally consider input states with Hamming weight 3 to rule out CCZ gates.

From Eq. (10.31) we see that if a phase gadget involves the qubit j, that then a T gate appears on that qubit. If it connects to both qubits j1 and j2 then there will be a CS gate on those qubits, and similarly a CCZ appears when the qubits j1,j2 and j3 are in the phase gadget. Hence, a collection of phase gadget implements the identity when each qubit is part of 0mod 8 gadgets (T8 = I), each pair of qubits is part of 0mod 4 gadgets (CS4 = I), and each triple is part of 0mod 2 gadgets (CCZ2 = I). We can formalise this cancellation property as follows.

Definition 10.3.1. We say a matrix M is strongly 3-even when

i : lMil = 0(mod 8) i < j : lMilM jl = 0(mod 4) i < j < k : lMilM jlM kl = 0(mod 2).

That is, when each column, product of pairs of columns and product of triples of columns has a Hamming weight that is a multiple of respectively 8, 4 and 2. We say M is 3-even when all three conditions only hold modulo 2.

When we represent a collection of π4 phase gadgets as a set of bit strings y1,,ym, where yil = 1 means the lth phase gadget is connected to the ith qubit and set the matrix Mij = yij we see that M is strongly 3-even precisely when the gadgets make a spider nest.

Proposition 10.3.2. Let M be a k × n boolean matrix. Then

if and only if M is strongly 3-even.

Being just 3-even, as opposed to strongly 3-even, is a weaker condition that means the collection of gadgets is not exactly equal to the identity, but instead are equal to some Clifford unitary: instead of the T gates exactly cancelling due to them appearing a multiple of 8 times, they only appear a multiple of 2 times on each qubit and hence combine into T2 = S gates, which are Clifford. Similarly, the CS gates appear an even number of times to create CZ gates, and the CCZ gates still completely cancel. For convenience we will also call a set of gadgets that is equal to a Clifford (and hence corresponds to a 3-even matrix) a spider nest identity. We can find which Clifford it implements by rewriting all the gadgets using Eq. (10.31). However, if we have gadgets with many legs this can involve exponentially many terms. There is also an efficient way to do it.

Exercise 10.6. Describe an efficient procedure that determines the stabiliser tableau for a circuit consisting of π 4 phase gadgets with the promise that it implements a Clifford unitary. Hint: It is diagonal, so all the Pauli Z’s trivially commute through. Pushing an X through the circuit however results in some additional π 2 phase gadgets that can be commuted to the end. If the unitary is Clifford these should all be decomposable into Paulis.

10.3.2 Proving all spider nest identities

We can now see when a collection of phase gadgets is a spider nest identity: check whether its associated parity matrix is strongly 3-even. Diagrammatically this is not very satisfying however, as it doesn’t tell us how to find strongly 3-even matrices or what is required to diagrammatically prove all these identities. In this section we will see that we can build all spider nest identities from one particular one. To do so, we need to go back to the 4-qubit spider nest Eq. (10.26) which we arrived at by decomposing the ‘trivial’ phase polynomial e2πix1x4 into phase gadgets using Eq. (10.25). We can similarly get a 5-qubit spider nest by decomposing e4πix1x5. This then will have 25 1 = 31 different phase gadgets. Note the factor of 4π instead of 2π which is needed to get a collection of ±π 4 phases instead of ±π 8 phases. We can in the same way build an n-qubit spider nest by decomposing e2n3πix1xn , which will result in a spider nest with 2n 1 phase gadgets with a ±π 4 phase. Now, let’s take the 5-qubit spider nest, and then ‘subtract’ the 4-qubit spider nest from it:

(10.33)

Here on the left-hand side we first have all the phase gadgets on five qubits, but then we subtract from those all the phase gadgets that don’t act on the first qubit. On the right-hand side we are then left with precisely those phase gadgets that do involve the first qubit. These are hence the parities like x1, x1 x2, …, x1 x2 x3 x4 x5. Since these parities can involve any combination of the other four qubits, there are 24 = 16 such parities (or alternatively you can see that we subtract 15 parities from 31 parities in Eq. (10.33) to arrive at 16 parities).

Exercise 10.7. Generalise Exercise 10.4 by showing that for any n 4, the collection of all phase gadgets with phase + π 4 is the identity. Use this to argue that in the identities described above where we subtract two fully connected spider nests from each other we also get an identity that only has positive + π 4 phases.

It is this 16-gadget identity (10.33) that turns out to generate all the other ones. We need to do some work to see that though. The first step is to construct such collections of phase gadgets in a more systematic way.

Definition 10.3.3. The spider-nest maps sn : 1 n are constructed inductively as follows:

(10.34)

Intuitively, this inductive definition results in a phase gadget connecting the single input wire to every subset of the output wires. For example:

where the last step follows from applying strong complementarity to the marked spider pair, and then applying spider fusion (sp) as much as possible. Let’s formalise this intuitive explanation of sn using scalable notation. Let Bn be the n × 2n matrix whose 2n rows consist of all n-bit strings. That is, the matrix defined inductively as follows:

B0 = ()Bn = ( Bn1 0 Bn1 1 )

where 0 and 1 are respectively the column vectors of all 0’s and all 1’s. For example, we have:

B1 = ( 0 1 )B2 = ( 0 0 1 0 0 1 1 1 )
(10.35)

Lemma 10.3.4. For all n, we have:

(10.36)

Proof. First, note that:

(10.37)

Using this equation and the scalable rules, we can prove (10.36) from (10.34) by induction on n:

Now, we can put s4 in a circuit and it can represent exactly the 16-gadget 5-qubit identity (10.33):

(10.38)

By plugging diagram of ketplus-1 into all the inputs and yanking the first output to be an input we can present this in a slightly simpler way:

(10.39)

We call this the (S4) rule, and it is enough to prove all the spider nest identities. First, let’s note that because s4 disconnects, all the sn for n 4 also disconnect.

Lemma 10.3.5. For n 4, the Clifford ZX-calculus augmented with the S4 rule implies:

(10.40)

Exercise 10.8. Prove Lemma 10.3.5 by induction on n with the base case n = 4 being (10.39).

In Eq. (10.38) we used s4 to represent the spider nest consisting of all gadgets connected to the first qubit. By using sn for n 4 and using Lemma 10.3.5 we see then that the collection of all gadgets on n + 1 qubits that are connected to the first qubit is also an identity. We can generalise this to the set of all gadgets connected to the first k qubits. To do this, note that if we connect sn to an X-spider on the left that we obtain the following:

(10.41)

where 1 is the k × 2n matrix where every entry is 1. Hence, to represent a connection to the first k qubits, we can just compose sn with an X-spider on its inputs. This then also leads to an identity:

(10.42)

Note that the special case of k = 0 gives us the set of all phase gadgets on a set of wires, like the 15-gadget identity (10.26) (in that case Bn has 16 rows, but the all-zero row of Bn corresponds to an unconnected phase that can be removed as a scalar). As it turns out, every spider nest identity can be decomposed into a composition of the identities of the form (10.42), and so (S4) indeed suffice to prove all of them. To show this we will need some more machinery however.

10.3.3 Spider nests as Boolean polynomials

In the previous sections we saw that we can reduce a collection of phase gadgets to a series of bit strings denoting the connectivity of the gadgets:

Here we are ignoring the phase gadget with a π 2 phase, as it is Clifford. We saw in Eq (10.19) that these bit strings can be stored in one big matrix and in this way we can efficiently write down a collection of gadgets in scalable notation. Storing them all in a matrix is however not the only way in which we could capture the information of a set of bit strings. We could instead represent them using its Boolean indicator function. That is, to a set of bit strings S 𝔽2n we associate the function fS : 𝔽2n 𝔽2 defined by fS(y) = 1 iff y S. We should note two important details about representing a collection of gadgets by its indicator function. First, this representation cannot deal with repeated gadgets / bit strings, and so this does not capture the exact phase of the gadgets (whether it is + π 4 or π 4 for instance). This means that when representing a collection of gadgets by its indicator function that we only represent it up to some Clifford information. Second, because 0 corresponds to a phase gadget not interacting with any qubit, we don’t care about the value of the function at this value. It would hence be more accurate to write the functions as f : 𝔽2n {0} 𝔽2, but we will ignore this detail for now. Any Boolean indicator function f : 𝔽2n 𝔽2 corresponds to a collection of phase gadgets: a gadget with connectivity y is in the collection if f(y) = 1. Some collections of gadgets are spider nests, so let’s call f a spider-nest function when its corresponding collection of gadgets forms a spider nest (up to a possible Clifford unitary). I.e. if a collection of bit strings S forms the rows of a 3-even matrix, then fS is a spider-nest function. The indicator function of the 4-qubit spider nest of Eq. (10.26) is the constant function 14 : 𝔽24 𝔽2 that always returns 1, since every phase gadget is part of the spider nest. The 5-qubit spider nest of Eq. (10.38) that contains all the gadgets using the first qubit has as indicator function X1 : 𝔽25 𝔽2 that maps X1(y) = y1, since any bit string y is part of the set of gadgets if y1 = 1. The spider nest of Eq. (10.42) that uses all gadgets touching the first k qubits has indicator function X1Xk : 𝔽2k+n 𝔽2, which acts as X1Xk(y) = y1 yk. These spider-nest functions are examples of Boolean monomials. A monomial is a function constructed by multiplying simple bit-indicator functions Xj together. For instance X1X3X4(y) = X1(y) X3(y) X4(y) = y1 y3 y4. We call the number of bit-indicator functions in such an expression the degree of the monomial. For instance the spider-nest function X1Xk has degree k. It turns out that in general, any monomial of degree at most n 4 corresponds to a spider-nest identity. Indeed, X1Xk : 𝔽2k+n 𝔽2 for n 4 is a spider-nest function. By permuting the qubits these give us all monomials of degree at most n 4. The number 4 in n 4 comes from the fact that the smallest spider-nest identity, corresponding to the Boolean function 14 acts on four qubits. Note that the spider-nest functions form a linear space: suppose that both fS1,fS2 : 𝔽2n 𝔽2 correspond to spider nests. Then if we take the composition of all the phase gadgets with parities in S1 and that of S2 we end up with a new set of phase gadgets covering all the parities in S1 and of S2. However, the gadgets that are in both S1 and S2 will fuse and hence get a Clifford phase. We are ignoring the Clifford unitaries, so we see then that the collection of non-Clifford phase gadgets corresponds to the symmetric difference S1ΔS2. The indicator function is then fS1ΔS2 = fS1 fS2. As this XOR of functions is just the sum in 𝔽2, we see that any sum of spider-nest functions is again a spider-nest function, so that the spider-nests form a linear subspace of all Boolean functions. In particular, we can take a sum of monomials that are all spider-nest functions and create a Boolean polynomial that is a spider-nest function. Any Boolean function f : 𝔽2n 𝔽2 can be written in a unique way as a polynomial f = aλaX1a1Xnan where λa 𝔽2 are the coefficients that determine which monomials are in the decomposition of f. The degree of a polynomial is then the maximum degree of its monomials. As a sum of spider-nest functions is still a spider-nest function, we then see that any Boolean polynomial of degree at most n 4 is a spider-nest function. What about the converse? Does any spider-nest function correspond to a degree at most n 4 polynomial?

Lemma 10.3.6. A matrix M with n columns is 3-even if and only if its indicator polynomial fM is of degree at most n 4.

Proof. Let M be a matrix obtained from M by removing all repeated pairs of rows. M is 3-even if and only if M is, and both matrices have the same indicator polynomial. Hence, we can assume without loss of generality that M has no repeated rows. Write f = fM for the indicator function of M. Let P(r,n) denote the set of n-bit polynomials of degree at most r. We need to show that f P(n 4,n). Define an inner product on Boolean functions by g1,g2 = yg1(y) g2(y) = yg1(y)g2(y)(mod 2). We call g1 and g2 orthogonal when g1,g2 = 0 and we define P(r,n) as the space of functions that are orthogonal to all functions in P(r,n). We claim that P(r,n) = P(n r 1,n). With this claim it then remains for us to show that f P(3,n). Let’s do that first. Let g = Xjf. This is then a polynomial with g(b) = 1 iff bj = 1 and f(b) = 1. Hence bg(b) is equal to the Hamming weight of the jth column of M. This also works for products of columns: for h = XiXjXkf, we have bh(b) equal to the Hamming weight of the element-wise product of the i,j and kth rows, which is hence zero mod 2 because of 3-eveness. Now bh(b)(mod 2) = XiXjXk,f so that f is orthogonal to all degree-3 monomials XiXjXk. These span P(3,n), and hence f P(3,n) as desired. Now to prove the claim P(r,n) = P(n r 1,n) first note that if a polynomial f of n variables has degree less than n, then bf(b) = 0(mod 2). This is easy to check for monomials, as any monomial of degree < n must omit some variable xj, so that

bf(b) = b,bj=0f(b) + b,bj=1f(b) = 0(mod 2)

By 𝔽2-linearity of the map f bf(b)(mod 2) this then holds for all polynomials. Now, for any polynomial f of degree at most r and g of degree at most n r 1, f g has degree at most n 1. Hence f,g = b(f g)(b) = 0(mod 2). This implies P(n r 1,n) P(r,n). To show this is actually an equality, we will do dimension counting. Note that for a 𝔽2-vector space V and A V we have dim (A) = dim (V ) dim (A) because the dimension of A is restricted by independent linear equations specified by a basis of A. Since P(r,n) has the monomials of degree r as its basis, dim (P(r,n)) = d=0r(n d). By manipulating binomial coefficients, we can then see that:

dim (P(r,n)) + dim (P(n r 1,n))
= d=0r(n d)+ d=0nr1(n d) = d=0r(n d)+ d=r+1n(n d) = 2n = dim (𝔽 22n ),

so that dim (P(n r 1,m)) = dim (P(r,n)) and these spaces must be equal. □

We then see that we have proven the following.

Theorem 10.3.7. Let y1,,yk describe the connectivities of k gadgets with a π 4 phase acting on n qubits. Then the following are equivalent.

Theorem 10.3.8. The Clifford ZX rules plus the (S4) rule suffice to prove all spider-nest identities. That is, given any collection of π 4 gadgets that implements a Clifford unitary, we can rewrite this into this Clifford unitary using just the regular Clifford ZX rewrite rules together with (S4).

Proof. Let M be the n × k 3-even matrix describing a spider-nest identity. Then its corresponding indicator polynomial fM is a sum of monomials fM = jlmj of degree at most n 4. Let the corresponding matrices of these monomials be M1,,Ml. We have already shown how to prove the spider-nest identities corresponding to the Mj in Eq. (10.42), hence we can freely introduce them into the circuit of gadgets described by M:

Then, the indicator polynomial of L is fM + jmj = 0. Hence, every row in L appears an even number of times. Using gadget-fusion, we can therefore reduce all angles to integer multiples of π2. Hence, the entire diagram is then Clifford and can be rewritten into a Clifford circuit. □

We see then that when we restrict to just thinking about what we can do with diagrams, all the complexity of strongly 3-even matrices and degree n 4 Boolean polynomials reduces to just adding (S4) to the Clifford rules. Do note though that in the proof above we needed to know about Boolean polynomials and its decomposition into low-degree monomials in order to find which rewrites we should be applying to prove the spider-nest identity. In addition, the matrices corresponding to the monomials Mj might contain exponentially many rows, and hence this rewriting is not efficient. In fact, it seems very likely that an efficient rewrite strategy for spider nests should not exist (see Exercise 10.9). Forgetting about all these details again, we can see that we can rephrase this result into a completeness result, which very neatly ties in some of the earlier completeness results we have seen.

Theorem 10.3.9. The Clifford rules plus (S4) are complete for CNOT+T circuits. That is, given two CNOT+T circuits U and V written as ZX-diagrams, we can rewrite U into V using just the Clifford rules and (S4).

Proof. Note that we can trivially rewrite U to UV V by consecutively introducing pairs of cancelling gates from V and V . Hence, if we can show that UV can be rewritten to the identity we are done. In Section 7.1 we saw how we can write any CNOT+Phases circuit into a layer of phase gadgets followed by a CNOT circuit. Now since UV = I, it must be the case that both the phase gadget part and the CNOT circuit implement the identity. Hence, we can use the completeness of phase-free ZX (Theorem 4.3.6) to rewrite the CNOT circuit into the identity and Theorem 10.3.8 to rewrite the collection of phase gadgets, which necessarily forms a spider-nest identity, into a Clifford. This Clifford still must be equal to the identity, and hence by Clifford completeness (Theorem 5.5.7) it can be rewritten into the identity. □

10.4 Advanced T-count optimisation

We have now seen that there is a large number of configurations of T gates that actually correspond to Clifford circuits. Getting rid of non-Clifford parts of a circuit is usually a good thing, as we’ve seen that we can do a lot of rewriting and simplification with the Clifford parts of a circuit. In addition, as we will see in Chapter 11, the T gate in particular is quite costly to implement in many fault-tolerant architectures, and so we want to include as few of them as possible. The spider-nest identities suggest a simple rewrite strategy to optimise the number of T gates in a Clifford+T circuit. First, since spider nest identities apply to a collection of phase gadgets, and hence to CNOT+T circuits, we need to split up our Clifford+T circuit into CNOT+T subcircuits. We can view the Clifford+T gate set as consisting of CNOT, Hadamard, S and T. Of these, only the Hadamard is not in the CNOT+T set, and so we need to ‘split our circuit on Hadamards’. Then, pick a number of identities, preferably not containing too many gadgets and not acting on too many qubits. We want to see where in the CNOT+T circuit we can apply an identity so that it reduces the T-count. This is done as described in the beginning of Section 10.3: for each identity in our list, we check whether more than half of the gadgets are present in the circuit. If this is the case, then we add all the gadgets from the identity to the circuit, which by gadget fusion makes the matching gadgets already present in the circuit into Cliffords, and adds the other gadgets as non-Cliffords. This is repeated greedily until none of the identities has an overlap of more than half with the gadgets present in the circuit. Note that if we have n qubits in our circuit, then to match an m-qubit spider-nest identity, we need to check ( n m) different groupings of m qubits to see whether the identity ‘matches’ there. Hence, as long as m is bounded, the complexity of this algorithm is polynomial: O(nm). In practice, only checking spider nests with up to 5 qubits, and using a simple heuristic based on the ‘density’ of the number of gadgets on a set of qubits, the run-time can be made quite reasonable. Of course such an algorithm is only a heuristic, and is heavily dependent on the type of identities we include in our search, and since we are applying the identities greedily, you might get stuck in a local optimum. The problem of optimisation of CNOT+T circuits using spider nests is actually related to two well-known problems in computer science, which offer interesting and useful perspectives on this problem.

10.4.1 Reed-Muller decoding

We have seen that n-qubit spider nests correspond to Boolean polynomials of degree at most n 4. In addition, we have a correspondence between Boolean functions f and collections of spider nests specified by their parities as {y|f(y) = 1}. Let’s call the set of all n-bit Boolean functions Bn, and the set of n-bit spider-nest functions Sn Bn. Now, if we naively implement the set of gadgets corresponding to f by just implementing each of the gadgets in turn, then this will require |f| := y0f(y) number of T gates. We call |f| the Hamming weight of f. It is the number of 1s in f’s truth table. Note that we do not include f(0) in this sum, as this corresponds to the trivial phase gadget not connected to any qubit. However, we don’t have to naively implement f, because we know that all g Sn are actually free to implement: these correspond to Cliffords and hence do not require any T gates. Instead of directly implementing f, we can implement f g, which corresponds to applying the spider nest identity of g to the collection of gadgets of f. The cost of this implementation will then be |f g|. To implement f with as few T gates as possible we are hence looking for a g Sn such that |f g| is minimal. Let’s state this problem again in a slightly different way, and a bit more abstractly. We have a vector space V with a specified subspace S V . Given a vector v V , we want to find the closest s S to v, i.e. such that v s is as small a vector as possible (in some norm). This is known as a linear decoding problem, where S is our code space consisting of code words s S, and v is our ‘noisy message’ we are trying to decode. We will spare you the details for now, as we will have a lot more to say about linear codes in Chapter 11. In our case the subspace Sn consists of the spider-nest functions, which we know to be all the degree n 4 Boolean polynomials. This code space actually has a name: it is the degree n 4 Reed-Muller code, and hence optimising the T-count of a CNOT+T circuit corresponds to decoding the Reed-Muller code. To summarise this optimisation approach now a bit more concretely: We start with a unitary U that is built out of phase gadgets with phases that are multiples of π 4 . We take its corresponding Boolean function f Bn. We then find the ‘closest’ degree n 4 polynomial g Sn such that |f g| is as low as possible (corresponding to decoding the Reed-Muller code). We then implement the circuit U corresponding to f g by composing its phase gadgets. We know that U is equal to U up to some Clifford. We find the Clifford C such that U = UC (see Exercise 10.6). Then UC is our new circuit, and this has T-count |f g|. Reed-Muller codes are used a lot in practice and their decoding problem has been extensively studied. So in principle we could use such a decoder to optimise the T-count of a circuit. However, the codes that are used in practice mostly have a size, corresponding in our case to the number of qubits n, that is not too large. However, we don’t want to restrict to just small n, so that is a problem. Decoding Reed-Muller codes for large n is believed to be a hard problem, so we don’t expect there to be an efficient algorithm to optimally minimise the T-count of a CNOT+T circuit. For optimising the T-count of general Clifford+T circuits (that are allowed to contain Hadamards), there is a simple argument to see that T-count optimisation is NP-hard.

Exercise* 10.9. In this exercise we will work towards a simple proof that determining whether a Boolean function is satisfiable reduces to T-count optimisation of general Clifford+T circuits, and hence is T-count optimisation is NP-hard. Let f : 𝔽2n 𝔽2 be some Boolean function, which is described as some poly-size Boolean expression consisting of AND, XOR and NOT terms. We we want to determine whether there is a x 𝔽2n such that f(x) = 1. We have seen in Chapter 9 how we can construct Uf, the (n + 1)-qubit unitary acting as Uf|x,y = |x,y f(x), using Clifford+T gates.

1..

Let the circuit Cf be defined as follows:

(10.43)

Show that Cf is a diagonal unitary which can be described by the following path-sum expression:

Cf|x,y = eiπ 4 (12y)f(x)|x,y.
(10.44)
2..

Show that if f is not satisfiable or everywhere satisfiable (meaning f(x) = 1 for all x) that then Cf is a Clifford unitary (up to global phase) and hence can be implemented with T-count zero.

3..

If f is satisfiable but not everywhere satisfiable, then by definition there exist z1 and z2 such that f(z1) = 1 and f(z2) = 0. Then it is easy to see from Eq. (10.44) that

Cf|z1,0 = eiπ 4 |z1,0andCf|z2,0 = |z2,0.

Show that in this case Cf is not Clifford and hence it’s T-count non-zero, by finding a Pauli string P such that CfPCf is not in the Pauli group. Hint: You don’t have to calculate the full operator CfPCf. For the right choice of P it is enough to observe that CfPCf|z1,0 maps |z1,0 to something that a member of the Pauli group could not.

Now, note that if we could efficiently determine the optimal T-count of any circuit, then for a given f we could construct Cf and ask whether it’s T-count is zero: if it is not then we know it has to be satisfiable. If it is zero, then either the circuit is not satisfiable or everywhere satisfiable. We then just check the value f(00) to see which is the case. We can hence in either case efficiently determine whether f is satisfiable, an NP-complete task.

10.4.2 Symmetric 3-tensor factorisation

There is another way we can formalise the optimisation of the number of T gates in a diagonal CNOT+T circuit. To do this, we again need to consider the multilinear decomposition of a phase gadget as in Eq. (10.31):

eiπ 4 x1xn = exp (i (π 4 jxj π 2 i<jxixj + π i<j<kxixjxk))
(10.45)

In particular, the action of an arbitrary collection of π 4 phase gadgets can be represented by some degree-3 multilinear polynomial

π 4 jajxj + π 2 i<jbijxixj + π i<j<kcijkxixjxk.

Here the coefficients aj can be taken modulo 8, bij modulo 4 and cijk modulo 2. By Exercise 10.5 two collections of phase gadgets correspond to the same polynomial if and only if they implement the same linear map. Similarly, two phase gadget circuits are equal up to a Clifford when the coefficients of their polynomials have the same parities aj(mod 2),bij(mod 2),cijk(mod 2). Hence, if we don’t care about the Clifford part of a computation, we can forget that the coefficient aj should be taken modulo 8, and instead take it modulo 2. Similarly we can take bij modulo 2 instead of 4. This is nice, because this information about the coefficients modulo 2 can be captured in a single object.

Definition 10.4.1. An n-dimensional binary 3-tensor is an element S 𝔽2n3 where we label the components of the vector by Sijk for indices 1 i,j,k n. We say S is symmetric when Sijk = Sjik = Sikj = Skji = Skij = Sjki for all indices i,j,k, i.e. when S is invariant under permuting its indices.

We define the symmetric 3-tensor Sijk corresponding to a degree-3 multilinear polynomial by setting

Siii = ai Sijj = bij Sijk = cijk

for i < j < k. All other coefficients of S are then completely determined by symmetry. It is clear that from any symmetric 3-tensor we can also read of the coefficients of a degree-3 multilinear polynomial, which then corresponds to a phase gadget circuit. Note though that if we start with a phase gadget circuit, find its 3-tensor, and then translate it back into a phase gadget circuit, that we do lose some Clifford information in the process, and the resulting circuit is only equal to the original one up to some Clifford. Let’s look at what the 3-tensor of a single phase gadget looks like. Let y 𝔽2n describe the connectivity of the phase gadget. Then by Eq. (10.45) the corresponding multilinear polynomial has coefficients ai = yi, bij = yiyj and cijk = yiyjyk. That is: there is a T gate on all the qubits the gadget is connected to, a CS on all pairs of qubits it is connected to, and a CCZ on all triples of qubits it is connected to. This means the 3-tensor corresponding to the gadget Sy has a particularly simple form: Sijky = yiyjyk. 3-tensors that have this form are said to have rank 1. When we have a set of gadgets y1,,yk, the circuit has the corresponding tensor S = Sy1 + + Syk , and hence it is a sum of rank 1 tensors. Conversely, any way to write S as a sum of rank 1 tensors corresponds to a way to implement it as a series of phase gadgets.

Definition 10.4.2. A symmetric 3-tensor is rank 1 when it is of the form Sijk = yiyjyk for some vector y. For a general symmetric 3-tensor S its rank is the minimal number of terms needed to write S as a sum of rank 1 tensors, and we call such a sum a decomposition of S.

We see then that we have the following strategy for optimising the T-count of a phase gadget circuit: first find its corresponding 3-tensor. Then find a decomposition of this tensor into as few rank 1 tensors as possible. These rank 1 tensors directly correspond to an optimised set of phase gadgets which implements the same linear map as the original circuit up to a Clifford. Find what Clifford this is using the procedure of Exercise 10.6. The resulting T-count of the circuit is exactly equal to the rank of the decomposition we found. In particular, determining the optimal T-count of a given phase gadget circuit is equivalent to determining the rank of a symmetric 3-tensor. Unfortunately, determining the rank of a (symmetric) 3-tensor is believed to be a hard problem. Fortunately, there are some good heuristics that try to find low rank decompositions that work well in practice. We will say a bit more about these in the References of this chapter.

10.5 Catalysis

We saw way back in Section 3.3.1 that we can implement a T gate by injecting the |T := T|+ magic state into the circuit:

This is useful, because it is sometimes difficult to directly implement the T gate (as we will see in the next chapter), and instead having the ability to prepare magic states ‘offline’ and inject them when needed is preferable. However, in this circuit we consume the magic state when we inject the T gate. Wouldn’t it be nice if we could preserve this state instead? This does turn out to be possible, if we use a different injection circuit that contains some other non-Clifford gates. In particular, using a CS gate and a single |T state, we can apply a T gate and get the starting |T state back:

(10.46)

Here we used the decomposition of the CS gate written using an H-box (Section 9.2) into elementary gates:

(10.47)

Just using Clifford operations and CS gates, it is not possible to construct a T gate. We can see this, because the matrices produced by the Clifford+CS gate set have entries in a ring that does not include eiπ 4 . However, with Eq. (10.46) we see that as soon as we have just one |T state available to us, we can use CS gates to apply as many T gates as we want:

This is an example of what we call catalysis: a process that needs some resource to be present, but doesn’t consume that resource. In this case the resource is |T and the process is the implementation of a T gate using a CS gate. Another example of catalysis is using a |CCZ := CCZ|+ + + magic state, Clifford operations and a single T gate in order to get 3 |T states out:

(10.48)

So again, if we can perform CCZ and Clifford gates and have just a single |T available, then we can inject as many T gates as we want. There are three things we can do with catalysis of |T magic states that we will explore in this section. First, in some cases it turns out to be easier or cheaper to perform a CS or CCZ gate than a T gate, and then these methods allow us to save resources. Second, they allow us to prove that some gate sets are already computationally universal, even if they are not (obviously) approximately universal for unitary synthesis. And third, catalysis gives us a nice way to extend complete graphical calculi to larger domains.

10.5.1 Catalysis as a resource for compilation

In this section we will see how catalysis can be used to derive an efficient way to implement small angle rotations. To do that we first need to generalise Eq. (10.46) to allow us to implement controlled-phase gates. To see how this works it will be helpful to first write Eq. (10.46) in circuit notation:

(10.49)

Here we wrote a slightly more general circuit where we replace the T and controlled-S gates with Z(α) and controlled-Z(2α) gates. As a shorthand we write |Z(α) := Z(α)|+ as a generalisation of |T = T|+. Since this is a circuit equality that holds on the nose (with a correct global phase), it should continue to hold when we add additional control wires:

(10.50)

We can prove this is correct using some H-box rules (see Figure 9.0):

Exercise 10.10. Applying Eq. (10.50) to implement a CS gate requires using both a Toffoli and a CCZ. It turns out there is a different construction that only requires a single CCZ and some real Clifford operations (those that do not contain numbers with an imaginary part like CNOT, Hadamard, CZ, X). Prove the following identity:

Hint: First prove the identity for a = 0, and then show for a = 1 how the phase can be pushed through the circuit to cancel

Because we can apply catalysis equally well to controlled phases, we can start iterating the procedure producing bigger and bigger controlled-phase gates, where the phase being controlled is also increasingly large. For instance, if we want to implement a T gate, we can do the following:

(10.51)

Here in the last step we are left with a controlled Z2 operation. But since Z2 = id this does not do anything and we can remove it. So at this point we can stop the iteration of the catalysis. We see then that we can implement a T gate just using multiple-controlled Toffoli gates, if we have the right catalysis states lying around. This procedure works to implement any Z(2π2k) gate: we then get a ladder of k Toffoli gates. We have actually seen such a Toffoli ladder before: in Exercise 9.15 we saw that this is actually a controlled-decrementer circuit that decreases the value of an n-bit number by 1, controlled on the top wire. In that same exercise we saw that we can build a subtraction circuit if we make a ladder of these controlled-decrementers. For this reason, when we apply a subtraction circuit to a collection of catalysis states, this implements phase gates on on the top qubits:

(10.52)

The reason this is nice, is because in Section 9.5 we found a very efficient construction of the adjoint of the subtraction circuit: the adder. So if we can transform Eq. (10.52) slightly, so that it uses an adder instead of a subtracter, this gives us a way to efficiently implement a whole collection of phase gates at once. The way we do this is by taking Eq. (10.52) and composing both sides on the right by Add = Sub, and on the left by (T S Z). After cancelling with the adjoints we are then left with the following equation:

(10.53)

We showed the construction here for 3 bits, but this works for any number of bits n, in which case the smallest phase we implement is Z(2π2n). While this is nice and all, this might not seem immediately useful: we have this pattern of phase gates appearing in parallel, where we have a small-angle phase gate, a slightly-large-angle one, and so on. You might wonder, surely it will not happen often that we can use this exact pattern of phases in a real quantum circuit, and you would be wondering right. However, with the magic of ancillae we can pick and choose exactly which phases we want to appear and where. We can transfer the application of a phase gate to a zeroed ancilla:

(10.54)

Now when we have a complicated phase, we can decompose it into simple components, and put each of these onto its own ancilla. Suppose for instance we want to implement the phase Z(11 8 π). We can then write 11 bitwise as 1011 so that Z(11 8 π) = Z(2π24(23 + 21 + 20)). We can then put each of these component phases onto their own ancilla to get:

(10.55)

We have here also sneakily added a zeroed ancilla that gets a Z(π 2 ) applied that does nothing. We need this qubit though to complete the pattern: we see then that we get the right shape needed to use Eq. (10.53). However, note that Eq. (10.53) has adjoint phases, instead of the actual phases we need. There are multiple ways we can deal with this. One way is to realise that for phase gates, the adjoint is the conjugate: T = T¯. Hence, if we take the conjugate of both sides of Eq. (10.53) we do get the right phases. Since the Adder is a real matrix, this stays the same, but the states needed for the catalysis also flip: |T¯ = |T. We then have everything we need to produce the circuit we are after:

(10.56)

Well, that certainly seems like overkill. Why would we go through all this trouble just to prepare a single phase rotation. Well, it turns out that in a fault-tolerant setting we can’t just go and do whatever gate we want to do. We are restricted to just a small set of gates we can (cheaply) implement. So if our computation requires us to do some phase rotation on a weird angle, we have to find a way to this with the gates that we have access too. One way to do this would be to approximate the phase rotation by combining together unitary Clifford+T gates like in Section 10.1.2. But as we have now seen, another way to do it is to prepare just a single copy of each |Z(2π2k) state to serve as a catalyst which can be reused, and then apply some CNOTs and an adder. Because the catalysts can be reused, the asymptotic cost of this procedure is just the cost of the adder and the CNOTs. Let’s calculate more accurately what the cost then is. Suppose we want to implement a phase gate with angle α up to a precision 𝜀. We then find the smallest n such that 2n < 𝜀. We can then approximate α by a phase α^ = a2π2n where a < 2n is an integer such that |α α^| < 2n < 𝜀. It hence suffices to implement α^ instead. Since a is an n-bit number, we can implement the Z(α^) gate using a generalisation of Eq. (10.56) to n bits. We saw in Section 9.5 that we can implement an n-qubit quantum adder using n Toffoli gates. In a fault-tolerant architecture the implementation of these Toffoli gates is what dominates the cost. As 2n < 𝜀 we have log21𝜖 < n, and hence we can express the cost also in terms of the error budget, and say that we require log21𝜖 Toffoli gates. Decomposing each Toffoli with 4 T gates we see that we can equivalently say that the cost is 4log21𝜖 T gates.

10.5.2 Computational universality via catalysis

Using catalysis we can replace any occurrence of some gate (likeT) with some other gate (like CS), as long as we have access to some special catalyst ancilla state. We can use this idea to prove that certain gate sets are also universal for quantum computing. In this section we will demonstrate this idea by showing the Clifford+CS gate set is universal. This notion of universality we will use is however not the approximate universality like that of the Clifford+T gate set we demonstrated in Section 10.1.2. Instead it is what we will call computational universality. Approximate universality requires that we can approximate any unitary and hence quantum circuit arbitrarily well. But such a strong condition is actually not needed for a gate set to be useful. It is sufficient if we can just simulate the run of an arbitrary quantum circuit using some runs of a quantum computer using the restricted gate set. Let’s work through an example to make this more clear. For this section we will say that the purpose of a quantum computer is to estimate the expectation value of some observable O. We start with some state |ψ, apply some unitary U to it, and then do measurements and post-process these measurements to get an estimate of O. After many such runs we will get a close approximation of O. Mathematically we can represent the exact expectation value as:

(10.57)

However, when we are trying to estimate this observable, we don’t have to do this with just a single quantum circuit we run over and over again. Instead we could have a collection of different quantum circuits V j (potentially acting on a different number of qubits), input states |ψj, and observables Oj, such that taking a particular weighted average gets us the outcome we are after:

(10.58)

where here we define Oj to be the expectation value of Oj with respect to V j and |ψj. We see then that if we can estimate each of the Oj, that we can also estimate O itself, by just summing our estimates like O = jλjOj. This might seem like quite a hypothetical situation, so let’s give a concrete example. Suppose we have a Clifford+T circuit C applied to the input state |ψ. Then we can transform C into a circuit C containing just Clifford gates and CS gates using catalysis, so that C|ψ|T = C|ψ|T. If we were trying to estimate the observable O we can then check that:

(10.59)

So instead of just running the circuit C, we can run C, which doesn’t contain any T gates. This is then an example of Eq. (10.58) where the sum is over just one term and we have λ1 := 1, U1 := C, |ψ1 := |ψ|T and O1 := O I. But now suppose we don’t even want to use that single |T we need for the catalysis. What we can do then is decompose this magic state into a sum of Clifford states. Because each term in the sum needs to retain the form of an expectation value like (10.57), we can’t just decompose |T into pure states |ϕj, instead we need to decompose |TT| into a sum of |ϕjϕj| density matrices. One way to do this is the following:

|TT| = 1 2 ( 1 e4 e4 1 ) = 1 2 ( 1 1+i 2 1i 2 1 ) = 1 2|++| + 1 2|ii|2 1 2 (|00| + |11|). (10.60)

Hence, we can decompose |TT| into four Clifford states |ϕ1 = |+, |ϕ2 = |i, |ϕ3 = |0 and |ϕ4 = |1 with weights λ1 = λ2 = 1 2 and λ3 = λ4 = 21 2 . Starting with the left-hand side of Eq. (10.59) we then have:

(10.61)

We see then that this is a case of Eq. (10.58) with |ψj := |ψ|ϕj and Oj := O I and Uj = C for all j {1,2,3,4}. While we can reduce the calculation of an expectation value to the calculation of a sum of (potentially simpler to calculate) expection values in this way, there is an important issue here that we have however glossed over. We can only ever estimate the expectation value, not get an exact value. Generally, we want to determine an error budget for how close we want the estimate to be, and then that determines how many times we need to sample from the quantum computation. Since we are summing together different expectation values, we need to be careful that we aren’t blowing up the error in the estimates. Suppose for instance that some λk = 100. Then a small error in our estimate of Oj will blow up by a factor of a 100. On the other hand, if λk = 1100, then any error will also be decreased by a factor of a 100, so that even a large error is not that important. In you want to be efficient and not over-sample a given expectation value, so that we get its estimate at just the right target precision we then need to sample Oj a number of times proportional to |λj|. We can calculate that this summing approach gives a total overhead in the number of samples of |λj| compared to just determining the desired expectation value O with the original circuit. For instance, in the above example where we decomposed |TT| into four terms, we have j|λj| = 22 1 1.83. Hence, if we decompose the magic state in this way we need to collect 1.83 samples more than we would have needed to if we did use the magic |T state directly. Summarising the full procedure we see then that we need to do the following:

1..

Start with the Clifford+T computation you want to calculate.

2..

Replace all T gates by a CS gate catalysis circuit using a |T.

3..

Replace the |T state needed for all the catalysis by the Clifford states |ψj.

4..

Run each of the resulting four circuits a number of times proportional to |λj|.

5..

Combine the resulting estimates of the observable by scaling by λj to get the final outcome.

When we have Clifford gates and CS gates, the gate set is generated by CNOT, Hadamard, S and CS. Of course, CNOT can be constructed using CS and Hadamard, and if we allow states to be prepared into |0 and |1, then we can also prepare an S using a CS. Hence, this gate set is equivalent to just the CS and Hadamard gate. We see then that we have proven the following.

Theorem 10.5.1. The CS+Hadamard gate set is computationally universal. In particular, a Clifford+T computation can be simulated by a CS+Hadamard computation with a linear overhead in the number of samples, qubits and gates needed.

Remark 10.5.2. Our decomposition (10.60) of |TT| is a stabiliser decomposition, a concept we also looked at in Section* 7.8.1. But as we saw here, the simulation overhead was not based on the number of terms in the decomposition, but rather on the weight j|λj| = 1.83. This value is known as the stabiliser extent, or equivalently, the robustness of magic of the decomposition. Without using any catalysis, we could have chosen to write each of the T gates as a magic state injection, and then replace each of the |TT| states by its stabiliser decomposition. When we do this however, the stabiliser extent scales as 1.83t where t is the number of T gates, so that the simulation overhead becomes exponential in the number of non-Clifford gates. We expect such an exponential dependence, since replacing all the T gates gives us a Clifford circuit, and we don’t expect this gate set to be computationally universal. Note that this however does give us a classical simulation technique: write a Clifford+T circuit as a Clifford circuit where each T gate is replaced by a magic state injection using the |ϕj Clifford states, and then efficiently classically simulate each of these Clifford circuits. The cost of this method is then roughly O(k(n + t)31.83t) where n is the number of qubits, and k is the total number of gates in the circuit.

Remark* 10.5.3. We haven’t actually given a formal definition of what ‘computational universality’ really is. There are multiple ways we could define it that all differ in the details. One particular way we could define it, which we could also call ‘BQP-completeness’ is as follows: for a given gate set G define the complexity class BQPG as the types of decision problems that can be solved with high probability by a quantum computer just using gates from G. Then G is BQP-complete if PBQPG = BQP. That is, if a classical computer that can query a quantum computer using gates from G can solve the same problems in polynomial time as a universal quantum computer.

Because we can also catalyse T gates using a CCZ gate, we can also prove a version of Proposition 10.5.1 for the Clifford+CCZ gate set, showing that Clifford+CCZ is also computationally universal. In fact, we can restrict the gate set a bit more, as Toffoli+Hadamard is itself already computationally universal. This however requires a different argument then we have been using here, and hence we leave this for the advanced section 10.7.4.

10.5.3 Catalysing completeness

We saw in Theorem 10.3.9 that the (S4) spider nest rule combined with the standard ZX rules we have been using throughout the book is enough to get a complete set of rules for CNOT+T circuits. However, this extended rule set is not enough to prove all identities of ZX-diagrams in the π 4 -fragment. For instance, a simple identity that cannot be proven is the following rule known as supplementarity:

(10.62)

Formally showing that this cannot be proven using the ZX+(S4) rules is difficult, but intuitively we can see this because none of the standard ZX rules have any special behaviour for π 4 phases, while all the spider nest identities only deal with at least 15 π 4 phases, so there is nothing that says anything about the pair of π 4 phases here. Finding a rule set and then proving it is complete is usually a difficult challenge. However, using catalysis it turns out we can make our lives much simpler, and simply extend an existing complete calculus. Recall that in Chapter 9 we introduced a new type of generator for ZX-diagrams we called H-boxes. In particular we found a number of rewrite rules for phase-free H-boxes in Section 9.2.2, summarised in Figure 9.0. We remarked there that when we restrict to spiders with 0 and π phases, together with phase-free H-boxes, that these rules give us a complete calculus for postselected Toffoli-Hadamard circuits. Such diagrams correspond to a specific type of matrices. Namely, all such matrices are of the form ( 1 2)kM where M is an integer matrix. Hence, the matrix entries are from a subset of the ring [12]. As these matrices are just integer matrices up to a global scalar of 2 we will call this the fragment. Note that in particular there are no complex numbers in this calculus. Using catalysis we can make the set of matrices this calculus can represent larger in a ‘controlled way’ where we can also see which rules we need to add to preserve completeness. To see how this works, we want to first generalise the T gate catalysis of Eq. (10.46). First, as our goal will just be to produce states, we can plug |+⟩ into the top wire of Eq. (10.46). We can then simplify the expression to a more symmetric form:

(10.63)

We can then identify the underlying reason this catalysis works. It is because:

(10.64)

Here we suggestively used Eq. (9.15) to write the phases as H-boxes. We do this because such a rule doesn’t just hold for an H-box with a label that is a complex phase like e, it in fact holds for any complex a0:

(10.65)

This then allows us to write down a generalisation of Eq. (10.63) to arbitrary H-boxes:

(10.66)

Here we wrote a2 in the 2-ary H-box instead of a so that we don’t have to work with square roots. When we take a = eiπ 4 we get Eq. (10.63), but this works for any value. A particularly simple, but still interesting case is when a = eiπ 2 = i. Translating this back into circuit form gives us a catalysis of |i := |0 + i|1 states using a CZ. While this might seem trivial (it is after all provable using Clifford rewrite rules) in the context of the phase-free H-boxes it allows us to add complex numbers to the fragment. Namely, we can just add as a generator diagram of H-box-i, a single-ary H-box with label a = i to the calculus, and this in fact turns out to be sufficient to then represent any matrix 1 2kM where M now has entries in [i]. We will call this the [i] fragment. However, just getting this universality was the easy part. How do we know what new equations to add to this calculus to make it complete again? Generally, proving completeness is very difficult, as you first need to search for new equations, and then show that those equations are sufficient to prove all true equalities. However, as it turns out, adding the rule Eq. (10.66) for a = i to the already existing rules for the label-free fragment is already almost enough to get a complete calculus for this bigger fragment [i] which includes diagram of H-box-i. To see this, let’s first consider what a generic diagram in the [i] fragment looks like. We added the generator diagram of H-box-i, so now a diagram consists of generators from the old phase-free fragment plus this new generator. These generators could all combine to give us really complicated rewrites, so we want to prevent them from combining. Using Eq. (10.66) we can reduce all these separate instances of diagram of H-box-i into just one of them, reducing the complexity of the diagram. That is, given some diagram D in the [i] fragment, we can rewrite it to a diagram D containing just generators from the fragment such that:

(10.67)

Or, as it turns out, this is possible for most diagrams in the [i] fragment (Can you see for which ones it doesn’t work? If not, don’t worry, the authors of this book also originally forgot about this case. As we said: completeness is hard). We will talk about the failing case later, but for now let’s assume that our diagram satisfies Eq. (10.67). As a shorthand, we will write D[|ψ] for the diagram we get when we plug |ψ into the bottom input in Eq. (10.67). So here we have diagram of H-box-i-1. Note that diagram of H-box-i-2. Hence, if we expand it like this we see that D is equal to a sum of two diagrams: D where we plugged in |0 into the bottom wire, and iD where we plugged |1 into the bottom wire: D = D[|0] + iD[|1]. Now suppose we have two diagrams D1 and D2 in the [i] fragment and that they implement the same linear map: D1 = D2. We can both decompose them as described above to get D1[|0] + iD1[|1] = D2[|0] + iD2[|1]. Each of these Dj[|x] diagrams represents a matrix that is entirely real, so the only way for this equation of complex matrices to hold, is if it holds for the real part and for the complex part separately:

D1[|0] = D 2[|0]D 1[|1] = D 2[|1]
(10.68)

We then conclude that D1 and D2 are equal when we input either |0 or |1 into the bottom wire. As these states form a basis, this must then hold for any input. We can then leave this wire open and still have an equality:

(10.69)

Now we are in business! We have this equality as linear maps, but both diagrams are in the fragment for which we have completeness. We hence know how to rewrite one into the other. This gives us then a path to rewrite the original D1 into D2:

(10.70)

Here each equality is now a diagrammatic equality, and with (*) we denote we are using rewrites from the original complete calculus for the fragment. We have then very easily proven completeness for this larger fragment, all made possible using a single rule about catalysis. Well..., we would have proven completeness, if it were true we could always rewrite a diagram in the [i] fragment as in Eq. (10.67). We however made a hidden assumption: that there is at least one generator diagram of H-box-i-3 present in the diagram. When that is the case we can use Eq. (10.66) to reduce all these instances of the generator to just a single copy. But if the diagram didn’t contain any diagram of H-box-i-3 to start with, then this rule does not apply. In fact, we currently have no rewrite rules that relate a diagram containing a diagram of H-box-i-3 to one that does not contain any diagram of H-box-i-3 . This means in particular that our current rule set cannot prove the following true equation:

(10.71)

However, when we also add Eq. (10.71) as an additional rule, then this problem is solved and it is true that we can then always rewrite a diagram in the [i] fragment as in Eq. (10.67): if the diagram contains at least one diagram of H-box-i-3 we can already use Eq. (10.66) to transform to the form of Eq. (10.67), and if it doesn’t we use Eq. (10.71) once to introduce one diagram of H-box-i-3 , in which case it is also in the form of Eq. (10.67).

Proposition 10.5.4. The fragment of Z- and X-spiders with 0 and π phases and phase-free H-boxes, augmented with the diagram of H-box-i generator, the catalysis rule Eq. (10.66) for a = i, and the rule diagram of X-H-i-empty is complete for matrices of the form 1 2kM where M has entries in [i].

This trick for extending the calculus doesn’t just work for i: it works for any complex number a0 such that a2 . Let’s for example take a = 3. We can then do all the steps as before, translating a diagram D containing an arbitrary number of the H-box with label 3 into a diagram D in the fragment which just requires a single input of the 3 H-box: D = D[|0 + 3|1] = D[|0] + 3D[|1]. If we then have an equality between two diagrams D1 and D2 in the [3] fragment, we get D1[|0] + 3D1[|1] = D2[|0] + 3D2[|1]. Because each of the component diagrams only contains integers, this equation can only hold if the integer part and the 3 part hold separately. We hence again get two equalities D1[|0] = D2[|0] and D1[|1] = D2[|1], which allows us to conclude that D1 = D2 with the wire left open. We can then use a modified version of Eq. (10.70) to conclude that we have completeness. We also have a modified version of Eq. (10.71) that continues to be true: diagram of X-H-sqrt3-empty . Adding the catalysis rule and this scalar rule then gives us a complete calculus for the ring [3]. Although this covers many possible extensions, it does not cover one we care about: extending it with a T gate. This is because taking a = eiπ 4 we see that a2 = i. However, it turns out we can just iterate the catalysis procedure to get larger and larger calculi. Starting now with the calculus for the [i] fragment, we can add the H-box with the label eiπ 4 and add its catalysis rule. When we then go through the motions of the completeness proof again we will end up at the equation D1[|0] + eiπ 4 D1[|1] = D2[|0] + eiπ 4 D2[|1], where now each of the diagrams Dj[|x] has entries in [i] instead of . Luckily for us, eiπ 4 is still ‘independent’ of the entries of [i] so that again the only way for this equation to hold is if it holds for each component separately, so that the proof goes through without change. Since eiπ 4 = (1 + i)2, this calculus can represent arbitrary matrices with entries in the ring [i, 1 2].

Theorem 10.5.5. The fragment of Z- and X-spiders with 0 and π phases and phase-free H-boxes, augmented with H-boxes with a label of i and eiπ 4 and the catalysis rule Eq. (10.66) and scalar rule diagram of X-H-a-empty for a = i and a = eiπ 4 is complete for matrices with entries in the ring [i, 1 2].

Exercise* 10.11. Two copies of a eiπ 4 H-box can be used to represent an H-box with an i label. So instead of adding the i generator, we could only add the eiπ 4 generator. We then have to modify the catalysis rules: instead of having two separate ones, we need to stack them together into a single one. Find a modified catalysis rule that works in the label-free ZH-calculus augmented with just an H-box with a label of eiπ 4 and find which other rules you need to get completeness.

Exercise* 10.12. Prove supplementarity (10.62) using the phase-free H-box rules and the catalysis rules.

10.6 Summary: What to remember

1..

The Clifford+T gate set is approximately universal.

2..

In particular, it can exactly represent unitary 2n × 2n matrices with entries in the ring 𝔻[ω] = [1 2,eiπ 4 ].

3..

We can efficiently approximately synthesise single-qubit unitaries over the Clifford+T gate set. To achieve a precision of 𝜀 requires O(log1𝜀) number of gates.

4..

The scalable ZX notation allows us to represent large collections of parities as a single diagram. This is especially useful in representing large collections of phase gadgets.

5..

Certain collections of phase gadgets with phases that are multiples of π 4 correspond to Clifford unitaries or the identity. We call such collections of phase gadgets spider nest identities.

6..

A collection of gadgets represents an identity if its corresponding parity matrix is strongly 3-even, meaning that the Hamming weight of every column is a multiple of 8, of the product of every pair of columns is a multiple of 4, and the product of every triple of columns is a multiple of 2. The collection of gadgets is a Clifford if it’s parity matrix is 3-even, meaning that the three previous conditions only hold modulo 2.

7..

We can instead represent a collection of n-qubit gadgets by its indicator function. If this is a polynomial of degree at most n 4, then it is equal to a Clifford.

8..

Using this representation we can show that the standard ZX rewrite rules plus one additional rule (S4) suffice for completeness of CNOT+T circuits.

9..

Optimising the number of T gates in a CNOT+T circuit is equivalent to decoding a Reed-Muller code, or equivalently to finding a rank decomposition of a symmetric 3-tensor.

10..

We can relate gate sets involving the CCZ, CS or T gate together using the framework of catalysis, where we can interchange the role of one gate with another using a resource state that we call a catalyst. This catalyst is not consumed in the process and hence can be reused.

11..

Using catalysis we can find an efficient way to implement small-angle rotations, prove the computational universality of a gate set, and prove completeness by extending other complete rule sets.

10.7 Advanced Material*

10.7.1 Exact synthesis of Clifford+T states*

In this section we will take another look at the exact synthesis algorithm for Clifford+T unitaries described in Section 10.1.1, but now we will consider multi-qubit unitaries and fill in the number theory details. As we saw in that section, when we understand how to synthesise a state, an algorithm for synthesising a unitary follows easily, so let’s look at synthesising states first. So let’s suppose we have a normalised vector |ψ 𝔻[ω]2n . Our task is to find a Clifford+T unitary U such that U|ψ = |00. Writing |ψ = (ψ1,,ψ2n) we can represent the vector components ψi as ψi = aiω3 + biω2 + ciω + di where ai,bi,ci,di 𝔻. In Lemma 10.1.2 we saw that if these coefficients are integers, that then all the entries except one must be zero and hence |ψ is a unit vector. For a single-qubit state, the only possible unit vectors are |0 and |1, so if we got |1 we just apply an X gate to get it to be the |0 we want. However, now the state can be any |x up to global phase. We can map this to |0 by applying X gates wherever xi = 1. While this is fine if we are synthesising a state, this messes things up when we are synthesising this state as part of a bigger unitary synthesis routine where we care about many columns being sent to the right location. In that case we need to apply the appropriate 2-cycle classical gate (see Section* 9.7.2 for how these can decomposed into Clifford+T gates), to transform |x into |0, or whatever basis state we need it to be. We can also get rid of its phase ωk by applying the 1-level T[1]k gate that adds a ωk phase just to the |000 state. Because these 2-level and 1-level operations only change the basis states we are interested in, they do not mess up any of the other columns of the unitary we are synthesising. So as in the single-qubit case, if all the components in the vector are integers we are essentially done. We then just need to find a strategy to make the vectors be ‘closer to being integers’, i.e. elements of [ω]. The obvious metric for how far an element in 𝔻[ω] is from being an element in [ω] is the smallest power of 2 we have to multiply the element with to get an integer. However, this turns out not to be the best choice. This is because 2 is not a prime number in [ω]. The ‘magic number’ δ = 1 + ω we saw in Section 10.1.1 is a prime in [ω].

Definition 10.7.1. Let R be a ring and a R. We say a is a unit if there exists b R such that ab = 1. We instead say a is prime if a is not a unit or 0, and if for any decomposition a = bc with b,c R we have that either b or c is a unit.

Example 10.7.2. In the only units are 1 and 1, while in any field, like , every non-zero element is a unit. The primes of are precisely the prime numbers and their negations (since if a is prime, then multiplying a by any unit gives you another prime). In [ω] examples of units are ω, because ωω7 = 1, and 1 + 2, because (1 + 2)(2 1) = 1.

We can prove δ is prime in [ω] by defining a new kind of norm on [ω].

Exercise 10.13. On the ring [ω] we have a norm given by Nω(z) = zz¯. This norm has some nice properties, namely that it is multiplicative, Nω(z1z2) = Nω(z1)Nω(z2), and that it sends elements to positive elements of [2]. We can define a different norm with similar properties on [2]. For a + b2 [2] define the conjugate to be (a + b2) = a b2, and then define the new norm by N2(z) := zz.

a)

Show that the conjugate on [2] is multiplicative: (z1z2) = z1z2. Use this to show that the norm N2 is multiplicative.

b)

For z [ω] define N(z) := N2(Nω(z)) = (zz¯)(zz¯). Argue that N is also multiplicative, and that it maps all z to positive integers.

c)

Show that z is a unit of [ω] if and only if N(z) = 1. Hint: For the if direction the definition of the norm already gives you the inverse of z.

d)

Show that z is prime if N(z) is prime in .

e)

Calculate N(δ), N(2) and N(2), and conclude that N(δ) is prime, but the others are not.

So now we know that δ is prime while 2 (and hence 2) is not. But it turns out that δ is also a prime factor of 2 (and hence 2).

Exercise 10.14. Let δ = 1 + ω.

a)

Write δ2 and δ3 as a + + cω2 + dω3 for some integers a,b,c,d.

b)

Using the fact that ω + ω1 = 2, write δ2ω1 as x + y2 for some integers x and y.

c)

Define the unit λ := 1 2. Show that δ2ω1λ = 2.

So we see that 2 can be decomposed up to units into two copies of δ, and hence 2 can be decomposed into four copies. Hence, instead of considering the smallest power of 2 we have to multiply a number in 𝔻[ω] with to get something in our integer ring [ω], we instead consider the smallest power of δ, as this is a more finegrained metric. For a z 𝔻[ω], we call the smallest k such that δkz [ω] the least denominator exponent (lde) of z. For a vector of values |ψ 𝔻[ω]N, we call its least denominator exponent the smallest k such that δk|ψ [ω]N. Of course if the lde of a vector is 0, then it already consists of elements in [ω], and we know that such a normalised vector must be very simple. So if we can just find some procedure to iteratively reduce the lde to 0, then we are happy. The goal then is to find, for a given |ψ with lde k, a set of unitaries G1,,Gl such that |ψ = GlG1v has lde at most k 1. Then we could just repeat this procedure until we get to denominator exponent 0. Given a |ψ with lde k, we can define the vector |u := δk|ψ [ω]N. After making some modifications to |u by applying gates to get a |u [ω]N, we are interested in whether this modification has reduced the lde. In order to see when this is the case, we hence need to know when we can divide |u by δ, and still get a vector in [ω]N. Of course, when we start caring about divisibility by some number, we will need to talk about calculating modulo this number. So in the same way as we have been talking about parities, which are elements of modulo 2, now we are going to work with residues, which are elements of [ω] modulo δ. For elements x,y [ω] we will write x δy to denote that x y = for some a [ω]. For instance, in the exercise above we saw that 2 = (δω1λ)δ, and hence 2 δ0. It is not hard to see that δ is an equivalence relation, and that it is preserved by addition and multiplication: if a δb and c δd, then a + c δb + d and ac δbd. We then also have 2 δ22 δ0.

Lemma 10.7.3. For any z [ω] we have z δ0 or z δ1.

Proof. We have δδ0, and as δ = 1 + ω, we calculate then that

ω δω + 0 δω + 2 δδ + 1 δ1.

Hence for any j we have ωj δ1, so that a + + cω2 + dω3 δa + b + c + d. Since furthermore 2 δ0, we see that hence the residue of an element modulo δ is either 0 or 1. □

Given some |u = δk|ψ [ω]N our goal is to apply operations to |u to make all the components divisible by δ, and hence have zero residue. The components with residue 1 are then the ‘obstacles’ we want to get rid of.

Lemma 10.7.4. Let |ψ 𝔻[ω]N be a normalised vector with lde k > 0, so that |u = δk|ψ [ω]N. Then there are at least 2 components ui and uj of |u that have residue 1.

Proof. |u is divisible by δ iff uj δ0 for all j. But assuming that k was the lde of |ψ, then by definition it won’t be divisible, and so there will be at least one uj with non-zero residue. By normalisation of |ψ we have ψ|ψ = 1 and hence jujuj¯ = u|u = δkδ¯k δ0. Since residues are either 0 or 1, we then know that there are an even number of cases where ujuj¯ δ1. The residue is multiplicative, so for these j we must then also have uj δ1. Hence, if |u is not divisible by δ there must be at least a pair of elements ui and uj that each have non-zero residue. □

The fact that non-zero residues come in pairs is good, because it turns out we can only reduce the residue of elements of |u in pairs. We are working with Clifford+T gates. The CNOT, S and T only contain non-zero elements that are units in [ω] and have at most one non-zero entry per row, and hence applying these gates does not affect the residues of the state. The only gate then that can affect residues is the Hadamard H = 1 2 ( 1 1 1 1 ). We see that the Hadamard creates sums ui + uj and differences ui uj of elements of the vector, which can lead to lower lde, but then it also divides the elements by 2, which can increase the lde. Because 2 contains two powers of δ, we need to look at the vector elements modulo δ3 (the next power up), to see if applying a Hadamard will result in lower lde.

Exercise 10.15. We have δ3 = 1 + 3ω + 3ω2 + ω3. Show that any element in [ω] is equivalent modulo δ3 to an element in the set

{0,1,ω,ω2,ω3,1 + ω,1 + ω2,1 + ω3}.

Hint: First note that 2 δ30, so that we only have to deal with elements a + + cω2 + dω3 where a,b,c,d {0,1}. Then argue that when a = b = c = d = 1, the residue is zero, so that you only have to consider the cases where at most two of a, b, c or d are 1.

Note that from this above exercise we immediately get the following, just by checking all the possible cases:

Lemma 10.7.5. If z [ω] has z δ1, then z δ3ωj for some j {0,1,2,3}.

Now we have all the tools we need to solve the problem at hand. For simplicity, let’s again first assume we are dealing with a single-qubit vector |u = (u1,u2) [ω]2. If it is not already divisible by δ then we must have u1 δu2 δ1, since the non-zero residues come in pairs. Then by the previous lemma we have u1 δ3ωl and u2 δ3ωk. In other words: u1 = ωl + xδ3 and u2 = ωk + yδ3 for some x,y [ω]. Then we see that if we apply a Tlk gate to this vector that we get:

Tlk|u = Tlk ( ωl + xδ3 ωk + yδ3 ) = ( ωl + xδ3 ωk+lk + ωlkyδ3 ) = ( ωl + xδ3 ωl + yδ3 ) ,

where we have defined y := ωlky. Now comes the magic trick: we apply a Hadamard, and we use the fact that 2 = δ2ω1λ, and hence δ2 = 2ωλ1 where λ is the unit from Exercise 10.14:

HTlk|u = 1 2 ( 2ωl + (x + y)δ3 (x y)δ3 ) = ( δ2ω1λωl + (x + y)ωλ1δ (x y)ωλ1δ ).
(10.72)

We see now that every term has at least one factor of δ, so we can factor it out:

HTlk|u = δ ( δω1λ + (x + y)ωλ1 (x y)ωλ1 ) .
(10.73)

Success! Because this means that HTlk|u is divisible by δ. As |u = δk|ψ, this means that HTlk|ψ now has lde smaller than k. We can now repeat this procedure until we get to lde 0, in which case we know the vector we have is a basis vector, and we are done. This just covers the single-qubit case, but reducing the lde of a multi-qubit normalised vector |ψ 𝔻[ω]2n is done very similarly. Defining |u = δk|ψ, where k is the lde of |ψ, we saw that there must be an even number of elements of |u with non-zero residue. We can hence pick a pair (ui,uj) that both have non-zero residue, and then apply the above technique, just ‘targeting’ this pair to zero out their residues. We can do this targetting by replacing the Hadamard and T gates above with the 2-level and 1-level operators H[ij] and T[j] that hence only change the residues of the elements ui and uj. The constructions in Section 9.7.2 show how to implement these gates using just Clifford+T gates. We do this reduction of lde with every pair that needs it, until all the residues are zero, in which case the modified |u is divisible by δ. This then means that the modified |ψ has lower lde. We then just repeat until the lde is zero and we are left with a basis vector. Pfew, that was a lot, so let’s summarise what we have actually done to get to the solution:

1..

We started out with a 2n × 2n unitary U where all the entries are in the ring [ 1 2,i].

2..

Then we realised in Eq. (10.74) that we wanted to find a unitary G that reduces the first column of U to a standard basis vector.

3..

Finding such a unitary G is equivalent to finding a way to reduce an arbitrary normalised vector |ψ to |00 using G: G|ψ = |00.

4..

Instead of writing |ψ as a vector over [ 1 2,i], we write it as a vector over 𝔻[ω]. We find its least denominator exponent k: the smallest number such that δk|ψ [ω], where δ = 1 + ω. We picked δ as the base, since it is prime in [ω].

5..

We look for two components ui and uj of |u = δk|ψ such that the residues ui δuj δ1. If there is such a pair we apply 2-level Hadamard and 1-level T gates to zero out their residues.

6..

If there isn’t such a pair of components left, then we have transformed |u to be divisible by δ, so that the new |ψ we found must have lower least denominator exponent.

7..

We then repeat this procedure until |ψ has lde 0, in which case it is a standard basis vector up to a phase, which is easily permuted into the desired basis vector, and its phase removed by applying the appropriate 1-level gate.

8..

We have now found our desired unitary G1 that we can apply to U to simplify its first column. Because U is unitary, this means its first row must now also be simplified.

9..

Now just rinse and repeat for all the other columns of U, resulting in a series of Clifford+T unitaries G1,,G2n such that G2nG1U = I.

10..

We hence have U = G2n1G11.

10.7.2 Exact unitary synthesis*

Let’s fill in the details on how to exactly synthesise an entire unitary and not just a single state. We have a 2n × 2n unitary U with matrix entries in 𝔻[ω]. Let |u1 be the first column of U, and V 1 the Clifford+T unitary satisfying V 1|u1 = |0 that we can find using the procedure described in the previous section. Then we have:

G1U = ( 1 0 0 0 U 0 ).
(10.74)

Note that here the first row also becomes a unit vector, because of the orthogonality conditions between the columns of the unitary G1U. Now we can take the first column of the smaller unitary U and synthesise it as a state again. We have to be careful to not undo the work we did with reducing the first column of U, and hence we need to use 2-level and 1-level operators to only touch the elements of the matrix we want to. Repeating this procedure, we see that we get Clifford+T unitaries V 1,,V 2n satisfying V 2nV 1U = I. Hence, we have synthesised U as V 1V 2n. All of this is a bit reminiscent of the CNOT synthesis algorithm using Gaussian elimination of Chapter 4. However, there we could encode the entire function of the CNOT circuit into an n × n matrix, while here we are working with a 2n × 2n matrix. Hence, even if each V i is a small circuit, the overall circuit synthesising U might still be exponentially large. We would not expect to do any better as we are now dealing with an approximately universal gate set, and hence there are simply too many possible unitaries we can synthesise for all circuits implementing them to be small. Let’s record what we have now seen in a Theorem.

Theorem 10.7.6. Let U be an n-qubit unitary with entries in 𝔻[ω]. Then U can be realised by a Clifford+T circuit using at most one zeroed ancilla.

Proof. In these sections we have found a method to write U using 2-level Hadamard and X operators and 1-level T operators. As described in Section 9.7.2, these can be built using gates with n 1 controls, which require a single zeroed ancilla to be implemented over the Clifford+T gate set (cf. Section 9.4.2). □

These techniques we have seen for exact synthesis are not unique to Clifford+T. They work for many gate sets that can at least express the 2-level operators necessary to move elements of the vector to the place where they are needed. One particularly simple example of this is the correspondence between circuits of Toffoli, CNOT, NOT and Z gates, and unitaries over the ring . Since all the entries in such a unitary U are integers, the normalisation of the column means that there is at most one non-zero element and that this element is ± 1. Hence, ignoring the possible 1 phases, such a unitary is just a big permutation of the basis vectors, which we know we can realise using a Toffoli circuit V (Section 9.7.1). The resulting unitary V U is then diagonal and only has ± 1 phase. The 1 phases can be realised by 1-level Z operators, and then we are done!

Proposition 10.7.7. Let U be an n-qubit unitary with entries in the ring of integers . Then U can be realised by a quantum circuit consisting of Toffoli, CNOT, NOT, and Z gates, using at most one zeroed ancilla.

There are several other results like this that make a correspondence between a certain quantum gate set and the set of matrices over a given ring.

Theorem 10.7.8. Let U be an n-qubit unitary and let R be a ring such that all the matrix entries of U are in R. Then U can be synthesised as a quantum circuit over the gate set G using at most one zeroed ancilla when:

Hence, we see that the exact synthesis of Clifford+T circuits doesn’t exist in a vacuum, but is in fact part of a ladder of increasingly more powerful gate sets corresponding to larger rings. This ladder can be continued, by replacing the T = Z(π 4 ) gate by Z(2π 2k) for some k > 3. The ring that corresponds to the resulting Clifford+Z(2π 2k) gate set is 𝔻[ei2π 2k ].

10.7.2.1 Optimality of single-qubit Clifford+T unitary synthesis*

Using the exact synthesis algorithm for Clifford+T unitaries generally results in very large circuits. The exception is when we apply it to single-qubit unitaries, for which it is in fact optimal in the number of gates needed. As we already saw in Section 10.7.2.1, for a single-qubit unitary, the Clifford+T exact synthesis takes a particular nice form. We start with some unitary

U = ( ψ1 ψ2¯e ψ2 ϕ1¯e ) .
(10.75)

Then we find a unitary V built out of H and T gates that synthesises |ψ = (ψ1,ψ2) so that V |ψ = |0. Then

V U = ( 1 0 0 eiα )
(10.76)

for some α which is a multiple of π 4 . Hence we can further reduce V U to the identity by applying the appropriate power of T. Up to some small constant, synthesising a single qubit unitary hence costs just as much as synthesising a single-qubit state |ψ. The cost of synthesising |ψ is directly related to its lde k. To reduce it to lde 0 we need to apply for each reduction a Hadamard gate and some power Tm of the T gate where m = 1,2 or 3. If m1 we can see this as applying a T gate and/or an S gate. We hence get a sequence of H, T and S gates, where each application of the Hadamard reduces the lde by at least 1. The total number of gates is hence at most 3k, and the number of Hadamard gates and T gates is each at most k. You could wonder whether there is any way we could do better, but it turns out that this number of gates is actually optimal! Suppose we start from the state |0 and we apply Hadamard, S and T gates to it. Then we end up with some state that we can write as

|ψ = 1 δk ( x + z + ),
(10.77)

where k is its least denominator exponent. Assuming that k > 0 we see that x + δz + , as |ψ is normalised. Hence x δz. Furthermore, we necessarily have x δz δ1 since otherwise both expressions would be further divisible by δ contradicting k being the lde. Applying an S and T gate to this state doesn’t change the denominator exponent, but a Hadamard can change the lde. We calculate then:

H|ψ = 1 δk 1 2 ( x + z + (y + w)δ x z + (y w)δ ).
(10.78)

Since x δz, we see then that each of the components of the vector x + z + (y + w)δ and x z + (y w)δ are divisible by δ. As furthermore 2 is δ2 up to some unit ω1λ, we see then that

H|ψ = 1 δk 1 δ2 ( aδ bδ ) = 1 δk+1 ( a b )
(10.79)

for some numbers a and b. Applying a Hadamard can hence only increase the lde by at most 1. So if we got some state |ψ with lde k, then we know the circuit building it must contain at least k Hadamards. But our synthesis algorithm requires at most k Hadamards to synthesise it. Hence the lde is exactly equal to the optimal number of Hadamards needed to synthesise it. Or, well, this is almost true, because our argument above only holds if k > 0. If we have k = 0, so that |ψ is a unit vector up to some phase, applying a Hadamard increases the lde by 2. So the number of Hadamards is k 1. This is actually what our synthesis algorithm above also finds if we were to analyse it a bit more carefully, because it turns out that there are no normalised vectors that have lde 1, so that in Eq. (10.73) our lde would actually reduce by 2 if we had k = 2 to start with.

Exercise 10.16. Prove that there are no normalised vectors that have lde 1.
Hint: This corresponds to showing that there are no vectors |u [ω]N for which u|u = δδ¯ = 2 + 2. Now use Eq. (10.2) for the component norms |ui|2, and argue that the only possible solution to i|ui|2 = 2 + 2 is the one where there is a single non-zero component which is in fact divisible by δ.

To conclude we hence see that if we have a vector with lde k, then our synthesis algorithm requires an optimal number of k 1 Hadamards to synthesise it, and up to k T gates. This last part is because we can have a T gate after every Hadamard, but we can also have one appear before the first one.

10.7.3 Approximate single-qubit Clifford+T synthesis*

In Section 10.1.2 we hinted at how we can do approximate synthesis of arbitrary single-qubit unitaries using a Clifford+T circuit. In this section we will fill in some more details on how this works, though some parts are a bit too technical even for this advanced section, and so we will just refer to the References for details on how to do that. So we have some arbitrary single-qubit unitary we want to approximate using Clifford+T gates. First we recall that any single-qubit unitary can be written as Z(α)X(β)Z(γ) for some phases α,β,γ. Additionally, X(β) = HZ(β)H, so if we know how to synthesise Z(α) gates, then we can synthesise arbitrary single-qubit unitaries. Let’s assume then that our goal is to approximate a Z(α) gate for some arbitrary α using just single-qubit Clifford+T gates. First, it will be useful to work with matrices that have determinant 1, so we write our Z(α) and our approximating unitary U as follows:

Z(α) = ( e2 0 0 e2 ) U = 1 2k ( u t¯ t u ¯ )
(10.80)

Here u,t [ω]. Note this this means our definition of Z(α) is different from the one we have been using in this book by a global phase. Note also that we wrote our approximation unitary with a factor of 1 2k in front of it, instead of a power of δ. This turns out to be nicer for this algorithm. We will fix an error budget 𝜖 , and require U Z(α) < 𝜖. If we found a U with this property, then we know how to synthesise it using the algorithm we described in the previous section, and this synthesis will have an optimal number of Hadamards and T gates, given the lde k. So our goal then is to find a U within 𝜖 of Z(α), with as low a k as possible. Given that we know how to determine whether a solution exists for a given k, we can find the optimal value of k by just starting low and increasing it until we find a solution. Since k won’t be too big (it will be O(log1𝜖)), this will still be efficient. We can hence assume that k and 𝜖 are fixed, and then we need to determine whether a solution exists, and if it does, what this solution is. In practice the finding of the solution will also tell us whether there is a solution, so we will focus on that. Our goal then is to find a U as in Eq. (10.80) for some given k and α, such that U Z(α) < 𝜖 for some given 𝜖. In order to help us do that, we need to have a more concrete description of the norm U Z(α) in terms of the matrix elements.

Lemma 10.7.9. Let un = 1 2ku and tn = 1 2kt be the normalised versions of u and t, and define z = e2. Then:

U Z(α)2 = |u n z|2 + |t n|2

Proof. Note first that U Z(α) = I UZ(α) by the unitary invariance of the operator norm. Now, UZ(α) is unitary, and hence has two eigenvectors |ϕj with eigenvalues eiαj. Because U and Z(α) both have determinant 1 we have 1 = det(UZ(α)) = eiα1eiα2, and hence setting α := α1 we necessarily have α2 = α. The eigenvectors of UZ(α) are also eigenvectors of I UZ(α) and have eigenvalue 1 e±, so that I UZ(α) = |1 e|. Recall that the Hilbert-Schmidt norm of a matrix A is given by AHS2 := jA|ψj2 = jψj|AA|ψj where the {|ψj} form an orthonormal basis. It is a standard exercise in linear algebra to show that the Hilbert-Schmidt norm is independent of choice of orthonormal basis. Choosing the eigenbasis of UZ(α) we see that I UZ(α)HS2 = 2|1 e|2 = 2I UZ(α)2, as both eigenvalues have equal magnitude. Instead picking |0, |1 for our orthonormal basis, we calculate that U Z(α)HS2 = 2|un z|2 + 2|tn|2 by just evaluating the matrices. Putting these different expressions for the norm together, we calculate:

U Z(α)2 = I UZ(α)2 = 1 2I UZ(α) HS2 = 1 2U Z(α)HS2 = |u n z|2 + |t n|2

Now, using the fact that unun¯ + tntn¯ = 1 and zz¯ = 1, we can expand |un z|2 + |tn|2 further and simplify some more:

|un z|2 + |t n|2 = (u n z)(un z)¯ + tntn¯ = unun¯ unz¯ zun¯ + zz¯ + tntn¯ = 2 2(z¯un).

Here (z¯un) denotes the real part of the complex number z¯un. So if U is a solution to our approximation problem, then 2 2(z¯un) 𝜖2, or equivalently (z¯un) 1 𝜖2 2 . Interestingly, this does not depend on tn, but only un. Furthermore, if we write un = a + bi and z = x + yi, then we can interpret them as 2-dimensional real vectors un = (a,b) and z = (x,y), and then (z¯un) = un z is just a dot-product. We hence want to solve the following problem: given a 2-dimensional real unit-vector z, find a subnormalised vector un = (a,b), such that un z 1 𝜖2 2 , where a,b [ 1 2] and we allow some maximal power of 1 2 depending on k. We call this a grid problem, because we have some target region, the vectors whose inner product with z is very close to 1, and a grid of points, given by [ 1 2]2, and we want to find a point on this grid that is in the target region. It turns out that this problem is efficiently solvable, although describing in full how to do so is quite lengthy (see the References of this chapter). So let’s suppose that we can solve the grid problem, and hence that we get a un satisfying un z 1 𝜖2 2 . This then gives us the candidate u we are looking for. But we then still need to find a t, so that the matrix U defined as in Eq. (10.80) is in fact unitary. The only equation we need to satisfy for this matrix to be unitary is |u|2 + |t|2 = 2k, as this means that the columns of the matrix correspond to normalised vectors. We know the expression for a norm for an element in [ω], this is namely given by Eq. (10.2): for z = aω3 + bω2 + + d we have

|z|2 = z¯z = (a2 + b2 + c2 + d2) + (cd + bc + ab da)2.

Rewriting the equation |u|2 + |t|2 = 2k to isolate our unknown t we get |t|2 = 2k |u|2. But this right-hand side will now be some ζ = x + y2 for integers x,y. Writing t = aω3 + bω2 + + d, we can also expand the left-hand side. We then have a part of the equation that results in an integer, and a different part that results in an integer multiple of 2. These need to be independently satisfied, so that we can split the equation up into two parts:

a2 + b2 + c2 + d2 = x ab + bc + cd da = y

This is a pair of quadratic equations over the integers, and is hence a special case of a Diophantine equation. While these are in general hard to solve, in this particular case it turns out that a large proportion of them are fast to solve. So in practice, we can try a bunch of candidates until we find one for which our solution strategy works. Once the details about solving the grid problem and this Diophantine equation are filled in, this gives an algorithm that gives an approximation of the phase gate Z(α) using Clifford+T gates that uses an optimal number of Hadamard and T gates. The length of the circuit scales quite well with the desired precision. For instance, we can approximate Z(π128) to within 𝜖 = 1010 with a circuit containing 102 T gates. This algorithm turns out to be both efficient and optimal. However, this optimality guarantee only applies when we restrict to unitary ancilla-free circuits. If we allow ancillae, measurements or classical control it is possible to do better by some constant factors. The optimal number in the unitary ancilla-free case scales as 3log2(1𝜖) + C for some small constant C, while the best known protocol when we allow measurements and classical control scales as 1 2 log2(1𝜖) + C (see the References of this chapter).

10.7.4 Computational universality of Toffoli-Hadamard*

In this section we will show that the set of real-valued unitaries, i.e. where all matrix entries are real numbers, is computationally universal. Then we will show that in fact the restriction to just Toffoli and Hadamard is already computationally universal. First, obviously real-valued unitaries are not approximately universal as we can never approximate any complex-valued unitary (like the S gate). But as we saw in Section 10.5.2 for a different gate set, it turns out we can ‘simulate’ complex-valued unitaries using a real-valued one on a larger set of qubits. For a (complex-valued) n-qubit unitary U, let (U) be the real part of U. That is: (U)ij = (Uij). Similarly define the complex part (U). Then U = (U) + iℑ(U). Now define the (n + 1)-qubit unitary U~ via

U~(|0|ψ) := |0 ((U)|ψ) + |1 ((U)|ψ) U~(|1|ψ) := |0 ((U)|ψ) + |1 ((U)|ψ)

While it is clear that U~ is real-valued, it is not immediately obvious that it is unitary.

Exercise 10.17. In this exercise we will show that U~ is indeed unitary for any choice of U.

a)

Express (U) and (U) in terms of (U) and (U).

b)

Express (UV ) and (UV ) in terms of (U), (V ), (U) and (V ).

c)

Show that ((0|ψ|)U~)(U~(|0|ψ)) = ψ|ψ. Hint: You will need (UU) = (I) = I.

d)

Show that ((0|ψ|)U~)(U~(|1|ψ)) = 0. Hint: You will need (UU) = (I) = 0.

e)

Conclude that U~ is indeed unitary.

The encoding into U~ is also compositional, meaning we can apply it iteratively to a sequence of unitaries.

Exercise 10.18. Show that UV ~ = U~V ~. Hint: Use a case distinction on input states |0|ψ and |1|ψ.

Note that this construction is in fact an example of catalysis, as:

(10.81)

Exercise 10.19. Prove Eq. (10.81).

We can however not use the argument of Section 10.5.2 to use this to show real-valued unitaries are computationally universal as there is no way to write the |SS| catalyst state in terms of states that can be prepared by real-valued unitaries. We can however use a different argument to prove computational universality.

Proposition 10.7.10. Real-valued unitaries are computationally universal.

Proof. Suppose C is an n-qubit circuit built out of unitaries as C = U1Uk. We then build the real-valued (n + 1)-qubit circuit C~ by C~ = U1~Uk~. Then note that:

(0|ψ|)C~(|0|ψ) = ψ|(C)|ψ (1|ψ|)C~(|0|ψ) = ψ|(C)|ψ.

Hence, if we input |0|ψ into C~ and do a measurement marginalising over the first qubit we also get the probabilities:

x=0,1|x,ψ|C~|0,ψ|2 = |ψ|(C)|ψ|2 + |ψ|(C)|ψ|2.

Now, we can also, with some effort, calculate that |ψ|C|ψ|2 = |ψ|(C)|ψ|2 + |ψ|(C)|ψ|2. Hence, the probability distribution we get for C is the same as that for C~ when we prepare the first qubit in the |0 state and ignore its measurement outcome. □

Because we can simulate complex-valued quantum circuits using real-valued unitaries in this direct manner, we don’t need all the real-valued unitaries. Given some approximately universal gate set G we only need to be able to represent U~ for U G. For instance, let’s take the Clifford+T gate set G = {T,H,CNOT}. Now, H and CNOT are already real-valued, and it is easy to see that U~ = I U if U is real-valued, so that it remains to see what gates we need for T~. Calculating its matrix, we see that it is

T~ = ( 1 0 0 0 0 1 2 0 1 2 0 0 1 0 0 1 2 0 1 2 ) .

It is straightforward to verify that this is equal to:

(10.82)

We can construct the CZ gate using H and CNOT so that we only additionally need the controlled-Hadamard gate.

Proposition 10.7.11. The gate set {CNOT,CH} is computationally universal.

Proof. We can use an ancilla in the |1 state and a CH gate to create the Hadamard. Using Hadamard and CNOT we can construct the CZ gate. We can hence represent T~, H~ = I H and CNOT~ = I CNOT. Since {T,H,CNOT} is approximately universal, this means that we can use the {X,CNOT,CH} gate set to arbitrarily closely simulate any unitary. □

The controlled-Hadamard is a bit of an arbitrary choice of gate. It turns out that the gate set of Hadamard and Toffoli is also computationally universal. This is a very pleasing result, because the Toffoli gate is universal for reversible classical computing, while the Hadamard gate is the single-qubit Fourier transform. So this in a way shows that the extra power of quantum computing comes from having access to this Fourier transform. The way we prove its computational universality, is by showing that Toffoli-Hadamard ‘simulates’ the CS-Hadamard gate set, which we know is computationally universal by Proposition 10.5.1. The real-valued encoding CS~ of the CS gate is equivalent to a Toffoli up to some swaps. With the Hadamard we just get H~ = I H. Hence, when we encode the CS+Hadamard gate set, we get the Toffoli+Hadamard gate set. We can hence do the following: starting with a Clifford+T computation, we write it as an ensemble of CS+Hadamard circuits. We then encode each of these circuits into a real-valued Toffoli+Hadamard circuit. By doing this we can efficiently simulate the original Clifford+T circuit. We see then that Toffoli+Hadamard circuits are also computationally universal.

Theorem 10.7.12. The Toffoli+Hadamard gate set is computationally universal.

So just using Hadamard gates and Toffoli gates we can simulate any quantum computation to any desired precision. The computational universality of real-valued quantum computing has an interesting philosophical consequence. One could wonder why quantum mechanics uses complex numbers. Why not just stick to regular old real numbers? The results of this section show that one plausible answer is in any case not a solution: we don’t need the complex numbers to reach a certain computational complexity, as we could have done the same with real numbers.

10.8 References and further reading

Exact Clifford+T synthesis That unitaries over the ring 𝔻[ω] can be exactly synthesised over the Clifford+T gate set was proved in [107]. The presentation we use here was originally given in Seth Greylyn’s Master thesis [118].

Approximating unitaries with Clifford+T gates The first paper to give an efficient algorithm for approximating single-qubit phase gates with Clifford+T gates was given in [207], which found a T-count complexity of 4log2(1𝜖). This was improved to the optimal 3log2(1𝜖) in [201]. A good reference for reading about the improvements made to approximate Clifford+T synthesis by incorporating ancillae, measurements and classical control is [154].

Spider nests That a 4-qubit phase gadget can be decomposed into the collection of all phase gadgets with at most 3 legs first appeared in [12], which hence gives the first appearance of a spider nest identity. The name ‘spider nest’ was coined in [75] which also described the optimisation technique based on toggling gadgets based on small spider nest identities. The notion of a strongly 3-even matrix was introduced in [35]. The relation between strongly 3-even matrices and T-count optimisation seems to have been folklore for a number of years as the strongly 3-even matrices were mostly used in the context of quantum error correcting codes to understand when a code has a transversal T gate (we will have a lot more to say about all that in the next chapter). The relation between strongly 3-even matrices and spider-nest identities was formally spelled out in [149]. In [16] it was shown that phase gadgets with dyadic angles are the only gadgets that allow non-trivial identities. If we for instance have gadgets with phases that are multiples of π 3 , then the only way they can cancel out is if they do so in the trivial way where the gadgets fuse together. See also [224] where they show that in certain settings the only optimisation we can do to ‘black-box’ non-Clifford phases is to fuse them.

T-count optimisation The idea to use small spider nests to optimise T-count was introduced in [75] and later improved upon in followup work by the same authors [74]. The relationship between T-count optimisation and Reed-Muller codes was established in [16] and the relation to symmetric 3-tensor factorisation in [126]. It is the representation as a symmetric 3-tensor that is currently the leading approach in T-count optimisation, with the current best methods being [202228]. The NP-hardness of T-count optimisation was established in [223].

Scalable ZX The scalable ZX-calculus was formally introduced in [49], though many of the new features appeared in an informal way in an earlier preprint [51]. The CNOT+T completeness in scalable ZX of the Clifford rules plus the (S4) rule appeared in [149]. Note that Clifford+(S4) is not complete for the full fragment of ZX-diagrams where the phases are multiples of π 4 . This is because (S4) preserves the same invariant that was used in [191] to argue that the Clifford rules are incomplete for the universal fragment of ZX-diagrams. In [136] a complete rule set for ZX-diagrams with π 4 phases was given.

Catalysis The CCZ to 2 |T catalysis was first described in [105], while the T-CS catalysis seems to be more of a folklore result. The idea of using catalysis to relate different gate sets was introduced in [13]. The synthesis of small phase gates using an adder and measurement-based uncomputation was described in [103]. The relation between catalysis, computational universality and completeness was studied by the authors of this book in [144]. This paper serves as the basis for Section 10.5.

Toffoli-Hadamard universality The original proof that Toffoli+Hadamard is computational universal is given in [208] (this also seems to be the paper that coined the term ‘computational universality’). This actually shows this property for a wider set of gates. Namely, if you have a Toffoli gate and any non-computational-basis-preserving gate, then this will lie dense in the group of special orthogonal matrices. A simpler proof of Toffoli+H universality, which is what we use, is given in [4], which also used the term ‘computational universality’. They show this easily by establishing that the gate set consisting of the controlled-S gate and the Hadamard gate maps to the Toffoli+Hadamard gate set. The approximate universality of the CS+H gate set was shown in [152, Lemma 4.6 on p. 1213]. Kitaev proves this using a ‘geometric lemma’ that adding a gate to a set of gates that stabilises a given state creates a larger-dimensional space of gates. This proof is not constructive. He then proves his Solovay-Kitaev algorithm to show how you would do it constructively (that proof only requires that the set of gates does in fact lie dense in the group). Note that what he calls S is the Hadamard gate, and K is the S gate. In this book we instead show that CS+H is ‘just’ computationally universal, as this follows much more directly using catalysis. This proof originally appeared in [144].