# IndexedSlices in Tensorflow

Run in Google Colab | View source on GitHub | Download notebook |

In this post, we will discuss about `IndexedSlices`

class of `Tensorflow`

. We will try to answer the following questions in this blog:

## What are `IndexedSlices`

?

According to `Tensorflow`

documentation, `IndexedSlices`

are sparse representation of a set of tensor slices at a given index. At an high level it appears to be some kind of sparse representation. Let’s try to understand it with examples.

## Where do we get it?

We get `IndexedSlices`

while taking gradients of an `Embedding`

layer. Embedding matrices can be huge (depending on vocabulary size). But each batch only contains a small fraction of tokens. So while computing the gradient of loss with respect to embedding layer, in each pass we have to only consider the corresponding token embeedings of the present batch. Naturally a sparse tensor seems to be a better option to record those gradients. `Tensorflow`

does that using `IndexedSlices`

. We will show that below using a contrived example.

```
import tensorflow as tf
print("Tensorflow version: ", tf.__version__)
```

`Tensorflow version: 2.4.0`

```
model = tf.keras.models.Sequential([
# Vocab size: 10, Embedding dimension: 4, Input_shape size: (batch_size, num_words). As usual, batch_size is omitted.
tf.keras.layers.Embedding(10, 4, input_shape = (5,)),
tf.keras.layers.Flatten(),
tf.keras.layers.Dense(1)
])
model.summary()
```

```
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding (Embedding) (None, 5, 4) 40
_________________________________________________________________
flatten (Flatten) (None, 20) 0
_________________________________________________________________
dense (Dense) (None, 1) 21
=================================================================
Total params: 61
Trainable params: 61
Non-trainable params: 0
_________________________________________________________________
```

```
data = tf.random.uniform(shape = (1, 5), minval = 0, maxval = 10, dtype = tf.int32) # Batch size is 1.
data
```

`<tf.Tensor: shape=(1, 5), dtype=int32, numpy=array([[6, 1, 1, 4, 8]])>`

`model.variables # Is a list of 3 tensors. 1 from Embedding layer and 2 from Dense layer (Kernel and bias)`

```
[<tf.Variable 'embedding/embeddings:0' shape=(10, 4) dtype=float32, numpy=
array([[ 4.10897247e-02, -2.48962641e-03, 1.26880072e-02,
3.39310430e-02],
[ 3.28579657e-02, 3.90318781e-03, 2.81411521e-02,
3.09719704e-02],
[ 1.16247907e-02, -1.41257644e-02, -3.36343870e-02,
-4.41543460e-02],
[-4.67238426e-02, 2.42819674e-02, -4.26802635e-02,
-2.59207971e-02],
[ 2.28367783e-02, -2.09717881e-02, 1.05572566e-02,
3.33249308e-02],
[-3.37148309e-02, -4.61939685e-02, -2.61853095e-02,
-4.10162285e-03],
[-3.59787717e-02, 2.78765075e-02, -3.16200405e-02,
4.54976298e-02],
[-4.67344411e-02, -1.30221620e-02, 1.52915232e-02,
2.22466923e-02],
[-1.03901625e-02, 2.40740217e-02, -1.24427900e-02,
4.47194651e-03],
[-3.57637033e-02, 4.28059734e-02, -2.59280205e-05,
4.09286283e-02]], dtype=float32)>,
<tf.Variable 'dense/kernel:0' shape=(20, 1) dtype=float32, numpy=
array([[ 0.42870212],
[ 0.04779923],
[ 0.4126016 ],
[-0.13294601],
[-0.3175783 ],
[-0.46080017],
[-0.23412797],
[ 0.30137837],
[-0.5197849 ],
[-0.10935467],
[ 0.5087845 ],
[-0.06930307],
[ 0.10028934],
[-0.11278141],
[-0.21269777],
[-0.0214209 ],
[ 0.12959635],
[-0.13330323],
[-0.23972857],
[ 0.23718971]], dtype=float32)>,
<tf.Variable 'dense/bias:0' shape=(1,) dtype=float32, numpy=array([0.], dtype=float32)>]
```

```
optimizer = tf.keras.optimizers.SGD(learning_rate = 0.1)
loss_object = tf.keras.losses.MeanSquaredError()
target = tf.constant([2.5], shape = (1,1))
for _ in range(2): # Let's run gradient descent for two batches of the same input data. (It's a contrived examples)
with tf.GradientTape() as tape:
output = model(data) # Output has shape: (batch_size, 1). Here batch_size is 1. So output shape is (1,1)
loss_value = loss_object(target, output) # Calculating some random loss.
grads = tape.gradient(loss_value, model.trainable_variables)
# Gradient descent step
optimizer.apply_gradients(zip(grads, model.trainable_variables))
```

`len(grads)`

`3`

`grads[0]`

`<tensorflow.python.framework.indexed_slices.IndexedSlices at 0x16c04d4a970>`

`print(grads[0])`

```
IndexedSlices(indices=tf.Tensor([6 1 1 4 8], shape=(5,), dtype=int32), values=tf.Tensor(
[[-0.9495101 -0.14344962 -0.91739434 0.25398374]
[ 0.69607455 1.0615798 0.5085497 -0.7338184 ]
[ 1.1639317 0.24842012 -1.2103697 0.12384857]
[-0.25895947 0.2856651 0.47968888 0.01028775]
[-0.28760925 0.28005898 0.56933826 -0.5540699 ]], shape=(5, 4), dtype=float32), dense_shape=tf.Tensor([10 4], shape=(2,), dtype=int32))
```

An `IndexedSlices`

object has 3 main entries.

- indices
- values, and
- dense_shape

## How to convert `IndexedSlices`

to `Tensors`

?

Before we do the conversion, let’s answer a relevant question: Why do we have to do the conversion from `IndexedSlices`

to tensors given that `Tensorflow`

can do a gradient descent step automatically through the `IndexedSlices`

? In the last section, we could run 2 gradient descent steps without worrying about `IndexedSlices`

.

But the problem occurs if we want to do some processing on gradient values. One such processing is `gradient clipping`

. In `gradient clipping`

, if sum of norm of gradients exceed a given value, gradients are rescaled to decrease their magnitude. Therefore, to do any gradient clipping, we have to access the gradient tensors. This is precisely where we would like to convert IndexedSlices to tensors. Having an embedding layer is common in deep learning models and applying gradient clipping to gradient values is also a common practice. We will show two approaches to do the conversion.

### Easiest approach

`tf.convert_to_tensor(grads[0])`

```
<tf.Tensor: shape=(10, 4), dtype=float32, numpy=
array([[ 0. , 0. , 0. , 0. ],
[ 1.8600063 , 1.31 , -0.70182 , -0.60996985],
[ 0. , 0. , 0. , 0. ],
[ 0. , 0. , 0. , 0. ],
[-0.25895947, 0.2856651 , 0.47968888, 0.01028775],
[ 0. , 0. , 0. , 0. ],
[-0.9495101 , -0.14344962, -0.91739434, 0.25398374],
[ 0. , 0. , 0. , 0. ],
[-0.28760925, 0.28005898, 0.56933826, -0.5540699 ],
[ 0. , 0. , 0. , 0. ]],
dtype=float32)>
```

### What did just happen in the last step?

Though the last approach is a single line elegant solution, it hides many things. How actually is the conversion done? The code below shows the steps in which we can manually do the conversion.

```
check_grad = tf.zeros_like(model.variables[0]).numpy() # Create a dense tensor of all zeros
for i, ind in enumerate(grads[0].indices):
check_grad[ind] = check_grad[ind] + grads[0].values[i]
check_grad
```

```
array([[ 0. , 0. , 0. , 0. ],
[ 1.8600063 , 1.31 , -0.70182 , -0.60996985],
[ 0. , 0. , 0. , 0. ],
[ 0. , 0. , 0. , 0. ],
[-0.25895947, 0.2856651 , 0.47968888, 0.01028775],
[ 0. , 0. , 0. , 0. ],
[-0.9495101 , -0.14344962, -0.91739434, 0.25398374],
[ 0. , 0. , 0. , 0. ],
[-0.28760925, 0.28005898, 0.56933826, -0.5540699 ],
[ 0. , 0. , 0. , 0. ]],
dtype=float32)
```

This brings us to the end of this blog. I hope this blog has demystified a few things about `IndexedSlices`

.

**Motivation for this post**: While writing TF 2 code for Attention Mechanisms chapter of D2L book, the author encountered an error involving `IndexedSlices`

. After spending a good deal of time hopelessly trying to figure out what’s going on, the author finally found that the error was occurring because of an user defined gradient clipping function that didn’t handle `IndexedSlices`

properly. The model involved embedding layers as it was dealing with machine translation task. Therefore, I thought of writing this blog with the hope that it would be of help to readers who are struggling to figure out what `IndexedSlices`

are.