Using Python Generators

In this post, we will discuss generators in Python. In this age of big data, it is not unusual to encounter a dataset that is too large to load into RAM. In such scenarios, it is natural to extract workable chunks of the data and process them one at a time. Generators help us do just that. Generators look almost like functions, but with a vital difference: while a function computes and returns its entire output at once, a generator produces its outputs one by one, and only when asked. Much has been written about generators, so our aim is not to restate it all. Instead, we will give two toy examples showing how generators work. Hopefully, these examples will be useful for beginners.

While functions use the keyword return to produce outputs, generators use yield. Using yield inside a function automatically makes that function a generator. We can write generators that run for a few iterations or indefinitely (by wrapping the yield in an infinite loop). Deep learning frameworks like Keras expect plain Python generators to loop over the data indefinitely, so we will also write a generator that runs indefinitely.
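To make the difference between return and yield concrete, here is a minimal sketch. The names `squares_list`, `squares_gen`, and `count_forever` are made up for illustration and are not used elsewhere in this post.

```python
from itertools import islice

def squares_list(n):
    """Regular function: builds the whole list, then returns it at once."""
    return [k * k for k in range(n)]

def squares_gen(n):
    """Generator: yields one square at a time, only when asked."""
    for k in range(n):
        yield k * k

def count_forever(start=0):
    """Infinite generator: keeps yielding until the caller stops asking."""
    k = start
    while True:
        yield k
        k += 1

print(squares_list(4))                      # [0, 1, 4, 9]
gen = squares_gen(4)
print(next(gen), next(gen))                 # 0 1
print(list(gen))                            # [4, 9]  (the values not yet consumed)
print(list(islice(count_forever(10), 3)))   # [10, 11, 12]
```

Note that calling `squares_gen(4)` runs none of the function body; the work happens only as `next()` is called on the returned generator object.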

First let’s create artificial data that we will extract later batch by batch.

import numpy as np

data = np.random.randint(100,150, size = (10,2,2))
labels = np.random.permutation(10)
print(data)
print("labels:", labels)

[[[132 119]
  [126 119]]

 [[133 126]
  [144 140]]

 [[126 129]
  [116 146]]

 [[145 104]
  [143 143]]

 [[114 122]
  [102 148]]

 [[122 118]
  [145 134]]

 [[131 134]
  [122 104]]

 [[145 103]
  [136 138]]

 [[128 119]
  [141 118]]

 [[106 115]
  [124 130]]]
labels: [3 5 8 4 0 9 1 6 7 2]

Let’s pretend that the above dataset is huge and we need to extract chunks of it. We will now write a generator that extracts a batch of two items at a time: two data points and their two corresponding labels. In deep learning applications, we want the data to be shuffled between epochs. For the first epoch, we can shuffle the data ourselves; from the next epoch onwards, the generator will shuffle it for us. The generator must also run indefinitely.

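def my_gen(data, labels, batch_size = 2):
    i = 0  # index of the next batch within the current epoch
    while True:
        if i*batch_size >= len(labels):
            # Epoch finished: restart and reshuffle data and labels together
            i = 0
            idx = np.random.permutation(len(labels))
            data, labels = data[idx], labels[idx]
            continue
        else:
            # Slice out the i-th batch and hand it to the caller
            X = data[i*batch_size:(i+1)*batch_size,:]
            y = labels[i*batch_size:(i+1)*batch_size]
            i += 1
            yield X,y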

Note that we have conveniently glossed over a technical point here. As the data is a NumPy ndarray, it must already be loaded into memory before we can slice parts of it. If our dataset is huge, this approach fails. But there are ways to work around the problem. First, we can read NumPy files without loading the whole file into RAM. More details can be found here. Second, in deep learning we often encounter many files, each of small size. In that case, we can create a dictionary that maps indices to file names and then load only the few files that a batch's index values call for. These modifications can be easily incorporated as needed. Details can be found here.
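As a rough sketch of the first workaround, `np.load` accepts a `mmap_mode` argument that memory-maps a `.npy` file instead of reading it fully. The file name `toy_data.npy` and the helper `memmap_batches` below are hypothetical, and the file is tiny only so that the example is self-contained.

```python
import os
import tempfile
import numpy as np

# Write a small array to disk so the sketch can run on its own.
# In practice the .npy file would already exist and be much larger.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "toy_data.npy")   # hypothetical file name
np.save(path, np.arange(40).reshape(10, 2, 2))

def memmap_batches(npy_path, batch_size=2):
    """Yield consecutive batches from a .npy file without loading all of it."""
    arr = np.load(npy_path, mmap_mode="r")     # memory-mapped, read-only
    for start in range(0, arr.shape[0], batch_size):
        # np.array(...) copies only this slice into RAM
        yield np.array(arr[start:start + batch_size])

batches = list(memmap_batches(path))
print(len(batches), batches[0].shape)   # 5 (2, 2, 2)
```

This generator makes a single pass over the file; wrapping the loop in `while True` would make it run indefinitely, like `my_gen`.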

Now that we have created a generator, we need to test whether it works as intended. So we will extract 10 batches of size 2 each from the (data, labels) pair. Here we have assumed that our original data is already shuffled. If it is not, we can shuffle it by indexing both data and labels with the same random permutation from “np.random.permutation()”, so that each data point stays paired with its label.

get_data = my_gen(data,labels)
for i in range(10):
    X,y = next(get_data)
    print(X,y)
    print(X.shape, y.shape)
    print("=========================")

[[[132 119]
  [126 119]]

 [[133 126]
  [144 140]]] [3 5]
(2, 2, 2) (2,)
=========================
[[[126 129]
  [116 146]]

 [[145 104]
  [143 143]]] [8 4]
(2, 2, 2) (2,)
=========================
[[[114 122]
  [102 148]]

 [[122 118]
  [145 134]]] [0 9]
(2, 2, 2) (2,)
=========================
[[[131 134]
  [122 104]]

 [[145 103]
  [136 138]]] [1 6]
(2, 2, 2) (2,)
=========================
[[[128 119]
  [141 118]]

 [[106 115]
  [124 130]]] [7 2]
(2, 2, 2) (2,)
=========================
[[[132 119]
  [126 119]]

 [[145 104]
  [143 143]]] [3 4]
(2, 2, 2) (2,)
=========================
[[[131 134]
  [122 104]]

 [[126 129]
  [116 146]]] [1 8]
(2, 2, 2) (2,)
=========================
[[[133 126]
  [144 140]]

 [[106 115]
  [124 130]]] [5 2]
(2, 2, 2) (2,)
=========================
[[[114 122]
  [102 148]]

 [[122 118]
  [145 134]]] [0 9]
(2, 2, 2) (2,)
=========================
[[[128 119]
  [141 118]]

 [[145 103]
  [136 138]]] [7 6]
(2, 2, 2) (2,)
=========================
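The generator assumes that data and labels are always shuffled in step with each other. Here is a small sketch of doing that before the first epoch with one shared permutation, the same idea the generator applies between epochs. The arrays `toy_data` and `toy_labels` are illustrative and are built so that the pairing is easy to check.

```python
import numpy as np

rng = np.random.default_rng(0)           # seeded only to make the sketch repeatable
toy_data = np.arange(10).reshape(5, 2)   # row k is [2k, 2k + 1]
toy_labels = np.arange(5)                # label k belongs to row k

perm = rng.permutation(len(toy_labels))  # one permutation, used for both arrays
shuffled_data, shuffled_labels = toy_data[perm], toy_labels[perm]

# Each row still sits next to its own label after shuffling.
print(np.array_equal(shuffled_data[:, 0], 2 * shuffled_labels))   # True
```

Shuffling the two arrays with separate calls would break this pairing, which is why a single index array is reused.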

In the above generator code, we manually shuffled the data between epochs. In Keras, we can instead subclass the Sequence class, and Keras will call its on_epoch_end() method for us after every epoch. An added advantage of this class is that it is a safer way to use multiprocessing. The new generator code becomes:

import tensorflow as tf
print("Tensorflow Version: ", tf.__version__)

Tensorflow Version:  2.4.0

class my_new_gen(tf.keras.utils.Sequence):
    def __init__(self, data, labels, batch_size= 2 ):
        self.x, self.y = data, labels
        self.batch_size = batch_size
        self.indices = np.arange(self.x.shape[0])

    def __len__(self):
        return self.x.shape[0] // self.batch_size  # __len__ must return a plain int

    def __getitem__(self, idx):
        inds = self.indices[idx * self.batch_size:(idx + 1) * self.batch_size]
        batch_x = self.x[inds]
        batch_y = self.y[inds]
        return batch_x, batch_y
    
    def on_epoch_end(self):
        np.random.shuffle(self.indices)

In this case, we must define the __len__ and __getitem__ methods in the class, and if we want to shuffle data between epochs, we also have to add an on_epoch_end() method. __len__ returns the number of batches in an epoch, and __getitem__ extracts the batch at a given index. When an epoch is complete, on_epoch_end() shuffles the indices and the process continues. We will test it with an example.

get_new_data = my_new_gen(data, labels)

for step in range(10):
    if step == 5:
        # One epoch (5 batches) is complete: reshuffle, as Keras would
        get_new_data.on_epoch_end()
    # step % 5 is the batch index within the current epoch
    dat,labs = get_new_data[step % 5]
    print(dat,labs)
    print(dat.shape, labs.shape)
    print("===========================")

[[[132 119]
  [126 119]]

 [[133 126]
  [144 140]]] [3 5]
(2, 2, 2) (2,)
===========================
[[[126 129]
  [116 146]]

 [[145 104]
  [143 143]]] [8 4]
(2, 2, 2) (2,)
===========================
[[[114 122]
  [102 148]]

 [[122 118]
  [145 134]]] [0 9]
(2, 2, 2) (2,)
===========================
[[[131 134]
  [122 104]]

 [[145 103]
  [136 138]]] [1 6]
(2, 2, 2) (2,)
===========================
[[[128 119]
  [141 118]]

 [[106 115]
  [124 130]]] [7 2]
(2, 2, 2) (2,)
===========================
[[[145 103]
  [136 138]]

 [[133 126]
  [144 140]]] [6 5]
(2, 2, 2) (2,)
===========================
[[[126 129]
  [116 146]]

 [[122 118]
  [145 134]]] [8 9]
(2, 2, 2) (2,)
===========================
[[[145 104]
  [143 143]]

 [[128 119]
  [141 118]]] [4 7]
(2, 2, 2) (2,)
===========================
[[[131 134]
  [122 104]]

 [[114 122]
  [102 148]]] [1 0]
(2, 2, 2) (2,)
===========================
[[[132 119]
  [126 119]]

 [[106 115]
  [124 130]]] [3 2]
(2, 2, 2) (2,)
===========================
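To see how the three methods divide the work, here is a rough picture of what a training loop does with such an object. This is only a sketch, not how Keras is implemented internally; `ToySequence` is a hypothetical plain-Python stand-in that needs no TensorFlow.

```python
import numpy as np

class ToySequence:
    """Plain-Python stand-in with the same three methods as the class above."""
    def __init__(self, x, y, batch_size=2):
        self.x, self.y, self.batch_size = x, y, batch_size
        self.indices = np.arange(len(x))

    def __len__(self):
        return len(self.x) // self.batch_size   # full batches per epoch

    def __getitem__(self, idx):
        inds = self.indices[idx * self.batch_size:(idx + 1) * self.batch_size]
        return self.x[inds], self.y[inds]

    def on_epoch_end(self):
        np.random.shuffle(self.indices)

seq = ToySequence(np.arange(10) * 10, np.arange(10))
for epoch in range(2):
    seen = []
    for b in range(len(seq)):      # __len__ fixes the number of steps
        xb, yb = seq[b]            # seq[b] calls __getitem__(b)
        seen.extend(yb.tolist())
    seq.on_epoch_end()             # reshuffle before the next epoch
    print(epoch, sorted(seen) == list(range(10)))   # every sample seen once
```

Because each batch is addressed by its index rather than by calling `next()`, batches can be requested in any order, which is what makes this design friendlier to parallel loading than a plain generator.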

Both generators work fine. Now we will use the second one to train a CNN model on MNIST data. Note that this example is a bit stretched and strange: we don’t need generators for small datasets like MNIST, all of which can be loaded into RAM. The aim is just to show a different way of feeding data to a model using generators. Of course, the code can be modified to handle cases where we really do need generators.

from tensorflow.keras.models import Sequential
from tensorflow.keras import layers

(train_data, train_labels),(test_data,test_labels) = tf.keras.datasets.mnist.load_data()

train_data = train_data.reshape(60000,28,28,1)/255.
# Use idx rather than id, which would shadow Python's built-in id()
idx = np.random.permutation(len(train_labels))
training_data, training_labels = train_data[idx[0:48000]], train_labels[idx[0:48000]]
val_data, val_labels = train_data[idx[48000:60000]], train_labels[idx[48000:60000]]

model = Sequential([
    layers.Conv2D(32, 3, activation = 'relu', input_shape = (28,28,1)),
    layers.MaxPool2D(2),
    layers.Conv2D(64,5,activation = 'relu'),
    layers.MaxPool2D(2),
    layers.Flatten(),
    layers.Dense(32,activation = 'relu'),
    # Note: 'softmax' is the conventional output activation with categorical_crossentropy
    layers.Dense(10, activation = 'sigmoid')
])
model.compile(optimizer = 'adam', loss = 'categorical_crossentropy', metrics = ['accuracy'])

# A Sequence serves len(self) batches per epoch; Keras iterates over it again every epoch
class data_gen(tf.keras.utils.Sequence):
    def __init__(self, data, labels, batch_size=128):
        self.x, self.y = data, labels
        self.batch_size = batch_size
        self.indices = np.arange(self.x.shape[0])

    def __len__(self):
        return int(tf.math.ceil(self.x.shape[0] / self.batch_size))

    def __getitem__(self, idx):
        inds = self.indices[idx * self.batch_size:(idx + 1) * self.batch_size]
        batch_x = self.x[inds]
        batch_y = self.y[inds]
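        # Fix num_classes so every batch is one-hot encoded to width 10,
        # even if the largest digit happens to be missing from the batch
        return batch_x, tf.keras.utils.to_categorical(batch_y, num_classes=10)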
    
    def on_epoch_end(self):
        np.random.shuffle(self.indices)

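batch_size = 128
# Note: train_gen is built from the full train_data (all 60000 images), which
# includes the 12000 validation images. To keep the validation set unseen, use
# training_data and training_labels here and in steps_per_epoch.
train_gen = data_gen(train_data, train_labels, batch_size = batch_size)
val_gen = data_gen(val_data, val_labels, batch_size = batch_size)
steps_per_epoch = len(train_labels) // batch_size   # 60000 // 128 = 468
val_steps = len(val_labels) // batch_size           # 12000 // 128 = 93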

model.fit(train_gen, steps_per_epoch = steps_per_epoch, epochs = 10,
          validation_data = val_gen, validation_steps = val_steps)

Epoch 1/10
468/468 [==============================] - 10s 10ms/step - loss: 0.5769 - accuracy: 0.8351 - val_loss: 0.0858 - val_accuracy: 0.9716
Epoch 2/10
468/468 [==============================] - 3s 7ms/step - loss: 0.0795 - accuracy: 0.9756 - val_loss: 0.0454 - val_accuracy: 0.9860
Epoch 3/10
468/468 [==============================] - 3s 7ms/step - loss: 0.0512 - accuracy: 0.9839 - val_loss: 0.0377 - val_accuracy: 0.9883
Epoch 4/10
468/468 [==============================] - 3s 7ms/step - loss: 0.0389 - accuracy: 0.9879 - val_loss: 0.0278 - val_accuracy: 0.9908
Epoch 5/10
468/468 [==============================] - 3s 7ms/step - loss: 0.0299 - accuracy: 0.9908 - val_loss: 0.0279 - val_accuracy: 0.9899
Epoch 6/10
468/468 [==============================] - 3s 7ms/step - loss: 0.0238 - accuracy: 0.9922 - val_loss: 0.0170 - val_accuracy: 0.9950
Epoch 7/10
468/468 [==============================] - 3s 7ms/step - loss: 0.0214 - accuracy: 0.9931 - val_loss: 0.0118 - val_accuracy: 0.9966
Epoch 8/10
468/468 [==============================] - 3s 7ms/step - loss: 0.0158 - accuracy: 0.9950 - val_loss: 0.0146 - val_accuracy: 0.9952
Epoch 9/10
468/468 [==============================] - 3s 7ms/step - loss: 0.0141 - accuracy: 0.9955 - val_loss: 0.0107 - val_accuracy: 0.9974
Epoch 10/10
468/468 [==============================] - 3s 7ms/step - loss: 0.0128 - accuracy: 0.9957 - val_loss: 0.0078 - val_accuracy: 0.9977

<tensorflow.python.keras.callbacks.History at 0x2543f4e64c0>

test_loss, test_accuracy = model.evaluate(test_data.reshape(10000,28,28,1)/255., tf.keras.utils.to_categorical(test_labels), verbose = 2)
print("Test Loss:", test_loss)
print("Test Accuracy:", test_accuracy)

313/313 - 1s - loss: 0.0282 - accuracy: 0.9922
Test Loss: 0.0281691811978817
Test Accuracy: 0.9922000169754028

We have reached about 99.2% accuracy on the test set, which is not bad! As noted earlier, MNIST does not really need generators; the aim of the example is just to show an alternative implementation that uses them.

Perhaps the most detailed blog about using generators for deep learning is this one. I also found these comments helpful.

Update 1: With the release of TensorFlow 2.0, it is much easier to use the tf.data.Dataset API for handling large datasets. Generators can still be used for training with tf.keras. As a final note, use generators only if it is absolutely essential to do so; otherwise, use the tf.data.Dataset API. Check out this post for an end-to-end data pipeline and training using generators in TensorFlow 2.

Update 2: See this blog for a complete workflow for reading multiple files using the Keras Sequence class.
