Reading multiple files in Tensorflow 2


In this post, we will read multiple .csv files into Tensorflow using generators, though the method is general enough to work for other file formats as well. We will demonstrate the procedure using 500 .csv files created from random numbers, each containing 1024 numbers in a single column. The method extends easily to huge datasets involving thousands of .csv files. Once the number of files becomes large, we can’t load the whole data into memory, so we have to work with chunks of it. Generators let us do exactly that, conveniently, and a custom generator is what we will use here.

This post is self-sufficient in the sense that readers don’t have to download any data from anywhere; just run the code cells sequentially. First, a folder named “random_data” will be created in the current working directory and .csv files will be saved in it. Subsequently, files will be read from that folder and processed. Just make sure that your current working directory doesn’t have an old folder named “random_data” before running the code cells.
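
If a leftover “random_data” folder from a previous run is present, one optional way to clear it is sketched below (this helper is an addition for convenience and is not part of the original workflow):

import os
import shutil

# Optional: delete any leftover "random_data" folder from a previous run so
# that the code below can recreate it from scratch.
if os.path.exists("./random_data"):
    shutil.rmtree("./random_data")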

We will use Tensorflow 2 to run our deep learning model. Tensorflow is very flexible, and a given task can be done in several different ways; the method we use here is not the only one, and readers are encouraged to explore other approaches. Below is an outline of the three tasks considered in this post.

Outline:

  1. Create 500 ".csv" files and save them in the folder “random_data” in the current directory.
  2. Write a generator that reads data from the folder in chunks and preprocesses it.
  3. Feed the chunks of data to a CNN model and train it for several epochs.

1. Create 500 .csv files of random data

As we intend to train a CNN model for classification using our data, we will generate data for 5 different classes. The process is as follows.

  • Each .csv file will have one column of data with 1024 entries.

  • Each file will be saved under one of the fault classes (Fault_1, Fault_2, Fault_3, Fault_4, Fault_5). The dataset is balanced, meaning that each category has the same number of observations. Data files in the “Fault_1” category will be named “Fault_1_001.csv”, “Fault_1_002.csv”, “Fault_1_003.csv”, …, “Fault_1_100.csv”, and similarly for the other classes.

import numpy as np
import os
import glob
np.random.seed(1111)

First create a function that will generate random files.

def create_random_csv_files(fault_classes, number_of_files_in_each_class):
    os.mkdir("./random_data/")  # Make a directory to save created files.
    for fault_class in fault_classes:
        for i in range(number_of_files_in_each_class):
            data = np.random.rand(1024,)
            file_name = "./random_data/" + fault_class + "_" + "{0:03}".format(i+1) + ".csv"  # e.g. ./random_data/Fault_1_001.csv
            np.savetxt(file_name, data, delimiter = ",", header = "V1", comments = "")
        print(str(number_of_files_in_each_class) + " " + fault_class + " files created.")

Now use the function to create 100 files each for five fault types.

create_random_csv_files(["Fault_1", "Fault_2", "Fault_3", "Fault_4", "Fault_5"], number_of_files_in_each_class = 100)
100 Fault_1 files created.
100 Fault_2 files created.
100 Fault_3 files created.
100 Fault_4 files created.
100 Fault_5 files created.
files = glob.glob("./random_data/*")
print("Total number of files: ", len(files))
print("Showing first 10 files...")
files[:10]
Total number of files:  500
Showing first 10 files...





['./random_data/Fault_1_001.csv',
 './random_data/Fault_1_002.csv',
 './random_data/Fault_1_003.csv',
 './random_data/Fault_1_004.csv',
 './random_data/Fault_1_005.csv',
 './random_data/Fault_1_006.csv',
 './random_data/Fault_1_007.csv',
 './random_data/Fault_1_008.csv',
 './random_data/Fault_1_009.csv',
 './random_data/Fault_1_010.csv']

To extract the label from a file name, take the slice of the path that corresponds to the fault type.

print(files[0])
./random_data/Fault_1_001.csv
print(files[0][14:21])
Fault_1
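
As an aside, a slightly more robust way to pull out the fault type, rather than slicing by position, might look like the following sketch (a hypothetical alternative; the rest of the post sticks with slicing):

import os

def extract_fault_type(file_path):
    # e.g. "./random_data/Fault_1_001.csv" -> "Fault_1_001.csv" -> "Fault_1"
    base = os.path.basename(file_path)
    return "_".join(base.split("_")[:2])

print(extract_fault_type("./random_data/Fault_1_001.csv"))  # Fault_1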

Now that the data have been created, we will move to the next step: define a generator that preprocesses the time-series-like data into a matrix-like shape that a 2-D CNN can ingest.

2. Write a generator that reads data in chunks and preprocesses it

Generators are similar to functions, with one important difference: while a function produces all of its output at once, a generator produces its outputs one at a time, and only when asked. The yield keyword is what turns a function into a generator. Depending on the loop structure used inside it, a generator can run a fixed number of times or indefinitely. For our application, we will use a generator that runs indefinitely.
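
As a quick toy illustration of this behaviour (not part of the data pipeline that follows):

def yield_numbers():
    # The while True loop makes this generator run indefinitely; values are
    # produced one at a time, only when next() is called.
    i = 0
    while True:
        yield i
        i += 1

gen = yield_numbers()
print(next(gen), next(gen), next(gen))   # 0 1 2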

The following generator takes a list of file names as its first argument. The second argument, batch_size, determines how many files we process in one go, and is chosen based on how much memory we have. If all of the data fits in memory, there is no need for a generator; when the data are huge, we process them in chunks.

As we will be solving a classification problem, we have to assign a label to each raw data file. For convenience, we will use the following labels.

Class     Label
Fault_1   0
Fault_2   1
Fault_3   2
Fault_4   3
Fault_5   4

The generator will yield both data and labels.

import pandas as pd
import re            # To match regular expression for extracting labels
def data_generator(file_list, batch_size = 20):
    i = 0
    while True:
        if i*batch_size >= len(file_list):  # Once every file has been used, reset and reshuffle so the generator runs indefinitely.
            i = 0
            np.random.shuffle(file_list)
        else:
            file_chunk = file_list[i*batch_size:(i+1)*batch_size]
            data = []
            labels = []
            label_classes = ["Fault_1", "Fault_2", "Fault_3", "Fault_4", "Fault_5"]
            for file in file_chunk:
                temp = pd.read_csv(open(file,'r'))  # Change this line to read any other type of file
                data.append(temp.values.reshape(32,32,1))  # Convert column data to a 32x32 matrix with one channel
                pattern = "^" + file[14:21]  # Fault type extracted from the file name, e.g. "^Fault_1"
                for j in range(len(label_classes)):
                    if re.match(pattern, label_classes[j]):  # Pattern is matched against the label classes
                        labels.append(j)
            data = np.asarray(data).reshape(-1,32,32,1)
            labels = np.asarray(labels)
            yield data, labels
            i = i + 1

To read any other file format, change the line inside the generator that reads files. This lets us handle different formats, be it .txt, .npz, or anything else, as sketched below. Any preprocessing different from what we have done in this blog can also be done within the generator loop.
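
For instance, a hypothetical helper like the one below could replace the read line for a few common formats (a sketch only; the .txt and .npz branches, including the "data" key, are assumptions and are not used elsewhere in this post):

import numpy as np
import pandas as pd

def read_file(file):
    # Hypothetical helper: swap the body of this function (or the read line in
    # the generator) to support other formats.
    if file.endswith(".csv"):
        return pd.read_csv(file).values            # .csv with a header, as in this post
    elif file.endswith(".txt"):
        return np.loadtxt(file).reshape(-1, 1)     # plain .txt with one column of numbers (assumption)
    elif file.endswith(".npz"):
        return np.load(file)["data"]               # .npz archive with a "data" array (assumption)
    else:
        raise ValueError("Unsupported file format: " + str(file))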

Now we will check whether the generator works as intended. We will set batch_size to 10, meaning files will be read and processed in chunks of 10. The list from which each chunk of 10 is drawn can be ordered or shuffled; if the files are not yet shuffled, use np.random.shuffle(file_list) to shuffle them.

For the demonstration, we will read files from the ordered list. This will make it easier to spot any errors in the code.

generated_data = data_generator(files, batch_size = 10)
num = 0
for data, labels in generated_data:
    print(data.shape, labels.shape)
    print(labels, "<--Labels")  # Just to see the labels
    print()
    num = num + 1
    if num > 5: break
(10, 32, 32, 1) (10,)
[0 0 0 0 0 0 0 0 0 0] <--Labels

(10, 32, 32, 1) (10,)
[0 0 0 0 0 0 0 0 0 0] <--Labels

(10, 32, 32, 1) (10,)
[0 0 0 0 0 0 0 0 0 0] <--Labels

(10, 32, 32, 1) (10,)
[0 0 0 0 0 0 0 0 0 0] <--Labels

(10, 32, 32, 1) (10,)
[0 0 0 0 0 0 0 0 0 0] <--Labels

(10, 32, 32, 1) (10,)
[0 0 0 0 0 0 0 0 0 0] <--Labels

Run the above cell multiple times to observe different labels. Label 1 appears only after all the files corresponding to “Fault_1” have been read. There are 100 files for “Fault_1” and batch_size is 10, yet in the above cell we iterate over the generator only 6 times. Once the number of iterations exceeds 10, we see label 1 and subsequently the other labels. This holds only if the initial file list is not shuffled; if it is shuffled, we get mixed labels.

Now we will create a tensorflow dataset using the generator. Tensorflow datasets can conveniently be used to train tensorflow models.

A tensorflow dataset can be created from numpy arrays or from generators. Here, we will create it using a generator. Using the previously created generator as-is in a tensorflow dataset doesn’t work (readers can verify this), because a regular expression cannot compare a “string” with a “byte string”, and byte strings are what tensorflow generates by default. As a workaround, we will make small modifications to the earlier generator and use that with tensorflow datasets. Note that only three lines are modified; each modified line has a comment beside it.
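
A small standalone illustration of the problem (not part of the pipeline):

import re

# The file names that tf.data passes into the generator arrive as byte strings,
# e.g. b'./random_data/Fault_1_001.csv'. A str pattern cannot be matched
# against bytes:
try:
    re.match("^Fault_1", b"Fault_1_001.csv")
except TypeError as err:
    print(err)   # cannot use a string pattern on a bytes-like object

# Matching bytes against bytes works, which is effectively what the modified
# generator below does via pattern.numpy() and label_classes[j].numpy():
print(re.match(b"^Fault_1", b"Fault_1_001.csv"))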

import tensorflow as tf
print(tf.__version__)
2.2.0
def tf_data_generator(file_list, batch_size = 20):
    i = 0
    while True:
        if i*batch_size >= len(file_list):  
            i = 0
            np.random.shuffle(file_list)
        else:
            file_chunk = file_list[i*batch_size:(i+1)*batch_size] 
            data = []
            labels = []
            label_classes = tf.constant(["Fault_1", "Fault_2", "Fault_3", "Fault_4", "Fault_5"]) # This line has changed.
            for file in file_chunk:
                temp = pd.read_csv(open(file,'r'))
                data.append(temp.values.reshape(32,32,1)) 
                pattern = tf.constant(file[14:21])  # This line has changed.
                for j in range(len(label_classes)):
                    if re.match(pattern.numpy(), label_classes[j].numpy()):  # This line has changed.
                        labels.append(j)
            data = np.asarray(data).reshape(-1,32,32,1)
            labels = np.asarray(labels)
            yield data, labels
            i = i + 1

Test whether the modified generator works or not.

check_data = tf_data_generator(files, batch_size = 10)
num = 0
for data, labels in check_data:
    print(data.shape, labels.shape)
    print(labels, "<--Labels")
    print()
    num = num + 1
    if num > 5: break
(10, 32, 32, 1) (10,)
[0 0 0 0 0 0 0 0 0 0] <--Labels

(10, 32, 32, 1) (10,)
[0 0 0 0 0 0 0 0 0 0] <--Labels

(10, 32, 32, 1) (10,)
[0 0 0 0 0 0 0 0 0 0] <--Labels

(10, 32, 32, 1) (10,)
[0 0 0 0 0 0 0 0 0 0] <--Labels

(10, 32, 32, 1) (10,)
[0 0 0 0 0 0 0 0 0 0] <--Labels

(10, 32, 32, 1) (10,)
[0 0 0 0 0 0 0 0 0 0] <--Labels

Note that the new generator, created using a few tensorflow commands, works just like our previous generator. It can now be integrated into a tensorflow dataset.

batch_size = 15
dataset = tf.data.Dataset.from_generator(tf_data_generator,args= [files, batch_size],output_types = (tf.float32, tf.float32),
                                                output_shapes = ((None,32,32,1),(None,)))

Check whether the dataset works or not.

num = 0
for data, labels in dataset:
    print(data.shape, labels.shape)
    print(labels)
    print()
    num = num + 1
    if num > 7: break
(15, 32, 32, 1) (15,)
tf.Tensor([0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.], shape=(15,), dtype=float32)

(15, 32, 32, 1) (15,)
tf.Tensor([0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.], shape=(15,), dtype=float32)

(15, 32, 32, 1) (15,)
tf.Tensor([0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.], shape=(15,), dtype=float32)

(15, 32, 32, 1) (15,)
tf.Tensor([0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.], shape=(15,), dtype=float32)

(15, 32, 32, 1) (15,)
tf.Tensor([0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.], shape=(15,), dtype=float32)

(15, 32, 32, 1) (15,)
tf.Tensor([0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.], shape=(15,), dtype=float32)

(15, 32, 32, 1) (15,)
tf.Tensor([0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 1. 1. 1.], shape=(15,), dtype=float32)

(15, 32, 32, 1) (15,)
tf.Tensor([1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.], shape=(15,), dtype=float32)

This also works fine. Now we will train a full CNN model using the generator. As usual, we will first shuffle the data files and split them into training, validation, and test sets. Using tf_data_generator, we then create three tensorflow datasets corresponding to the training, validation, and test data respectively. Finally, we will build a simple CNN model, train it on the training dataset, check its performance on the validation dataset, and obtain predictions on the test dataset. Keep in mind that our aim is not to maximize the performance of the model; as the data are random, don’t expect to see good performance. The aim is only to build the pipeline.

3. Building data pipeline and training CNN model

Before building the data pipeline, we will first move the files of each fault class into a separate folder. This will make it convenient to split the data into training, validation, and test sets while keeping the dataset balanced.

import shutil

Create five different folders.

fault_folders = ["Fault_1", "Fault_2", "Fault_3", "Fault_4", "Fault_5"]
for folder_name in fault_folders:
    os.mkdir(os.path.join("./random_data", folder_name))

Move files into those folders.

for file in files:
    pattern = "^" + file[14:21]
    for j in range(len(fault_folders)):
        if re.match(pattern, fault_folders[j]):
            dest = os.path.join("./random_data/", fault_folders[j])
            shutil.move(file, dest)
glob.glob("./random_data/*")
['./random_data/Fault_1',
 './random_data/Fault_2',
 './random_data/Fault_3',
 './random_data/Fault_4',
 './random_data/Fault_5']
glob.glob("./random_data/Fault_1/*")[:10] # Showing first 10 files of Fault_1 folder
['./random_data/Fault_1/Fault_1_001.csv',
 './random_data/Fault_1/Fault_1_002.csv',
 './random_data/Fault_1/Fault_1_003.csv',
 './random_data/Fault_1/Fault_1_004.csv',
 './random_data/Fault_1/Fault_1_005.csv',
 './random_data/Fault_1/Fault_1_006.csv',
 './random_data/Fault_1/Fault_1_007.csv',
 './random_data/Fault_1/Fault_1_008.csv',
 './random_data/Fault_1/Fault_1_009.csv',
 './random_data/Fault_1/Fault_1_010.csv']
glob.glob("./random_data/Fault_3/*")[:10] # Showing first 10 files of Fault_3 folder
['./random_data/Fault_3/Fault_3_001.csv',
 './random_data/Fault_3/Fault_3_002.csv',
 './random_data/Fault_3/Fault_3_003.csv',
 './random_data/Fault_3/Fault_3_004.csv',
 './random_data/Fault_3/Fault_3_005.csv',
 './random_data/Fault_3/Fault_3_006.csv',
 './random_data/Fault_3/Fault_3_007.csv',
 './random_data/Fault_3/Fault_3_008.csv',
 './random_data/Fault_3/Fault_3_009.csv',
 './random_data/Fault_3/Fault_3_010.csv']

Prepare the data for the training, validation, and test sets. For each fault type, we will keep 70 files for training, 10 for validation, and 20 for testing.

fault_1_files = glob.glob("./random_data/Fault_1/*")
fault_2_files = glob.glob("./random_data/Fault_2/*")
fault_3_files = glob.glob("./random_data/Fault_3/*")
fault_4_files = glob.glob("./random_data/Fault_4/*")
fault_5_files = glob.glob("./random_data/Fault_5/*")
from sklearn.model_selection import train_test_split
fault_1_train, fault_1_test = train_test_split(fault_1_files, test_size = 20, random_state = 5)
fault_2_train, fault_2_test = train_test_split(fault_2_files, test_size = 20, random_state = 54)
fault_3_train, fault_3_test = train_test_split(fault_3_files, test_size = 20, random_state = 543)
fault_4_train, fault_4_test = train_test_split(fault_4_files, test_size = 20, random_state = 5432)
fault_5_train, fault_5_test = train_test_split(fault_5_files, test_size = 20, random_state = 54321)
fault_1_train, fault_1_val = train_test_split(fault_1_train, test_size = 10, random_state = 1)
fault_2_train, fault_2_val = train_test_split(fault_2_train, test_size = 10, random_state = 12)
fault_3_train, fault_3_val = train_test_split(fault_3_train, test_size = 10, random_state = 123)
fault_4_train, fault_4_val = train_test_split(fault_4_train, test_size = 10, random_state = 1234)
fault_5_train, fault_5_val = train_test_split(fault_5_train, test_size = 10, random_state = 12345)
train_file_names = fault_1_train + fault_2_train + fault_3_train + fault_4_train + fault_5_train
validation_file_names = fault_1_val + fault_2_val + fault_3_val + fault_4_val + fault_5_val
test_file_names = fault_1_test + fault_2_test + fault_3_test + fault_4_test + fault_5_test

# Shuffle data (We don't need to shuffle validation and test data)
np.random.shuffle(train_file_names)
print("Number of train_files:" ,len(train_file_names))
print("Number of validation_files:" ,len(validation_file_names))
print("Number of test_files:" ,len(test_file_names))
Number of train_files: 350
Number of validation_files: 50
Number of test_files: 100
batch_size = 10
train_dataset = tf.data.Dataset.from_generator(tf_data_generator, args = [train_file_names, batch_size], 
                                              output_shapes = ((None,32,32,1),(None,)),
                                              output_types = (tf.float32, tf.float32))

validation_dataset = tf.data.Dataset.from_generator(tf_data_generator, args = [validation_file_names, batch_size],
                                                   output_shapes = ((None,32,32,1),(None,)),
                                                   output_types = (tf.float32, tf.float32))

test_dataset = tf.data.Dataset.from_generator(tf_data_generator, args = [test_file_names, batch_size],
                                             output_shapes = ((None,32,32,1),(None,)),
                                             output_types = (tf.float32, tf.float32))

Now create the model.

from tensorflow.keras import layers
model = tf.keras.Sequential([
    layers.Conv2D(16, 3, activation = "relu", input_shape = (32,32,1)),
    layers.MaxPool2D(2),
    layers.Conv2D(32, 3, activation = "relu"),
    layers.MaxPool2D(2),
    layers.Flatten(),
    layers.Dense(16, activation = "relu"),
    layers.Dense(5, activation = "softmax")
])
model.summary()
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
conv2d (Conv2D)              (None, 30, 30, 16)        160       
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 15, 15, 16)        0         
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 13, 13, 32)        4640      
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 6, 6, 32)          0         
_________________________________________________________________
flatten (Flatten)            (None, 1152)              0         
_________________________________________________________________
dense (Dense)                (None, 16)                18448     
_________________________________________________________________
dense_1 (Dense)              (None, 5)                 85        
=================================================================
Total params: 23,333
Trainable params: 23,333
Non-trainable params: 0
_________________________________________________________________

Compile the model.

model.compile(loss = "sparse_categorical_crossentropy", optimizer = "adam", metrics = ["accuracy"])

Before we fit the model, we have to do one important calculation. Remember that our generators are infinite loops, so if no stopping criterion is given, they will run indefinitely. But we want our model to train for, say, 10 epochs, so the generator should loop over the data files just 10 times and no more. This is achieved by setting the arguments steps_per_epoch and validation_steps to the desired numbers in model.fit(). Similarly, while evaluating the model, we need to set the steps argument in model.evaluate().

There are 350 files in the training set and batch_size is 10, so 35 runs of the generator correspond to one epoch. Therefore, we should set steps_per_epoch to 35. Similarly, validation_steps = 5 and, in model.evaluate(), steps = 10.

steps_per_epoch = int(np.ceil(len(train_file_names)/batch_size))
validation_steps = int(np.ceil(len(validation_file_names)/batch_size))
steps = int(np.ceil(len(test_file_names)/batch_size))
print("steps_per_epoch = ", steps_per_epoch)
print("validation_steps = ", validation_steps)
print("steps = ", steps)
steps_per_epoch =  35
validation_steps =  5
steps =  10
model.fit(train_dataset, validation_data = validation_dataset, steps_per_epoch = steps_per_epoch,
         validation_steps = validation_steps, epochs = 10)
Epoch 1/10
35/35 [==============================] - 1s 40ms/step - loss: 1.6268 - accuracy: 0.2029 - val_loss: 1.6111 - val_accuracy: 0.2000
Epoch 2/10
35/35 [==============================] - 1s 36ms/step - loss: 1.6101 - accuracy: 0.2114 - val_loss: 1.6079 - val_accuracy: 0.2600
Epoch 3/10
35/35 [==============================] - 1s 35ms/step - loss: 1.6066 - accuracy: 0.2343 - val_loss: 1.6076 - val_accuracy: 0.2000
Epoch 4/10
35/35 [==============================] - 1s 34ms/step - loss: 1.5993 - accuracy: 0.2143 - val_loss: 1.6085 - val_accuracy: 0.2400
Epoch 5/10
35/35 [==============================] - 1s 34ms/step - loss: 1.5861 - accuracy: 0.2657 - val_loss: 1.6243 - val_accuracy: 0.2000
Epoch 6/10
35/35 [==============================] - 1s 35ms/step - loss: 1.5620 - accuracy: 0.3514 - val_loss: 1.6363 - val_accuracy: 0.2000
Epoch 7/10
35/35 [==============================] - 1s 36ms/step - loss: 1.5370 - accuracy: 0.2857 - val_loss: 1.6171 - val_accuracy: 0.2600
Epoch 8/10
35/35 [==============================] - 1s 35ms/step - loss: 1.5015 - accuracy: 0.4057 - val_loss: 1.6577 - val_accuracy: 0.2000
Epoch 9/10
35/35 [==============================] - 1s 35ms/step - loss: 1.4415 - accuracy: 0.5086 - val_loss: 1.6484 - val_accuracy: 0.1400
Epoch 10/10
35/35 [==============================] - 1s 36ms/step - loss: 1.3363 - accuracy: 0.6143 - val_loss: 1.6672 - val_accuracy: 0.2200





<tensorflow.python.keras.callbacks.History at 0x7fcab40f6150>
test_loss, test_accuracy = model.evaluate(test_dataset, steps = 10)
10/10 [==============================] - 0s 25ms/step - loss: 1.6974 - accuracy: 0.1500
print("Test loss: ", test_loss)
print("Test accuracy:", test_accuracy)
Test loss:  1.6973648071289062
Test accuracy: 0.15000000596046448

As expected, the model performs terribly.

How to make predictions?

Until now, we have evaluated our model on a held-out test set for which both data and labels were known, so we could measure its performance. But oftentimes we don’t have access to the true labels of a test set and instead have to make predictions on the available data. This is the case in online competitions, where we must submit predictions on a test set whose labels we don’t know. We will call this set (without any labels) the prediction set. The naming convention is arbitrary, but we will stick with it.

If the whole prediction set fits into memory, we can just call model.predict() on the data and then use np.argmax() to obtain the predicted class labels. Otherwise, we can read the prediction-set files in chunks, make predictions on each chunk, and finally concatenate the results, roughly as sketched below.
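
A minimal sketch of that chunk-wise approach, assuming a trained model and a list of prediction file paths like the prediction_files list created below:

# Chunk-wise prediction without a generator (illustrative sketch only).
all_predictions = []
chunk_size = 10
for k in range(0, len(prediction_files), chunk_size):
    chunk = prediction_files[k:k + chunk_size]
    data = np.asarray([pd.read_csv(f).values.reshape(32, 32, 1) for f in chunk])
    all_predictions.append(model.predict(data))
all_predictions = np.concatenate(all_predictions, axis=0)
predicted_labels = np.argmax(all_predictions, axis=1)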

Yet another, more systematic way of doing this is to write a generator that reads prediction-set files in chunks and to make predictions on those chunks. We will show how this approach works. As we don’t have a prediction set yet, we will first create some files and save them as the prediction set.

def create_prediction_set(num_files = 20):
    os.mkdir("./random_data/prediction_set")
    for i in range(num_files):
        data = np.random.randn(1024,)
        file_name = "./random_data/prediction_set/" + "file_" + "{0:03}".format(i+1) + ".csv"  # e.g. ./random_data/prediction_set/file_001.csv
        np.savetxt(file_name, data, delimiter = ",", header = "V1", comments = "")
    print(str(num_files) + " files created in prediction set.")

Create some files for the prediction set.

create_prediction_set(num_files = 55)
55 files created in prediction set.
prediction_files = glob.glob("./random_data/prediction_set/*")
print("Total number of files: ", len(prediction_files))
print("Showing first 10 files...")
prediction_files[:10]
Total number of files:  55
Showing first 10 files...





['./random_data/prediction_set/file_001.csv',
 './random_data/prediction_set/file_002.csv',
 './random_data/prediction_set/file_003.csv',
 './random_data/prediction_set/file_004.csv',
 './random_data/prediction_set/file_005.csv',
 './random_data/prediction_set/file_006.csv',
 './random_data/prediction_set/file_007.csv',
 './random_data/prediction_set/file_008.csv',
 './random_data/prediction_set/file_009.csv',
 './random_data/prediction_set/file_010.csv']

Now, we will create a generator to read these files in chunks. This generator will be slightly different from our previous generator. Firstly, we don’t want the generator to run indefinitely. Secondly, we don’t have any labels. So this generator should only yield data. This is how we achieve that.

def generator_for_prediction(file_list, batch_size = 20):
    i = 0
    while i <= (len(file_list)/batch_size):           # Stop once the whole list has been read exactly once.
        if i == np.floor(len(file_list)/batch_size):  # The last chunk may contain fewer than batch_size files.
            file_chunk = file_list[i*batch_size:len(file_list)]
            if len(file_chunk)==0:
                break
        else:
            file_chunk = file_list[i*batch_size:(i+1)*batch_size]
        data = []
        for file in file_chunk:
            temp = pd.read_csv(open(file,'r'))
            data.append(temp.values.reshape(32,32,1))
        data = np.asarray(data).reshape(-1,32,32,1)
        yield data  # Only data, no labels.
        i = i + 1

Check whether the generator works or not.

pred_gen = generator_for_prediction(prediction_files,  batch_size = 10)
for data in pred_gen:
    print(data.shape)
(10, 32, 32, 1)
(10, 32, 32, 1)
(10, 32, 32, 1)
(10, 32, 32, 1)
(10, 32, 32, 1)
(5, 32, 32, 1)

Create a tensorflow dataset.

batch_size = 10
prediction_dataset = tf.data.Dataset.from_generator(generator_for_prediction,args=[prediction_files, batch_size],
                                                 output_shapes=(None,32,32,1), output_types=(tf.float32))
steps = int(np.ceil(len(prediction_files)/batch_size))
predictions = model.predict(prediction_dataset,steps = steps)
print("Shape of prediction array: ", predictions.shape)
predictions
Shape of prediction array:  (55, 5)

array([[0.28138927, 0.3383776 , 0.17806269, 0.18918239, 0.01298801],
       [0.16730548, 0.20139892, 0.32996896, 0.16305783, 0.13826886],
       [0.08079846, 0.35669118, 0.4091237 , 0.13286887, 0.02051783],
       [0.01697877, 0.79075295, 0.17063092, 0.01676028, 0.00487713],
       [0.19006915, 0.02615157, 0.39364284, 0.09650648, 0.29362988],
       [0.05416911, 0.682985  , 0.19086388, 0.0668761 , 0.00510592],
       [0.21325852, 0.27782622, 0.10314588, 0.39539766, 0.01037181],
       [0.23633875, 0.3308002 , 0.30727112, 0.09573858, 0.02985144],
       [0.06442448, 0.34153524, 0.47356713, 0.08497778, 0.03549532],
       [0.37901744, 0.32311487, 0.12875995, 0.16359715, 0.00551067],
       [0.12227482, 0.49774405, 0.26021793, 0.1060346 , 0.01372868],
       [0.07139122, 0.17324339, 0.5490784 , 0.10136751, 0.10491937],
       [0.18757634, 0.2833261 , 0.3367256 , 0.14390293, 0.04846917],
       [0.23564269, 0.2800771 , 0.19150141, 0.2686058 , 0.02417296],
       [0.4835618 , 0.03908279, 0.09785527, 0.31918615, 0.06031401],
       [0.03285189, 0.5866938 , 0.3362034 , 0.0313101 , 0.01294078],
       [0.31367007, 0.05583594, 0.24806198, 0.2707511 , 0.1116809 ],
       [0.11204866, 0.05982558, 0.44611645, 0.16678827, 0.21522103],
       [0.04504926, 0.7100154 , 0.16532828, 0.0747861 , 0.00482096],
       [0.22441828, 0.01738338, 0.36729604, 0.0961706 , 0.29473177],
       [0.22392808, 0.23958267, 0.11669649, 0.41423568, 0.00555711],
       [0.11768451, 0.16422512, 0.49695587, 0.13158153, 0.08955302],
       [0.04941175, 0.31670955, 0.46190843, 0.12606393, 0.04590632],
       [0.19507076, 0.03239974, 0.3885634 , 0.14447391, 0.23949222],
       [0.3530666 , 0.08613478, 0.11636773, 0.4088019 , 0.03562902],
       [0.12874755, 0.3140329 , 0.3858064 , 0.1278494 , 0.0435637 ],
       [0.3001929 , 0.02791574, 0.11502622, 0.5044482 , 0.05241694],
       [0.0929171 , 0.1467541 , 0.6005069 , 0.06660035, 0.09322156],
       [0.10712272, 0.5518521 , 0.2632791 , 0.06340495, 0.01434106],
       [0.27723876, 0.25847596, 0.18952209, 0.25228631, 0.02247689],
       [0.12578863, 0.44461673, 0.25048074, 0.14304985, 0.03606399],
       [0.09593316, 0.06914104, 0.49921316, 0.1389045 , 0.19680816],
       [0.22185169, 0.0878747 , 0.33703303, 0.23808932, 0.11515129],
       [0.0850782 , 0.06328611, 0.57307494, 0.08615369, 0.19240707],
       [0.41479778, 0.07033634, 0.22154689, 0.2007963 , 0.09252268],
       [0.22052608, 0.10761442, 0.33570328, 0.25846007, 0.07769614],
       [0.03679338, 0.4369671 , 0.42453632, 0.07080499, 0.03089818],
       [0.17414902, 0.3666445 , 0.26953018, 0.16861232, 0.02106389],
       [0.04334973, 0.04427214, 0.5819794 , 0.02825493, 0.30214384],
       [0.23099631, 0.31964707, 0.31392127, 0.11803907, 0.01739628],
       [0.03072637, 0.6739159 , 0.25826213, 0.0309101 , 0.00618558],
       [0.20030826, 0.05058228, 0.42536664, 0.14415787, 0.17958501],
       [0.25894472, 0.0410106 , 0.25135538, 0.15487678, 0.29381245],
       [0.31544876, 0.05200702, 0.20838396, 0.31984535, 0.10431487],
       [0.10788545, 0.31769663, 0.44471365, 0.08522549, 0.04447879],
       [0.01864015, 0.35556656, 0.551683  , 0.02805553, 0.04605483],
       [0.20043266, 0.1211144 , 0.26670808, 0.33885604, 0.07288874],
       [0.29432756, 0.19128233, 0.19503927, 0.2826192 , 0.03673161],
       [0.2151616 , 0.05391361, 0.34218988, 0.11304423, 0.27569073],
       [0.241943  , 0.05663572, 0.23858468, 0.36390153, 0.09893499],
       [0.24665013, 0.22702417, 0.33673155, 0.11996701, 0.06962712],
       [0.05448309, 0.33466634, 0.49283266, 0.07876839, 0.03924957],
       [0.3060696 , 0.03565398, 0.33453086, 0.12989788, 0.19384763],
       [0.1417291 , 0.40642622, 0.20021752, 0.22896914, 0.02265806],
       [0.10395318, 0.20624556, 0.46823606, 0.12000521, 0.10156006]],
      dtype=float32)

Each prediction is a 5-dimensional vector, because we used 5 neurons in the output layer with a softmax activation. The entries of each output vector add up to 1, so they can be interpreted as probabilities. We therefore classify each input into the class with the maximum predicted probability, which we obtain with np.argmax().

np.argmax(predictions, axis = 1)
array([1, 2, 2, 1, 2, 1, 3, 1, 2, 0, 1, 2, 2, 1, 0, 1, 0, 2, 1, 2, 3, 2,
       2, 2, 3, 2, 3, 2, 1, 0, 1, 2, 2, 2, 0, 2, 1, 1, 2, 1, 1, 2, 4, 3,
       2, 2, 3, 0, 2, 3, 2, 2, 2, 1, 2])

The data are randomly generated, so we should not be surprised by this result. Also note that, for each input, the softmax outputs are close to each other, which means the network is not very confident about its classification.

This brings us to the end of the blog. As planned at the beginning, we have created random data files, written a generator, and trained a model using that generator. The code above can be tweaked slightly to read file types other than .csv, and we can now train our model without worrying about the data size: whether it is 10 GB or 750 GB, the approach works the same.

As a final note, I want to stress that this is not the only way to do the task. As mentioned previously, in Tensorflow you can do the same thing in several different ways. The approach I have chosen simply seemed natural to me; I have strived for neither efficiency nor elegance. If readers have a better idea, I would be happy to hear of it.

I hope this blog will be of help to readers. Please bring any errors or omissions to my notice.

Update 1: While generators are convenient for handling chunks of data from a large dataset, they have limited portability and scalability. Therefore, in Tensorflow, Sequences are preferred over generators. See this blog for a complete workflow for reading multiple files using a Sequence, and the sketch below for the basic idea.
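
For reference only, a bare-bones Sequence for the same data might look like the sketch below (an illustration, assuming integer labels are supplied alongside the file list; see the linked blog for the complete workflow):

import numpy as np
import pandas as pd
import tensorflow as tf

class CSVSequence(tf.keras.utils.Sequence):
    # Minimal sketch: returns one batch of (data, labels) per index.
    def __init__(self, file_list, labels, batch_size=10):
        self.file_list = file_list
        self.labels = labels
        self.batch_size = batch_size

    def __len__(self):
        return int(np.ceil(len(self.file_list) / self.batch_size))

    def __getitem__(self, idx):
        files = self.file_list[idx * self.batch_size:(idx + 1) * self.batch_size]
        labels = self.labels[idx * self.batch_size:(idx + 1) * self.batch_size]
        data = np.asarray([pd.read_csv(f).values.reshape(32, 32, 1) for f in files])
        return data, np.asarray(labels)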

Update 2: If along with reading, one has to perform complex transformations on extracted data (say, doing spectrogram on each segment of data, etc.), the naive approach presented in this blog may turn out to be slow. But there are ways to make these computations faster. One such speedup technique can be found at this blog.

Last modified: 27th April, 2020.

Biswajit Sahoo