Reading multiple csv files in PyTorch

Dec 19, 2020 · 18 min read

In many engineering applications, data are stored in CSV (Comma Separated Values) files. In big data applications, it’s not uncommon to obtain thousands of csv files. As the number of files grows, at some point we can no longer load the whole dataset into the computer’s memory, and in deep learning applications such datasets are increasingly common. In that case, we have to read the data in chunks so that the model can still be trained on the whole dataset.

There are many ways to achieve this objective. In this post, we will adopt an approach that allows us to read csv files in chunks and preprocess those files in whatever way we like. Then we can pass the processed data to train any deep learning model. Though we will use csv files in this post, the method is general enough to work for other file formats (such as .txt, .npz, etc.) as well. We will demonstrate the procedure using 500 csv files, but the method can easily be extended to huge datasets involving thousands of csv files.

This post is self-sufficient in the sense that readers don’t have to download any data from anywhere. Just run the code cells below in order. First, a folder named “random_data” will be created in the current working directory and .csv files will be saved in it. Subsequently, files will be read from that folder and processed. Just make sure that your current working directory doesn’t already contain a folder named “random_data”. We will use PyTorch to run our deep learning model. For efficiency in data loading, we will use PyTorch dataloaders.

Outline:

  1. Create 500 “.csv” files and save them in the folder “random_data” in the current working directory.
  2. Create a custom dataloader.
  3. Feed the chunks of data to a CNN model and train it for several epochs.
  4. Make predictions on new data for which labels are not known.

1. Create 500 .csv files of random data

As we intend to train a CNN model for classification using our data, we will generate data for 5 different classes. The dataset that we will create is a contrived one, but readers can modify the approach slightly to suit their needs. The process is as follows.

  • Each .csv file will have one column of data with 1024 entries.
  • Each file name will start with one of the following class names (Fault_1, Fault_2, Fault_3, Fault_4, Fault_5). The dataset is balanced, meaning that each category has the same number of observations (100 files). Data files in the “Fault_1” category will be named “Fault_1_001.csv”, “Fault_1_002.csv”, “Fault_1_003.csv”, …, “Fault_1_100.csv”. Similarly for other classes.
import numpy as np
import os
import glob
np.random.seed(1111)

First create a function that will generate random files.

def create_random_csv_files(fault_classes, number_of_files_in_each_class):
    os.mkdir("./random_data/")  # Make a directory to save created files. This fails if the folder already exists.
    for fault_class in fault_classes:
        for i in range(number_of_files_in_each_class):
            data = np.random.rand(1024,)
            file_name = "./random_data/{}_{:03}.csv".format(fault_class, i + 1)  # e.g. ./random_data/Fault_1_001.csv
            np.savetxt(file_name, data, delimiter = ",", header = "V1", comments = "")
        print("{} {} files created.".format(number_of_files_in_each_class, fault_class))

Now use the function to create 100 files each for five fault types.

create_random_csv_files(["Fault_1", "Fault_2", "Fault_3", "Fault_4", "Fault_5"], number_of_files_in_each_class = 100)
100 Fault_1 files created.
100 Fault_2 files created.
100 Fault_3 files created.
100 Fault_4 files created.
100 Fault_5 files created.
files = sorted(glob.glob("./random_data/*"))  # glob does not guarantee any order, so sort explicitly
print("Total number of files: ", len(files))
print("Showing first 10 files...")
files[:10]
Total number of files:  500
Showing first 10 files...

['./random_data/Fault_1_001.csv',
 './random_data/Fault_1_002.csv',
 './random_data/Fault_1_003.csv',
 './random_data/Fault_1_004.csv',
 './random_data/Fault_1_005.csv',
 './random_data/Fault_1_006.csv',
 './random_data/Fault_1_007.csv',
 './random_data/Fault_1_008.csv',
 './random_data/Fault_1_009.csv',
 './random_data/Fault_1_010.csv']

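To extract labels from a file name, extract the part of the file name that corresponds to the fault type. The slice below works because every path starts with the 14-character prefix “./random_data/” and every class name is 7 characters long.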

print(files[0])
./random_data/Fault_1_001.csv
print(files[0][14:21])
Fault_1
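
Slicing by character position only works while every path has the same prefix length. As a hedged alternative (not the approach used in the rest of this post), the class name can be taken from the file name itself. The helper name label_from_path and the LABELS dictionary below are illustrative assumptions; the integer codes follow the class-to-label table given in the next section.

```python
import os

# Hypothetical mapping from class name to integer label,
# mirroring the table used later in this post.
LABELS = {"Fault_1": 0, "Fault_2": 1, "Fault_3": 2, "Fault_4": 3, "Fault_5": 4}

def label_from_path(path):
    """Return (class_name, integer_label) for a path like ./random_data/Fault_1_001.csv."""
    stem = os.path.splitext(os.path.basename(path))[0]   # "Fault_1_001"
    class_name, _, _ = stem.rpartition("_")               # drop the trailing file counter
    return class_name, LABELS[class_name]

print(label_from_path("./random_data/Fault_1_001.csv"))          # ('Fault_1', 0)
print(label_from_path("./random_data/Fault_3/Fault_3_042.csv"))  # ('Fault_3', 2)
```

Because only the base name is inspected, the same helper keeps working if the files are later moved into sub-folders.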

Now that the data have been created, we go to the next step: create a custom dataloader and reshape the time-series-like data into a matrix so that a 2-D CNN can ingest it. We reshape the data this way just to illustrate the point. Readers should use their own preprocessing steps.

2. Write a custom dataloader

We first have to create a Dataset class. Then we can pass the dataset to the dataloader. Every map-style dataset class must implement the __len__ method, which returns the length of the dataset, and the __getitem__ method, which returns the item at a given index. In our case, an item is the processed version of one chunk of data, so the length of the dataset is the number of chunks.

The following dataset class takes a list of file names as its first argument. The second argument is batch_size, which determines how many files we process in one go. As we will be solving a classification problem, we have to assign a label to each file. We will use the following labels for convenience.

Class      Label
Fault_1    0
Fault_2    1
Fault_3    2
Fault_4    3
Fault_5    4
import pandas as pd
import re   
import torch
from torch.utils.data import Dataset
print("PyTorch Version: ", torch.__version__)
PyTorch Version:  1.7.1
class CustomDataset(Dataset):
    def __init__(self, filenames, batch_size):
        # `filenames` is a list of strings that contains all file names.
        # `batch_size` determines the number of files that we want to read in a chunk.
        self.filenames = filenames
        self.batch_size = batch_size
        self.label_classes = ["Fault_1", "Fault_2", "Fault_3", "Fault_4", "Fault_5"]

    def __len__(self):
        return int(np.ceil(len(self.filenames) / float(self.batch_size)))   # Number of chunks.

    def __getitem__(self, idx):  # idx means index of the chunk.
        # The following condition is needed because a plain `for` loop over the dataset keeps calling
        # __getitem__ with idx = 0, 1, 2, ... until IndexError is raised. Without it, for our particular
        # example, the iteration would never stop. Readers can verify this by removing this condition.
        if idx >= len(self):
            raise IndexError

        # In this method, we do all the preprocessing.
        # First read data from files in a chunk. Preprocess it. Extract labels. Then return data and labels.
        batch_x = self.filenames[idx * self.batch_size:(idx + 1) * self.batch_size]   # One batch of file names from the list `filenames`.
        data = []
        labels = []
        for file in batch_x:
            temp = pd.read_csv(file)  # Change this line to read any other type of file
            data.append(temp.values.reshape(32, 32, 1))  # Convert column data to matrix like data with one channel
            fault_class = os.path.basename(file)[:7]  # e.g. "Fault_1" from "Fault_1_001.csv"
            labels.append(self.label_classes.index(fault_class))  # Position in label_classes is the label
        data = np.asarray(data).reshape(-1, 1, 32, 32)  # Because of PyTorch's channel first convention
        labels = np.asarray(labels)

        return data, labels
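
The IndexError check in __getitem__ matters because a plain for loop over an object that defines __getitem__ but not __iter__ falls back to Python's older sequence protocol: it calls __getitem__(0), __getitem__(1), and so on until IndexError is raised. A minimal sketch with plain Python lists (the class name ChunkedList is made up for illustration) shows both the chunk arithmetic and the stopping condition:

```python
import math

class ChunkedList:
    """Toy stand-in for the dataset: serves a list in chunks of `batch_size`."""
    def __init__(self, items, batch_size):
        self.items = items
        self.batch_size = batch_size

    def __len__(self):
        # Number of chunks; the last chunk may be smaller than batch_size.
        return math.ceil(len(self.items) / self.batch_size)

    def __getitem__(self, idx):
        if idx >= len(self):
            raise IndexError   # tells the for loop to stop
        start = idx * self.batch_size
        return self.items[start:start + self.batch_size]

chunks = ChunkedList(list(range(55)), batch_size=10)
print(len(chunks))                 # 6
print([len(c) for c in chunks])    # [10, 10, 10, 10, 10, 5]
```

Without the check, slicing past the end of a list returns an empty list instead of raising an error, so the loop would keep producing empty chunks forever.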

To read any other file format, inside the __getitem__ method change the line that reads files. This will enable us to read different file formats, be it .txt or .npz or any other. Preprocessing of data, different from what we have done in this blog, can be done within the __getitem__ method.
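
As a hedged illustration of swapping the reader, the helper below picks a loader from the file extension using NumPy only. The function name read_signal and the array key "signal" inside the .npz file are assumptions made for this sketch, and a single header row is assumed for .csv and .txt files, as in the files created above.

```python
import os
import tempfile
import numpy as np

def read_signal(path, npz_key="signal"):
    """Load one 1-D signal from a .csv, .txt, .npy or .npz file."""
    ext = os.path.splitext(path)[1].lower()
    if ext in (".csv", ".txt"):
        return np.loadtxt(path, delimiter=",", skiprows=1)   # skip the header row
    if ext == ".npy":
        return np.load(path)
    if ext == ".npz":
        with np.load(path) as archive:
            return archive[npz_key]                          # assumed key name
    raise ValueError("Unsupported file type: " + ext)

# Quick check with temporary files.
with tempfile.TemporaryDirectory() as tmp:
    signal = np.random.rand(1024)
    np.savetxt(os.path.join(tmp, "a.csv"), signal, delimiter=",", header="V1", comments="")
    np.savez(os.path.join(tmp, "a.npz"), signal=signal)
    from_csv = read_signal(os.path.join(tmp, "a.csv"))
    from_npz = read_signal(os.path.join(tmp, "a.npz"))

print(from_csv.shape, from_npz.shape)
```

Inside __getitem__, a call such as read_signal(file).reshape(32, 32, 1) could then take the place of the pandas line, with the rest of the method unchanged.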

Now we will check whether the dataset works as intended. We will set batch_size to 10. This means that files will be read and processed in chunks of 10. The list of files from which the 10 are chosen can be an ordered list or a shuffled list. If the list is not shuffled, use np.random.shuffle(file_list) to shuffle it in place.

In the demonstration, we will read files from an ordered list. This will help us check any errors in the code.

check_dataset = CustomDataset(filenames = files, batch_size = 10)
check_dataset.__len__()
50
for i, (data, labels) in enumerate(check_dataset):
  print(data.shape, labels.shape)
  print(labels)
  if i == 5: break
(10, 1, 32, 32) (10,)
[0 0 0 0 0 0 0 0 0 0]
(10, 1, 32, 32) (10,)
[0 0 0 0 0 0 0 0 0 0]
(10, 1, 32, 32) (10,)
[0 0 0 0 0 0 0 0 0 0]
(10, 1, 32, 32) (10,)
[0 0 0 0 0 0 0 0 0 0]
(10, 1, 32, 32) (10,)
[0 0 0 0 0 0 0 0 0 0]
(10, 1, 32, 32) (10,)
[0 0 0 0 0 0 0 0 0 0]

Raise the iteration limit in the above cell to observe other labels. Label 1 appears only after all the files corresponding to “Fault_1” have been read. There are 100 files for “Fault_1” and we have set batch_size to 10, so the first 10 chunks contain only label 0. In the above cell we iterate over the dataset only 6 times. When the number of iterations exceeds 10, we see label 1 and subsequently the other labels. This happens only because our file list is ordered. If the list is shuffled, each chunk will contain a mix of labels.

To train a deep learning model, we need to create a dataloader from the dataset. Dataloaders offer multi-worker, multi-processing capabilities without requiring us to write code for that. So let’s first create a dataloader from the dataset.

from torch.utils.data import DataLoader
dataloader = DataLoader(check_dataset,batch_size = None, shuffle = True) # Here we select batch size to be None as we have already batched our data in dataset.

Check whether the dataloader works or not.

for i, (data,labels) in enumerate(dataloader):
    print(data.shape, labels.shape)
    print(labels)  # Just to see the labels.
    if i == 3: break
torch.Size([10, 1, 32, 32]) torch.Size([10])
tensor([4, 4, 4, 4, 4, 4, 4, 4, 4, 4])
torch.Size([10, 1, 32, 32]) torch.Size([10])
tensor([2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
torch.Size([10, 1, 32, 32]) torch.Size([10])
tensor([3, 3, 3, 3, 3, 3, 3, 3, 3, 3])
torch.Size([10, 1, 32, 32]) torch.Size([10])
tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

Now that the dataloader works, we will use it to train a simple deep learning model. The focus of this post is not on the model itself, so we will use a very simple one. Readers who want a different model can just replace ours with theirs.

3. Feed chunks of data to a CNN model and train it for several epochs

But before we build the model and train it, we will first move our files to different folders depending on their fault type. We do so as it will be convenient later to create a training, validation, and test set from the data.

import shutil

Create five folders, one for each fault type.

fault_folders = ["Fault_1", "Fault_2", "Fault_3", "Fault_4", "Fault_5"]
for folder_name in fault_folders:
    os.mkdir(os.path.join("./random_data", folder_name))

Move files into those folders.

for file in files:
    fault_class = os.path.basename(file)[:7]  # e.g. "Fault_1" from "Fault_1_001.csv"
    if fault_class in fault_folders:
        dest = os.path.join("./random_data/", fault_class)
        shutil.move(file, dest)
glob.glob("./random_data/*")
['./random_data/Fault_1',
 './random_data/Fault_2',
 './random_data/Fault_3',
 './random_data/Fault_4',
 './random_data/Fault_5']
glob.glob("./random_data/Fault_1/*")[:10] # Showing first 10 files of Fault_1 folder
['./random_data/Fault_1/Fault_1_001.csv',
 './random_data/Fault_1/Fault_1_002.csv',
 './random_data/Fault_1/Fault_1_003.csv',
 './random_data/Fault_1/Fault_1_004.csv',
 './random_data/Fault_1/Fault_1_005.csv',
 './random_data/Fault_1/Fault_1_006.csv',
 './random_data/Fault_1/Fault_1_007.csv',
 './random_data/Fault_1/Fault_1_008.csv',
 './random_data/Fault_1/Fault_1_009.csv',
 './random_data/Fault_1/Fault_1_010.csv']
glob.glob("./random_data/Fault_3/*")[:10] # Showing first 10 files of Fault_3 folder
['./random_data/Fault_3/Fault_3_001.csv',
 './random_data/Fault_3/Fault_3_002.csv',
 './random_data/Fault_3/Fault_3_003.csv',
 './random_data/Fault_3/Fault_3_004.csv',
 './random_data/Fault_3/Fault_3_005.csv',
 './random_data/Fault_3/Fault_3_006.csv',
 './random_data/Fault_3/Fault_3_007.csv',
 './random_data/Fault_3/Fault_3_008.csv',
 './random_data/Fault_3/Fault_3_009.csv',
 './random_data/Fault_3/Fault_3_010.csv']

Prepare the data for the training set, validation set, and test set. For each fault type, we will keep 70 files for training, 10 files for validation and 20 files for testing.

fault_1_files = glob.glob("./random_data/Fault_1/*")
fault_2_files = glob.glob("./random_data/Fault_2/*")
fault_3_files = glob.glob("./random_data/Fault_3/*")
fault_4_files = glob.glob("./random_data/Fault_4/*")
fault_5_files = glob.glob("./random_data/Fault_5/*")
from sklearn.model_selection import train_test_split
fault_1_train, fault_1_test = train_test_split(fault_1_files, test_size = 20, random_state = 5)
fault_2_train, fault_2_test = train_test_split(fault_2_files, test_size = 20, random_state = 54)
fault_3_train, fault_3_test = train_test_split(fault_3_files, test_size = 20, random_state = 543)
fault_4_train, fault_4_test = train_test_split(fault_4_files, test_size = 20, random_state = 5432)
fault_5_train, fault_5_test = train_test_split(fault_5_files, test_size = 20, random_state = 54321)
fault_1_train, fault_1_val = train_test_split(fault_1_train, test_size = 10, random_state = 1)
fault_2_train, fault_2_val = train_test_split(fault_2_train, test_size = 10, random_state = 12)
fault_3_train, fault_3_val = train_test_split(fault_3_train, test_size = 10, random_state = 123)
fault_4_train, fault_4_val = train_test_split(fault_4_train, test_size = 10, random_state = 1234)
fault_5_train, fault_5_val = train_test_split(fault_5_train, test_size = 10, random_state = 12345)
train_file_names = fault_1_train + fault_2_train + fault_3_train + fault_4_train + fault_5_train
validation_file_names = fault_1_val + fault_2_val + fault_3_val + fault_4_val + fault_5_val
test_file_names = fault_1_test + fault_2_test + fault_3_test + fault_4_test + fault_5_test

# Shuffle training files (We don't need to shuffle validation and test data)
np.random.shuffle(train_file_names)
print("Number of train_files:" ,len(train_file_names))
print("Number of validation_files:" ,len(validation_file_names))
print("Number of test_files:" ,len(test_file_names))
Number of train_files: 350
Number of validation_files: 50
Number of test_files: 100
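
The ten train_test_split calls above follow one pattern, so the same 70/10/20 split per class can also be written as a loop. The sketch below is an alternative, not the code used in this post: it uses Python's random module instead of scikit-learn, and the function name split_per_class and the generated file names are illustrative assumptions.

```python
import random

def split_per_class(files_by_class, n_val=10, n_test=20, seed=0):
    """Split each class's file list into train/validation/test parts."""
    rng = random.Random(seed)
    train, val, test = [], [], []
    for class_name in sorted(files_by_class):
        class_files = list(files_by_class[class_name])
        rng.shuffle(class_files)
        test += class_files[:n_test]
        val += class_files[n_test:n_test + n_val]
        train += class_files[n_test + n_val:]
    rng.shuffle(train)   # only the training files need to be shuffled
    return train, val, test

# Stand-in file names with the same layout as in this post.
files_by_class = {
    "Fault_{}".format(k): [
        "./random_data/Fault_{0}/Fault_{0}_{1:03}.csv".format(k, i + 1) for i in range(100)
    ]
    for k in range(1, 6)
}
train, val, test = split_per_class(files_by_class)
print(len(train), len(val), len(test))   # 350 50 100
```

Splitting inside each class keeps all three sets balanced, which is the same property the per-class calls above provide.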

Create the datasets and dataloaders for training, validation, and test set.

batch_size = 10
train_dataset = CustomDataset(filenames = train_file_names, batch_size = batch_size)
val_dataset = CustomDataset(filenames = validation_file_names, batch_size = batch_size)
test_dataset = CustomDataset(filenames = test_file_names, batch_size = batch_size)

train_dataloader = DataLoader(train_dataset, batch_size = None, shuffle = True)
val_dataloader = DataLoader(val_dataset, batch_size = None)  # Shuffle is False by default.
test_dataloader = DataLoader(test_dataset, batch_size = None)

Now create the model. We will build one of the simplest models. Readers are free to use a different model. If torchsummary is not installed, use pip install torchsummary to install it.

from torch.nn import Sequential, Conv2d, MaxPool2d, Flatten, Linear, ReLU, Softmax
from torchsummary import summary
dev = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model = Sequential(
        Conv2d(in_channels = 1, out_channels = 16, kernel_size = 3),
        ReLU(),
        MaxPool2d(2),
        Conv2d(16,32,3),
        ReLU(),
        MaxPool2d(2),
        Flatten(),
        Linear(in_features = 1152, out_features=16),
        ReLU(),
        Linear(16, 5),
        Softmax(dim = 1)
)
model.to(dev)
summary(model,input_size = (1,32,32), device = dev.type)
----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
================================================================
            Conv2d-1           [-1, 16, 30, 30]             160
              ReLU-2           [-1, 16, 30, 30]               0
         MaxPool2d-3           [-1, 16, 15, 15]               0
            Conv2d-4           [-1, 32, 13, 13]           4,640
              ReLU-5           [-1, 32, 13, 13]               0
         MaxPool2d-6             [-1, 32, 6, 6]               0
           Flatten-7                 [-1, 1152]               0
            Linear-8                   [-1, 16]          18,448
              ReLU-9                   [-1, 16]               0
           Linear-10                    [-1, 5]              85
          Softmax-11                    [-1, 5]               0
================================================================
Total params: 23,333
Trainable params: 23,333
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.00
Forward/backward pass size (MB): 0.35
Params size (MB): 0.09
Estimated Total Size (MB): 0.44
----------------------------------------------------------------
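# Note: CrossEntropyLoss applies log-softmax internally and expects raw scores (logits).
# Because the model above already ends in Softmax, softmax is effectively applied twice during training.
# Removing the Softmax layer and calling torch.softmax only at prediction time avoids this.
criterion = torch.nn.CrossEntropyLoss()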
optimizer = torch.optim.Adam(model.parameters())
model = model.float()   # We will make all model parameters floats.
epochs = 10
for epoch in range(epochs):
    running_loss_train = 0.0
    running_loss_val = 0.0
    correct_train = 0.0
    correct_val = 0.0
    num_labels_train = 0.0
    num_labels_val = 0.0

    # Training loop
    for inputs, labels in train_dataloader:
        inputs, labels = inputs.to(dev), labels.type(torch.LongTensor).to(dev) # PyTorch expects categorical targets as LongTensor.
        optimizer.zero_grad()
        outputs = model(inputs.float())
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        running_loss_train = running_loss_train + loss.item()
        correct_train = correct_train + (torch.argmax(outputs,dim = 1) == labels).float().sum()
        num_labels_train = num_labels_train + len(labels)

    # Validation loop (no gradients are needed here, so autograd tracking is switched off)
    with torch.no_grad():
        for inputs, labels in val_dataloader:
            inputs, labels = inputs.to(dev), labels.type(torch.LongTensor).to(dev) # PyTorch expects categorical targets as LongTensor.
            outputs = model(inputs.float())
            loss = criterion(outputs, labels)
            running_loss_val = running_loss_val + loss.item()
            correct_val = correct_val + (torch.argmax(outputs, dim = 1) == labels).float().sum()
            num_labels_val = num_labels_val + len(labels)

    train_accuracy = correct_train/num_labels_train
    val_accuracy = correct_val/num_labels_val
    print("Epoch:{}, Train_loss: {:.4f}, Train_accuracy: {:.4f}, Val_loss: {:.4f}, Val_accuracy: {:.4f}".\
        format(epoch, running_loss_train/len(train_dataloader), train_accuracy, running_loss_val/len(val_dataloader), val_accuracy))
Epoch:0, Train_loss: 1.6112, Train_accuracy: 0.2000, Val_loss: 1.6096, Val_accuracy: 0.2000
Epoch:1, Train_loss: 1.6096, Train_accuracy: 0.2000, Val_loss: 1.6097, Val_accuracy: 0.2000
Epoch:2, Train_loss: 1.6095, Train_accuracy: 0.2000, Val_loss: 1.6096, Val_accuracy: 0.2000
Epoch:3, Train_loss: 1.6090, Train_accuracy: 0.2000, Val_loss: 1.6097, Val_accuracy: 0.2000
Epoch:4, Train_loss: 1.6084, Train_accuracy: 0.2000, Val_loss: 1.6096, Val_accuracy: 0.2000
Epoch:5, Train_loss: 1.6075, Train_accuracy: 0.2000, Val_loss: 1.6099, Val_accuracy: 0.2000
Epoch:6, Train_loss: 1.6084, Train_accuracy: 0.2057, Val_loss: 1.6096, Val_accuracy: 0.2000
Epoch:7, Train_loss: 1.6045, Train_accuracy: 0.2000, Val_loss: 1.6101, Val_accuracy: 0.2000
Epoch:8, Train_loss: 1.6051, Train_accuracy: 0.2000, Val_loss: 1.6100, Val_accuracy: 0.2000
Epoch:9, Train_loss: 1.6001, Train_accuracy: 0.2057, Val_loss: 1.6126, Val_accuracy: 0.2200

Before we make any comments on training accuracy and validation accuracy, we should keep in mind that our original dataset contains only random numbers. An accuracy of about 0.2 is what random guessing gives for five balanced classes, so there is nothing further to interpret in these results.
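
A caveat on the setup above: torch.nn.CrossEntropyLoss applies log-softmax internally and expects raw scores (logits), while the model in this post already ends in a Softmax layer. The sketch below uses made-up logits rather than the model above, and illustrates what passing probabilities into this loss does: even for a confident and correct prediction, the loss cannot fall much below 0.9 for five classes.

```python
import torch
import torch.nn.functional as F

criterion = torch.nn.CrossEntropyLoss()
labels = torch.tensor([0, 1, 2, 3])

# Made-up, very confident and correct logits for 4 samples and 5 classes.
logits = torch.full((4, 5), -10.0)
logits[torch.arange(4), labels] = 10.0

loss_from_logits = criterion(logits, labels)                        # intended usage
manual = F.nll_loss(F.log_softmax(logits, dim=1), labels)           # same quantity, written out
loss_from_probs = criterion(torch.softmax(logits, dim=1), labels)   # softmax applied twice

print(loss_from_logits.item(), manual.item(), loss_from_probs.item())
```

The first two values agree and are close to zero, while the third stays near 0.905, which weakens the training signal. A common arrangement is to drop the final Softmax layer during training and apply torch.softmax to the outputs only when probabilities are needed.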

Compute test score.

running_loss_test = 0.0
correct_test = 0.0
num_labels_test = 0.0
with torch.no_grad():  # No gradients are needed for evaluation.
    for inputs, labels in test_dataloader:
        inputs, labels = inputs.to(dev), labels.type(torch.LongTensor).to(dev) # PyTorch expects categorical targets as LongTensor.
        outputs = model(inputs.float())
        loss = criterion(outputs, labels)
        running_loss_test = running_loss_test + loss.item()
        correct_test = correct_test + (torch.argmax(outputs, dim = 1) == labels).float().sum()
        num_labels_test = num_labels_test + len(labels)

test_accuracy = correct_test/num_labels_test
print("Test_loss: {:.4f}, Test_accuracy: {:.4f}".format(running_loss_test/len(test_dataloader), test_accuracy))
Test_loss: 1.6130, Test_accuracy: 0.2100

4. How to make predictions?

Until now, we have evaluated our model on a held-out test set. For our test set, both data and labels were known, so we could evaluate the model’s performance. But oftentimes we don’t have access to the true labels of the test set. Rather, we have to make predictions on the data available. This is the case in online competitions where we have to submit our predictions on a test set for which we don’t know the labels. We will call this set (without any labels) the prediction set. This naming convention is arbitrary but we will stick with it.

If the whole of our prediction set fits into memory, we can simply make predictions on all of it at once and then use np.argmax() or torch.argmax() to obtain predicted class labels. Otherwise, we can read the files in the prediction set in chunks, make predictions on each chunk, and finally concatenate the results.

Another, more systematic way of doing this is to write a separate dataset class that reads files from the prediction set in chunks, and to make predictions chunk by chunk. We will show how this approach works. As we don’t have a prediction set yet, we will first create some files and save them in a prediction set folder.

def create_prediction_set(num_files = 20):
    os.mkdir("./random_data/prediction_set")
    for i in range(num_files):
        data = np.random.randn(1024,)
        file_name = "./random_data/prediction_set/file_{:03}.csv".format(i + 1)  # This creates file_name
        np.savetxt(file_name, data, delimiter = ",", header = "V1", comments = "")
    print("{} files created in prediction set.".format(num_files))

Create some files for the prediction set.

create_prediction_set(num_files = 55)
55 files created in prediction set.
prediction_files = sorted(glob.glob("./random_data/prediction_set/*"))  # Sort so that predictions line up with file order
print("Total number of files: ", len(prediction_files))
print("Showing first 10 files...")
prediction_files[:10]
Total number of files:  55
Showing first 10 files...

['./random_data/prediction_set/file_001.csv',
 './random_data/prediction_set/file_002.csv',
 './random_data/prediction_set/file_003.csv',
 './random_data/prediction_set/file_004.csv',
 './random_data/prediction_set/file_005.csv',
 './random_data/prediction_set/file_006.csv',
 './random_data/prediction_set/file_007.csv',
 './random_data/prediction_set/file_008.csv',
 './random_data/prediction_set/file_009.csv',
 './random_data/prediction_set/file_010.csv']

The prediction dataset will be slightly different from our previous custom dataset class. We only need to return data in this case.

class PredictionDataset(Dataset):
    def __init__(self, filenames, batch_size):
        self.filenames = filenames
        self.batch_size = batch_size

    def __len__(self):
        return int(np.ceil(len(self.filenames) / float(self.batch_size)))

    def __getitem__(self, idx):
        if idx >= len(self):  # Stops a plain `for` loop over the dataset.
            raise IndexError
        batch_x = self.filenames[idx * self.batch_size:(idx + 1) * self.batch_size]
        data = []
        for file in batch_x:
            temp = pd.read_csv(file)
            data.append(temp.values.reshape(32, 32, 1))
        data = np.asarray(data).reshape(-1, 1, 32, 32)

        return data

Check whether the dataset and dataloader work or not.

prediction_dataset = PredictionDataset(prediction_files, batch_size = 10)
prediction_dataloader = DataLoader(prediction_dataset,batch_size = None, shuffle = False)
for data in prediction_dataloader:
    print(data.shape)
torch.Size([10, 1, 32, 32])
torch.Size([10, 1, 32, 32])
torch.Size([10, 1, 32, 32])
torch.Size([10, 1, 32, 32])
torch.Size([10, 1, 32, 32])
torch.Size([5, 1, 32, 32])

Make predictions.

preds = []
for data in prediction_dataloader:
    data = data.to(dev)
    preds.append(model(data.float()))
preds = torch.cat(preds)
preds
tensor([[0.3369, 0.0306, 0.2113, 0.3813, 0.0399],
        [0.3747, 0.0238, 0.2525, 0.3083, 0.0407],
        [0.3462, 0.0353, 0.2434, 0.3220, 0.0531],
        [0.3387, 0.0365, 0.2338, 0.3394, 0.0516],
        [0.3619, 0.0246, 0.2021, 0.3783, 0.0331],
        [0.3302, 0.0431, 0.2429, 0.3209, 0.0629],
        [0.4018, 0.0178, 0.2334, 0.3161, 0.0308],
        [0.3479, 0.0335, 0.2398, 0.3288, 0.0501],
        [0.3279, 0.0465, 0.2430, 0.3162, 0.0665],
        [0.3299, 0.0396, 0.2703, 0.2957, 0.0645],
        [0.3538, 0.0263, 0.2408, 0.3382, 0.0409],
        [0.3306, 0.0415, 0.2074, 0.3691, 0.0514],
        [0.3377, 0.0326, 0.2195, 0.3661, 0.0441],
        [0.3326, 0.0387, 0.2409, 0.3324, 0.0554],
        [0.3321, 0.0376, 0.2649, 0.3054, 0.0600],
        [0.3368, 0.0385, 0.2396, 0.3286, 0.0565],
        [0.3596, 0.0239, 0.2409, 0.3377, 0.0379],
        [0.3905, 0.0235, 0.2188, 0.3307, 0.0365],
        [0.3396, 0.0298, 0.2374, 0.3488, 0.0444],
        [0.3534, 0.0248, 0.2546, 0.3260, 0.0412],
        [0.3356, 0.0336, 0.2101, 0.3772, 0.0435],
        [0.3255, 0.0501, 0.2172, 0.3441, 0.0632],
        [0.3375, 0.0318, 0.2428, 0.3409, 0.0471],
        [0.3309, 0.0345, 0.2925, 0.2799, 0.0621],
        [0.3575, 0.0294, 0.2304, 0.3385, 0.0443],
        [0.3312, 0.0428, 0.2192, 0.3513, 0.0556],
        [0.3382, 0.0355, 0.2282, 0.3489, 0.0493],
        [0.3400, 0.0287, 0.2491, 0.3374, 0.0448],
        [0.3407, 0.0410, 0.2238, 0.3386, 0.0559],
        [0.3529, 0.0316, 0.2259, 0.3444, 0.0452],
        [0.3413, 0.0346, 0.2100, 0.3699, 0.0442],
        [0.3432, 0.0274, 0.2159, 0.3754, 0.0380],
        [0.3319, 0.0403, 0.2334, 0.3386, 0.0559],
        [0.3323, 0.0377, 0.2615, 0.3090, 0.0595],
        [0.3351, 0.0355, 0.2241, 0.3571, 0.0482],
        [0.3420, 0.0367, 0.2103, 0.3636, 0.0474],
        [0.3271, 0.0416, 0.1838, 0.4018, 0.0456],
        [0.3345, 0.0272, 0.1773, 0.4302, 0.0308],
        [0.3489, 0.0246, 0.2447, 0.3428, 0.0389],
        [0.3360, 0.0338, 0.1997, 0.3890, 0.0414],
        [0.3340, 0.0365, 0.2511, 0.3235, 0.0550],
        [0.3622, 0.0207, 0.2225, 0.3632, 0.0314],
        [0.3674, 0.0261, 0.2247, 0.3425, 0.0393],
        [0.3371, 0.0320, 0.2391, 0.3452, 0.0465],
        [0.3620, 0.0271, 0.2215, 0.3501, 0.0393],
        [0.3303, 0.0415, 0.2512, 0.3150, 0.0620],
        [0.3315, 0.0402, 0.2165, 0.3600, 0.0517],
        [0.3358, 0.0389, 0.2567, 0.3086, 0.0600],
        [0.3580, 0.0276, 0.2110, 0.3652, 0.0382],
        [0.3577, 0.0290, 0.2221, 0.3494, 0.0418],
        [0.3450, 0.0297, 0.2698, 0.3044, 0.0512],
        [0.3370, 0.0325, 0.2324, 0.3520, 0.0461],
        [0.3694, 0.0259, 0.2063, 0.3626, 0.0358],
        [0.3308, 0.0414, 0.2087, 0.3674, 0.0517],
        [0.3410, 0.0344, 0.2436, 0.3302, 0.0508]], device='cuda:0',
       grad_fn=<CatBackward>)
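
The grad_fn in the output above shows that autograd tracked these forward passes, which costs memory that inference does not need. A hedged sketch of the same loop with gradient tracking switched off is given below. To keep it self-contained, it uses a tiny stand-in model and random chunks in place of the trained model and prediction_dataloader; those stand-ins (toy_model, toy_chunks) are assumptions for illustration only.

```python
import torch

torch.manual_seed(0)
# Stand-ins for the trained model and for prediction_dataloader.
toy_model = torch.nn.Sequential(
    torch.nn.Flatten(),
    torch.nn.Linear(32 * 32, 5),
    torch.nn.Softmax(dim=1),
)
toy_chunks = [torch.rand(10, 1, 32, 32) for _ in range(5)] + [torch.rand(5, 1, 32, 32)]

toy_model.eval()                      # switch layers such as dropout to inference mode
chunk_preds = []
with torch.no_grad():                 # do not build the autograd graph
    for chunk in toy_chunks:
        chunk_preds.append(toy_model(chunk.float()).cpu())
all_preds = torch.cat(chunk_preds)    # one row of class probabilities per file
predicted_classes = torch.argmax(all_preds, dim=1)

print(all_preds.shape, all_preds.requires_grad)
```

Moving each chunk of predictions to the CPU before concatenating keeps GPU memory use flat when the prediction set is large.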

Each prediction is a 5-dimensional vector. This is because we have used 5 neurons in the output layer and the output activation is softmax. The entries of each 5-dimensional output vector add up to 1, so they can be interpreted as probabilities. We therefore assign each input to the class with the highest predicted probability. To get that class, we can use np.argmax() or torch.argmax().

torch.argmax(preds, dim = 1)
tensor([3, 0, 0, 3, 3, 0, 0, 0, 0, 0, 0, 3, 3, 0, 0, 0, 0, 0, 3, 0, 3, 3, 3, 0,
        0, 3, 3, 0, 0, 0, 3, 3, 3, 0, 3, 3, 3, 3, 0, 3, 0, 3, 0, 3, 0, 0, 3, 0,
        3, 0, 0, 3, 0, 3, 0], device='cuda:0')

Remember that our data are randomly generated. So we should not be surprised by this result.

This brings us to the end of the blog. As we had planned in the beginning, we have created random data files, a custom dataloader, trained a model using that dataloader, and made predictions on new data. The above code can be tweaked slightly to read any type of files other than .csv. And now we can train our model without worrying about the data size. Whether the data size is 10GB or 750GB, our approach will work for both.

Also note that we have not used the multi-worker, multi-processing capabilities of the dataloader. To further speed up data loading, readers should take advantage of them through the num_workers argument of DataLoader. The best way to choose the number of workers is to try a few values and see which works best for the system under consideration.
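
As a rough sketch of such a trial, the snippet below times one pass over a dataset for a couple of values of the DataLoader's num_workers argument. The toy dataset, the candidate values and the helper name time_one_pass are illustrative assumptions; in practice the CustomDataset defined earlier would take the toy dataset's place.

```python
import time
import torch
from torch.utils.data import Dataset, DataLoader

class ToyChunkDataset(Dataset):
    """Returns pre-batched random chunks, mimicking the dataset in this post."""
    def __init__(self, num_chunks=20, batch_size=10):
        self.num_chunks = num_chunks
        self.batch_size = batch_size

    def __len__(self):
        return self.num_chunks

    def __getitem__(self, idx):
        if idx >= len(self):
            raise IndexError
        return torch.rand(self.batch_size, 1, 32, 32)

def time_one_pass(dataset, num_workers):
    """Seconds needed to iterate once over the dataset."""
    loader = DataLoader(dataset, batch_size=None, num_workers=num_workers)
    start = time.perf_counter()
    for _ in loader:
        pass
    return time.perf_counter() - start

if __name__ == "__main__":   # the guard is needed when worker processes are spawned
    for workers in (0, 2):
        print(workers, "workers:", round(time_one_pass(ToyChunkDataset(), workers), 3), "s")
```

For a dataset this small, extra workers may well be slower because of process start-up cost; the benefit tends to show up when reading and preprocessing each chunk is expensive.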

As a final note, please keep in mind that the approach we have discussed is only one of many ways in which we can read multiple files. I have chosen this approach as it seemed natural to me. I have strived for neither efficiency nor elegance. If readers have a better idea, I would be happy to hear of it.

I hope this blog will be of help to readers. Please bring any errors or omissions to my notice.

Disclaimer: It is very likely that this blog has not used some of the best practices of PyTorch. This is because the author has a superficial knowledge of PyTorch and is not aware of its best practices. The author (un)fortunately prefers TensorFlow.