What Is MongoDB GridFS? - FutureFundamentals

GridFS is the MongoDB specification for storing and retrieving large files such as images, audio files, and video files. It works like a file system, but its data is stored in MongoDB collections. GridFS can store files larger than the BSON document size limit of 16 MB.

GridFS divides a file into chunks and stores each chunk in a separate document. By default, each chunk holds at most 255 kB of data.
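The chunking described above can be sketched in plain Node.js without a database. This is a minimal illustration, not the driver's implementation; the helper name splitIntoChunks and the 600 kB test buffer are assumptions made for the example.

```javascript
// Default GridFS chunk size: 255 kB = 255 * 1024 bytes.
const DEFAULT_CHUNK_SIZE = 255 * 1024; // 261120

// Hypothetical helper: split a Buffer into GridFS-style chunk records.
// Each record carries its sequence number n and a slice of the data.
function splitIntoChunks(buffer, chunkSize = DEFAULT_CHUNK_SIZE) {
  const chunks = [];
  for (let offset = 0, n = 0; offset < buffer.length; offset += chunkSize, n += 1) {
    // subarray clamps the end index, so the last chunk may be shorter.
    chunks.push({ n, data: buffer.subarray(offset, offset + chunkSize) });
  }
  return chunks;
}

const file = Buffer.alloc(600 * 1024); // a 600 kB file of zero bytes
const parts = splitIntoChunks(file);
console.log(parts.map((c) => c.data.length)); // [ 261120, 261120, 92160 ]
```

Only the last chunk is shorter than the chunk size; every other chunk is exactly 261120 bytes.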

By default, GridFS uses two collections, fs.files and fs.chunks, to store the file’s metadata and its chunks. Each chunk is identified by its unique _id field of type ObjectId. Each fs.files document serves as the parent of its chunks: the files_id field in an fs.chunks document links the chunk to its parent.

The following is a sample document from the fs.files collection −

{
   "filename": "test.txt",
   "chunkSize": NumberInt(261120),
   "uploadDate": ISODate("2014-04-13T11:32:33.557Z"),
   "md5": "7b762939321e146569b07f72c62cca4f",
   "length": NumberInt(646)
}

The document specifies the file name, chunk size, upload date, MD5 hash, and length.

The following is a sample document from the fs.chunks collection −

{
   "files_id": ObjectId("534a75d19f54bfec8a2fe44b"),
   "n": NumberInt(0),
   "data": "Mongo Binary Data"
}

Adding Files to GridFS

Now we will store an mp3 file in GridFS using the put command. For this, we will use the mongofiles.exe utility found in the bin folder of the MongoDB installation folder.

Open your command prompt, navigate to the bin folder of the MongoDB installation folder, and type the following command −

>mongofiles.exe -d gridfs put song.mp3

Here, gridfs is the name of the database in which the file will be stored. If the database does not exist, MongoDB creates it automatically on the fly. song.mp3 is the name of the uploaded file. To see the file’s document in the database, you can use the find query −

>db.fs.files.find()

The above command returned the following document −

{
   _id: ObjectId('534a811bf8b4aa4d33fdf94d'), 
   filename: "song.mp3", 
   chunkSize: 261120, 
   uploadDate: new Date(1397391643474),
   md5: "e4f53379c909f7bed2e9d631e15c1c41",
   length: 10401959 
}

We can also see all the chunks present in fs.chunks collection related to the stored file with the following code, using the document id returned in the previous query −

>db.fs.chunks.find({files_id:ObjectId('534a811bf8b4aa4d33fdf94d')})

In my case, the query returned 40 documents, meaning that the whole mp3 file was divided into 40 chunks of data.
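The count of 40 follows directly from the length and chunkSize values in the fs.files document. The sketch below reproduces the arithmetic; the helper name chunkLayout is an assumption made for the example.

```javascript
// Hypothetical helper: how a file of `length` bytes is laid out in chunks.
function chunkLayout(length, chunkSize) {
  const count = Math.ceil(length / chunkSize);
  // Every chunk is full except the last, which holds the remainder.
  const lastChunkSize = count === 0 ? 0 : length - (count - 1) * chunkSize;
  return { count, lastChunkSize };
}

// Values taken from the fs.files document for song.mp3 shown above.
console.log(chunkLayout(10401959, 261120)); // { count: 40, lastChunkSize: 218279 }
```

The first 39 chunks hold 261120 bytes each, and the last one holds the remaining 218279 bytes.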

MongoDB GridFS Indexes

For efficiency, GridFS uses an index on each of the chunks and files collections. For convenience, drivers that conform to the GridFS specification create these indexes automatically.

The GridFS specification defines a simple GridFS API and describes advanced capabilities that drivers may choose to offer in their implementations. It also defines the meaning and purpose of every field in the GridFS data model, disambiguates GridFS terminology, and documents configuration choices that were previously unspecified. In addition to the default indexes, you can create any further indexes your application needs.

The Chunks Index

GridFS uses the files_id and n fields to create a unique compound index on the chunks collection. This enables efficient chunk retrieval, as shown in the following example:

db.fs.chunks.find( { files_id: myFileID } ).sort( { n: 1 } )

Drivers that conform to the GridFS specification automatically check that this index exists before performing read and write operations. For the specific behavior of your GridFS application, consult the documentation of the driver you use.

If this index does not exist, you can create it with the following operation in the MongoDB Shell (mongosh). mongosh is a JavaScript and Node.js REPL environment for working with MongoDB deployments, and you can use it to test queries and operations directly against your database.

db.fs.chunks.createIndex( { files_id: 1, n: 1 }, { unique: true } );
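The reason for ordering chunks by n can be illustrated with an in-memory stand-in for the chunks collection. This is a sketch only: the sample documents, the plain string identifiers used in place of ObjectId values, and the helper name reassemble are all assumptions made for the example.

```javascript
// Hypothetical in-memory stand-in for documents from fs.chunks.
// Real files_id values are ObjectIds; plain strings are used here.
const chunks = [
  { files_id: "file-A", n: 1, data: Buffer.from("Grid") },
  { files_id: "file-B", n: 0, data: Buffer.from("other") },
  { files_id: "file-A", n: 0, data: Buffer.from("Mongo") },
  { files_id: "file-A", n: 2, data: Buffer.from("FS") },
];

// Mirror find({ files_id }).sort({ n: 1 }), then concatenate the payloads.
function reassemble(allChunks, filesId) {
  const ordered = allChunks
    .filter((c) => c.files_id === filesId)
    .sort((a, b) => a.n - b.n);
  return Buffer.concat(ordered.map((c) => c.data));
}

console.log(reassemble(chunks, "file-A").toString()); // MongoGridFS
```

Without the sort on n, the chunks would be concatenated in storage order and the file would come back corrupted.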

The Files Index

GridFS uses an index on the files collection built on the filename and uploadDate fields. This enables efficient file retrieval, as shown in the following example:

db.fs.files.find( { filename: myFileName } ).sort( { uploadDate: 1 } )

If this index does not already exist, you can create it in the MongoDB Shell:

db.fs.files.createIndex( { filename: 1, uploadDate: 1 } );

MongoDB GridFS Sharding 

GridFS is divided into two collections: files and chunks.

Chunks Collection

The chunks collection stores the binary chunks. To shard it, use either { files_id: 1, n: 1 } or { files_id: 1 } as the shard key index. files_id is an ObjectId and changes monotonically.

If your MongoDB driver runs the filemd5 command to verify uploads, you cannot use hashed sharding for the chunks collection.

Each document in the chunks collection represents a distinct chunk of a file stored in GridFS. Documents in this collection have the following form:

{ 
  "_id" : <ObjectId>, 
  "files_id" : <ObjectId>, 
  "n" : <num>, 
  "data" : <binary> 
}

Documents in the chunks collection contain the following fields:

chunks._id: The unique ObjectId of the chunk.
chunks.files_id: The _id of the parent document, as specified in the files collection.
chunks.n: The chunk’s sequence number. GridFS numbers the chunks of a file starting with 0.
chunks.data: The payload of the chunk as a BSON Binary type.
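These fields are enough to check whether a set of chunks forms a complete file. The following is a minimal sketch that works on plain objects rather than a live collection; the helper name verifyChunks and the tiny 4-byte chunk size are assumptions chosen to keep the example readable.

```javascript
// Hypothetical check: do these chunk documents form a complete file?
// fileDoc needs the length and chunkSize fields from the files collection.
function verifyChunks(fileDoc, chunks) {
  const expected = Math.ceil(fileDoc.length / fileDoc.chunkSize);
  if (chunks.length !== expected) return false;
  const ordered = [...chunks].sort((a, b) => a.n - b.n);
  let total = 0;
  for (let i = 0; i < ordered.length; i += 1) {
    const isLast = i === ordered.length - 1;
    if (ordered[i].n !== i) return false; // gap or duplicate in the sequence
    if (!isLast && ordered[i].data.length !== fileDoc.chunkSize) return false;
    total += ordered[i].data.length;
  }
  return total === fileDoc.length;
}

const fileDoc = { length: 10, chunkSize: 4 };
const complete = [
  { n: 0, data: Buffer.alloc(4) },
  { n: 1, data: Buffer.alloc(4) },
  { n: 2, data: Buffer.alloc(2) },
];
console.log(verifyChunks(fileDoc, complete)); // true
console.log(verifyChunks(fileDoc, complete.slice(0, 2))); // false
```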

Files Collection

The files collection stores each file’s metadata. It is small and contains only metadata. None of the keys that GridFS requires lend themselves to an even distribution in a sharded environment, so leaving the files collection unsharded lets all of the file metadata documents reside on the primary shard.

If you need to shard the files collection, use the _id field, possibly in combination with an application field.

Each document in the files collection represents a file in GridFS.

{
  "_id" : <ObjectId>,
  "length" : <num>,
  "chunkSize" : <num>,
  "uploadDate" : <timestamp>,
  "md5" : <hash>,
  "filename" : <string>,
  "contentType" : <string>,
  "aliases" : <string array>,
  "metadata" : <any>
}

Documents in the files collection contain some or all of the following fields:

files.length: The size of the document in bytes.
files._id: The unique identifier of the document. The _id is of the data type you chose for the original document. BSON ObjectId is the default type for MongoDB documents.
files.chunkSize: The size of each chunk in bytes. GridFS divides the document into chunks of size chunkSize, except for the last chunk, which is only as large as needed. The default size is 255 kilobytes (kB).
files.uploadDate: The date on which GridFS first stored the document. This value is of type Date.
files.md5: An MD5 hash of the complete file, as returned by the filemd5 command. This value is of type String. The field is deprecated in the current GridFS specification.
files.metadata: Optional. The metadata field can be of any data type and can hold any additional information you want to store. If you want to add arbitrary fields to documents in the files collection, add them to an object in the metadata field.
files.aliases: Optional. An array of alias strings.
files.contentType: Optional. A valid MIME type for the GridFS file.
files.filename: Optional. A human-readable name for the GridFS file.

Example:

{
"_id" : ObjectId("6177da181964fd7f82e2aaa9"),
"length" : 15720,
"chunkSize" : 261120,
"uploadDate" : ISODate("2021-10-26T16:06:08.091+05:30"),
"filename" : "ishanfile.docx",
"contentType" : "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
}
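Most of the fields described above can be derived from the file's bytes. The sketch below builds such a document for an in-memory file using Node's built-in crypto module; the helper name buildFilesDocument is an assumption, and the _id that a driver would normally generate is left out.

```javascript
const crypto = require("crypto");

// Hypothetical helper: derive the files-collection fields for an in-memory file.
// A driver would normally generate _id; it is omitted in this sketch.
function buildFilesDocument(filename, buffer, chunkSize = 255 * 1024) {
  return {
    length: buffer.length, // size in bytes
    chunkSize, // 261120 by default
    uploadDate: new Date(),
    md5: crypto.createHash("md5").update(buffer).digest("hex"),
    filename,
  };
}

const doc = buildFilesDocument("hello.txt", Buffer.from("hello"));
console.log(doc.length, doc.md5); // 5 5d41402abc4b2a76b9719d911017c592
```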

The files collection, like the chunks collection, uses a compound index, in this case on the filename and uploadDate fields, to allow efficient file retrieval. For example:

db.fs.files.find( { filename: fileName } ).sort( { uploadDate: 1 } )

If this index does not exist, run the following command in the MongoDB Shell:

db.fs.files.createIndex( { filename: 1, uploadDate: 1 } );

How to Read and Write files in MongoDB GridFS?

To follow the rest of the tutorial, your machine must have the following software installed:

Node.js
MongoDB with MongoDB Compass
VS Code

Step 1: Create a folder named mongo_grid and open it in the VS Code editor. This folder becomes the workspace that holds all of the code files.

Step 2: In this workspace, create two folders named filestoread and filestowrite. The filestoread folder holds the files that will be saved into the database, and the filestowrite folder receives the files read back from the database.

Step 3: Open the VS Code terminal and run npm init -y

This command creates a package.json file for the workspace with default values.

Install gridfs-stream and mongoose using the following command:

npm install gridfs-stream
npm install mongoose

After installation, both packages are listed in the dependencies section of the package.json file.

The gridfs-stream package lets you stream files to and from MongoDB GridFS. The mongoose package is a MongoDB object modelling tool designed to work in an asynchronous environment.

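Step 4: Keep the project in the following structure: the mongo_grid workspace contains the filestoread and filestowrite folders, the package.json file, and the script files created in the steps below.

Put a few image, video, or audio files in the filestoread folder. These files will be used for the write and read operations. This example uses a sample file named gfs.png.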

Step 5: Open the MongoDB Compass and connect to the MongoDB Database. Create a database with the name filesDB and collection named files.

Step 6: To write a file to GridFS, create a JavaScript file named writefile.js with the following code:

//1. Load the mongoose driver
var mongooseDv = require("mongoose");
//2. Connect to MongoDB and its database
//   (useMongoClient is a Mongoose 4.x option; Mongoose 5 and later do not need it)
mongooseDv.connect('mongodb://localhost/filesDB', { useMongoClient: true });
//3. The Connection Object
var connection = mongooseDv.connection;
if (typeof connection !== "undefined") {
    console.log(connection.readyState.toString());
    //4. The Path object
    var path = require("path");
    //5. The grid-stream
    var grid = require("gridfs-stream");
    //6. The File-System module
    var fs = require("fs");
    //7. Read the image file from the filestoread folder
    var filesrc = path.join(__dirname, "./filestoread/gfs.png");
    //8. Establish connection between Mongo and GridFS
    grid.mongo = mongooseDv.mongo;
    //9.Open the connection and write file
    connection.once("open", () => {
        console.log("Connection Open");
        var gridfs = grid(connection.db);
        if (gridfs) {
            //9a. create a stream, this will be
            //used to store file in database
            var streamwrite = gridfs.createWriteStream({
                //the file will be stored with the name
                filename: "gfs.png"
            });
            //9b. create a readstream to read the file
            //from the filestoread folder
            //and pipe into the database
            fs.createReadStream(filesrc).pipe(streamwrite);
            //9c. Complete the write operation
            streamwrite.on("close", function (file) {
                console.log("successfully written in database");
            });
        } else {
            console.log("No Grid FS Object");
        }
    });
} else {
    console.log('Not connected');
}
console.log("done");

The path of the file in the filestoread folder is passed as a parameter to the fs module’s createReadStream() function. The pipe() function of the resulting read stream accepts the write stream created with the gridfs object, so the contents of the image file are streamed into the database.

Step 7: Run the code using node writefile

Now check MongoDB Compass: the filesDB database contains the data written by the script, and you can view the file in the fs.files collection.

Step 8: To read a file, create a JavaScript file named readfile.js with the following code:

//1. Load the mongoose driver
var mongooseDv = require("mongoose");
//2. Connect to MongoDB and its database
mongooseDv.connect('mongodb://localhost/filesDB', { useMongoClient: true });
var connection = mongooseDv.connection;
if (typeof connection !== "undefined") {
    console.log(connection.readyState.toString());
    var path = require("path");
    var grid = require("gridfs-stream");
    var fs = require("fs");
    //3. Establish connection between Mongo and GridFS
    grid.mongo = mongooseDv.mongo;
    //4. Open the connection and read the file
    connection.once("open", () => {
        console.log("Connection Open");
        var gridfs = grid(connection.db);
        if (gridfs) {
            //4a. create a write stream for the target file
            //in the filestowrite folder
            var fsstreamwrite = fs.createWriteStream(
                path.join(__dirname, "./filestowrite/gfs.png")
            );
            //4b. create a read stream for the file stored in the database
            var readstream = gridfs.createReadStream({
                filename: "gfs.png"
            });
            //4c. pipe the database file into the target file
            readstream.pipe(fsstreamwrite);
            readstream.on("close", function (file) {
                console.log("File Read successfully from database");
            });
        } else {
            console.log("No Grid FS Object");
        }
    });
} else {
    console.log('Not connected');
}
console.log("done");

Step 9: Run the above code using node readfile

This reads the file from MongoDB GridFS and writes it to the filestowrite folder.

When to Use the MongoDB GridFS Storage System

The MongoDB GridFS storage system is not widely used, but the following situations may call for it:

When your file system limits the number of files that can be stored in a directory.
When you need to access only a portion of a large file, GridFS lets you read sections of the file without loading the entire file into memory.
When you want files and their metadata to stay synchronized across geographically distributed replica sets, MongoDB can automatically distribute them to multiple mongod instances and facilities.
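Partial reads are possible because a byte range maps onto a small set of chunk numbers. The sketch below shows that mapping; the helper name chunksForRange and the sample byte range are assumptions made for the example.

```javascript
// Hypothetical helper: which chunk numbers (n) cover bytes [start, end)?
function chunksForRange(start, end, chunkSize = 255 * 1024) {
  if (start < 0 || end <= start) return [];
  const first = Math.floor(start / chunkSize);
  const last = Math.floor((end - 1) / chunkSize);
  const result = [];
  for (let n = first; n <= last; n += 1) result.push(n);
  return result;
}

// Bytes 500000 to 799999 of a file stored with the default chunk size.
console.log(chunksForRange(500000, 800000)); // [ 1, 2, 3 ]
```

Only three chunks have to be fetched for this range, however large the file is.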

When Not to Use the MongoDB GridFS Storage System

Do not use GridFS if you need to update the content of an entire file atomically. As an alternative, you can store multiple versions of each file and record the current version in the metadata. After uploading a new version of the file, use an atomic update to change the metadata field that indicates “latest” status, and then remove older versions if necessary.
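The versioning approach can be sketched with an in-memory stand-in for the files collection. As a simplification, this example treats the most recent uploadDate as the current version instead of maintaining a "latest" flag in the metadata; the sample documents and the helper names latestVersion and staleVersions are assumptions.

```javascript
// Hypothetical in-memory stand-in for files-collection documents.
const files = [
  { _id: 1, filename: "report.pdf", uploadDate: new Date("2024-01-10T00:00:00Z") },
  { _id: 2, filename: "report.pdf", uploadDate: new Date("2024-03-05T00:00:00Z") },
  { _id: 3, filename: "notes.txt", uploadDate: new Date("2024-02-01T00:00:00Z") },
];

// Pick the most recently uploaded version of a file, or null if none exists.
function latestVersion(allFiles, filename) {
  const versions = allFiles
    .filter((f) => f.filename === filename)
    .sort((a, b) => b.uploadDate - a.uploadDate);
  return versions.length > 0 ? versions[0] : null;
}

// Older versions that could be removed once the new upload is complete.
function staleVersions(allFiles, filename) {
  const latest = latestVersion(allFiles, filename);
  return allFiles.filter((f) => f.filename === filename && f !== latest);
}

console.log(latestVersion(files, "report.pdf")._id); // 2
console.log(staleVersions(files, "report.pdf").map((f) => f._id)); // [ 1 ]
```

Readers keep seeing the old version until the new upload has finished, which is what makes the switch appear atomic.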

If your files are all smaller than the 16 MB BSON document size limit, consider storing each file in a single document instead of using GridFS. You can use the BinData data type to store the binary data. For details on using BinData, consult your driver’s documentation.
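The size check behind this advice can be sketched as a simple rule of thumb. The helper name storageStrategy and the 1 kB allowance for the document's other fields are arbitrary assumptions made for illustration, not values from MongoDB.

```javascript
// BSON document size limit: 16 MB = 16 * 1024 * 1024 bytes.
const MAX_BSON_SIZE = 16 * 1024 * 1024;

// Hypothetical rule of thumb; `overhead` reserves room for the other
// fields in the document and is an arbitrary illustrative value.
function storageStrategy(fileSize, overhead = 1024) {
  return fileSize + overhead <= MAX_BSON_SIZE ? "single-document" : "gridfs";
}

console.log(storageStrategy(15720)); // single-document
console.log(storageStrategy(10401959)); // single-document
console.log(storageStrategy(20 * 1024 * 1024)); // gridfs
```

By this rule, both sample files from this article would fit in a single document, so GridFS is only required once a file approaches the 16 MB limit.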

MongoDB GridFS Limitations

The GridFS File System has the following restrictions:

Serving files alongside database content can significantly churn your memory working set. If you do not want to disturb your working set, serve your files from a separate MongoDB server.
File serving performance will be slower than serving the file natively from your web server and file system. However, the additional management benefits may outweigh the slowdown.
GridFS does not provide a way to update a file atomically. If you need this, you will have to maintain multiple versions of your files and select the correct version.

The power and rise of GridFS

GridFS is a gift for developers who want to store large files in MongoDB. It lets them store big files and retrieve portions of those files as needed, and it can be used with a wide variety of applications. The real benefit of this approach is that a portion of a file can be read without loading the complete file into memory, which makes GridFS a very useful tool for modern applications.
