AWS S3 Multipart Upload using AWS CLI

Lab Details

  1. This lab walks you through the steps to upload a file to an S3 bucket using multipart upload.

  2. Duration: 01:00:00 Hrs

  3. AWS Region: US East (N. Virginia) us-east-1

Introduction

What is S3?

  • S3 stands for Simple Storage Service.

  • It provides object storage through a web service interface.

  • Each object is stored along with its metadata and is identified by a unique key.

  • Objects uploaded to S3 are stored in containers called Buckets. Bucket names are globally unique, and buckets organize the Amazon S3 namespace at the highest level.

  • These buckets are region-specific.

  • You can assign permissions to these buckets in order to grant or restrict access to the data.

  • Applications use an object's key to access it.

  • Developers can access an object via a REST API.

  • Supports uploading objects: a single PUT operation can upload an object of up to 5 GB, and multipart upload supports objects of up to 5 TB.

  • Uses the same scalable storage infrastructure that Amazon.com uses to run its global e-commerce network.

  • Designed for online backup and archiving of data and applications on AWS.

  • It is designed with a minimal feature set to make web-scale computing easier for developers.

  • Storage classes provided are (the class is chosen per object at upload time; see the sketch at the end of this list):

    • STANDARD

    • STANDARD_IA, i.e., Standard Infrequent Access

    • INTELLIGENT_TIERING

    • ONEZONE_IA, i.e., One Zone Infrequent Access

    • GLACIER

    • DEEP_ARCHIVE

    • REDUCED_REDUNDANCY (RRS), i.e., Reduced Redundancy Storage (not recommended by AWS)

  • Data access is provided through the S3 console, a simple web-based interface, as well as through the AWS CLI and SDKs.

  • Data stored can be either Public or Private based on user requirement.

  • Data stored can be encrypted.

  • You can define lifecycle policies to automate the transition, retention, and deletion of data.

  • Amazon Athena can be used to query data stored in S3 on demand.
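
As an aside on storage classes: the AWS CLI lets you set the class at upload time with the --storage-class flag. A minimal sketch (the bucket and file names below are placeholders, not part of this lab):

    # Upload a local file directly into the STANDARD_IA storage class
    aws s3 cp backup.tar.gz s3://my-example-bucket/backup.tar.gz \
        --storage-class STANDARD_IA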

What is EC2?

  • EC2 stands for Elastic Compute Cloud.

  • It is a virtual computing environment that you rent rather than purchase, letting you create your environment without buying hardware.

  • Amazon refers to these virtual machines as Instances.

  • Preconfigured templates can be used to launch instances. These templates are referred to as images. Amazon provides these images in the form of AMIs (Amazon Machine Images).

  • Allows you to install custom applications and services.

  • Scaling the infrastructure up or down is easy, based on the demand you face.

  • AWS provides multiple configurations of CPU, memory, storage, etc. (instance types), from which you can pick the flavor required for your environment.

  • Storage options are flexible; you pick the storage based on the type of instance that you are working on.

  • Temporary storage volumes are provided, called Instance Store Volumes. Data stored on them is deleted once the instance is stopped or terminated.

  • Persistent storage volumes are available and are referred to as EBS (Elastic Block Store) volumes.

  • These instances can be placed at multiple locations which are referred to as Regions and Availability Zones (AZ).

  • You can have your instances distributed across multiple AZs within a single Region, so that if an instance fails, you can quickly remap its address (such as an Elastic IP address) to an instance in another AZ.

  • Instances deployed in one AZ can be migrated to another AZ.

  • To manage instances, images, and other EC2 resources, you can optionally assign your own metadata to each resource in the form of tags.

  • A Tag is a label that you assign to an AWS resource. It consists of a key and an optional value, both of which you define (see the sketch at the end of this list).

  • Each AWS account comes with a set of default limits (service quotas) on resources, applied on a per-Region basis.

  • To increase a limit, you need to submit a request to AWS.

  • To connect to the created instances securely, we use key pairs: AWS stores the public key, and you keep the private key.
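
As a minimal sketch of tagging from the AWS CLI (the instance ID and tag values below are placeholders):

    # Attach a Name tag to an existing instance
    aws ec2 create-tags \
        --resources i-0123456789abcdef0 \
        --tags Key=Name,Value=multipart-upload-lab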

Uploading and copying objects using multipart upload

  • Multipart upload allows you to upload a single object as a set of parts. 

  • Each part is a contiguous portion of the object's data. 

  • You can upload these object parts independently and in any order. 

  • If transmission of any part fails, you can retransmit that part without affecting other parts. 

  • After all parts of your object are uploaded, Amazon S3 assembles these parts and creates the object. 

Note: When your object size reaches 100 MB, you should consider using multipart upload instead of uploading the object in a single operation.
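
In this lab the file is split into chunks with the standard Unix split utility, as a later task shows. A minimal sketch, assuming a local file named original_file (note that every part except the last must be at least 5 MB):

    # Split the file into 20 MB chunks named chunk_aa, chunk_ab, ...
    split -b 20M original_file chunk_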

When to use multipart upload

  • If you're uploading large objects over a stable high-bandwidth network, use multipart upload to maximize the use of your available bandwidth by uploading object parts in parallel for multi-threaded performance.

  • If you're uploading over a spotty network, use multipart upload to increase resiliency to network errors by avoiding upload restarts. 

  • When using multipart upload, you need to retry uploading only parts that are interrupted during the upload. You don't need to restart uploading your object from the beginning.
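
For example, before retrying you can ask S3 which parts of an in-progress upload it has already received. A minimal sketch (the bucket, key, and upload ID below are placeholders):

    # List the parts received so far for an in-progress multipart upload
    aws s3api list-parts \
        --bucket my-example-bucket \
        --key original_file \
        --upload-id "EXAMPLE-UPLOAD-ID"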

Multipart upload process

  • Multipart upload is a three-step process: 

    • Multipart upload initiation: When you send a request to initiate a multipart upload, Amazon S3 returns a response with an upload ID, which is a unique identifier for your multipart upload. You must include this upload ID whenever you upload parts, list the parts, complete an upload, or stop an upload. If you want to provide any metadata describing the object being uploaded, you must provide it in the request to initiate the multipart upload.

    • Parts upload: When uploading a part, in addition to the upload ID, you must specify a part number. You can choose any part number between 1 and 10,000. A part number uniquely identifies a part and its position in the object you are uploading. For each part you upload, Amazon S3 returns an ETag in the response; you will need these ETags to complete the upload.

    • Multipart upload completion: When you complete a multipart upload, Amazon S3 creates the object by concatenating the parts in ascending order based on the part number. If any object metadata was provided in the initiate multipart upload request, Amazon S3 associates that metadata with the object. After a successful complete request, the parts no longer exist.
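
Putting the three steps together with the AWS CLI's low-level s3api commands, as this lab does. This is a minimal sketch: the bucket name, key, chunk name, upload ID, and ETags below are placeholders, and the real values come from the responses to the earlier commands.

    # Step 1: Initiate the upload; the response contains an UploadId
    aws s3api create-multipart-upload \
        --bucket my-example-bucket \
        --key original_file

    # Step 2: Upload each chunk as a numbered part; each response contains an ETag
    aws s3api upload-part \
        --bucket my-example-bucket \
        --key original_file \
        --part-number 1 \
        --body chunk_aa \
        --upload-id "EXAMPLE-UPLOAD-ID"

    # Step 3: Complete the upload, supplying every part number with its ETag
    # in a JSON file, e.g. parts.json:
    #   {"Parts": [{"ETag": "\"etag-of-part-1\"", "PartNumber": 1},
    #              {"ETag": "\"etag-of-part-2\"", "PartNumber": 2}]}
    aws s3api complete-multipart-upload \
        --bucket my-example-bucket \
        --key original_file \
        --upload-id "EXAMPLE-UPLOAD-ID" \
        --multipart-upload file://parts.json

Tasks 8 through 11 in the task list below walk through this same sequence step by step.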

Architecture Diagram

Task Details

  1. Log in to the AWS Management Console.

  2. Create an IAM Role

  3. Create an S3 bucket

  4. Create an EC2 instance

  5. SSH into the EC2 instance

  6. View the original file in EC2

  7. Split the original file

  8. Create a Multipart upload

  9. Upload the file chunks

  10. Create a Multipart JSON file

  11. Complete the Multipart Upload

  12. View the file in the S3 Bucket

  13. Validation of the lab