What is Simple Storage Service (S3)?
Amazon S3 is online bulk storage that can be accessed from almost any device. The storage is highly scalable, reliable, fast and inexpensive, and can be used to store any type of file.
AWS achieves such high levels of durability and availability with S3 because objects are replicated across multiple availability zones within a region when they’re uploaded. As we discussed previously, each availability zone should be isolated from the problems faced by other availability zones, which creates excellent data redundancy.
Access to S3 is managed through IAM. Policies can be applied to users, groups or roles to enable access by AWS users or the applications running in AWS.
AWS uses the concept of buckets in which to store your data. A bucket is essentially a root level folder and can contain sub-folders. Let’s look at what that could mean in the context of your local computer.
On your local computer, you probably have a folder called ‘Pictures’. This would be the ‘root’ folder, or bucket. Inside this, you may have a folder called ‘Christmas Photos’. This is referred to as a folder within that bucket (S3 records it as a prefix on the object’s key rather than as a real directory). Anything stored within either the bucket or a sub-folder is referred to as an object.
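To make the folder idea concrete, here is a minimal sketch of how a flat list of object keys can be presented as folders by splitting on a delimiter. The function name `list_folder` and the sample keys are made up for illustration; this runs locally and does not call S3.

```python
def list_folder(keys, prefix="", delimiter="/"):
    """Return (objects, subfolders) directly under `prefix`, treating keys as flat strings."""
    objects, subfolders = [], set()
    for key in keys:
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        if delimiter in rest:
            # Everything up to the next delimiter acts like a sub-folder name.
            subfolders.add(prefix + rest.split(delimiter, 1)[0] + delimiter)
        elif rest:
            objects.append(key)
    return sorted(objects), sorted(subfolders)


# Hypothetical contents of a bucket that plays the role of 'Pictures'.
keys = [
    "holiday.jpg",
    "Christmas Photos/tree.jpg",
    "Christmas Photos/dinner.jpg",
    "Christmas Photos/2015/snow.jpg",
]

top_objects, top_folders = list_folder(keys)
xmas_objects, xmas_folders = list_folder(keys, prefix="Christmas Photos/")
print(top_objects, top_folders)
print(xmas_objects, xmas_folders)
```

Every object here is just a key in one flat list; the ‘Christmas Photos’ folder only appears because several keys share that prefix.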
Where does S3 fit in your environment?
As you can see from the diagram below, S3 sits outside of the VPC but is still controlled by IAM and can be monitored by CloudTrail and CloudWatch.
Within your environment, you could use S3 as a simple document store, a backup solution, storage for your application (accessed by EC2 instances), or any other type of storage that requires scalability and resilience.
It’s important to note that all bucket names must be unique across AWS. So, if you create a bucket called my-bucket, no other user in the world can have the same bucket name. This does mean you may have to get a little creative with your bucket naming conventions to find a bucket name that hasn’t already been taken.
Once you’ve chosen an available name, you must also remember that S3 buckets are created in specific regions. Any data you upload will then exist in that region. In order to reduce latency, you should place your data in the region closest to the customers you’re serving.
You can control who can access and use your bucket at a very granular level. At bucket level, you can decide which users may access the bucket and whether they are able to upload or download items.
At object level, you can also control access permissions. So, just because someone has access to the bucket doesn’t mean they’ll be able to download every file within it.
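As a rough illustration of this kind of granular control, the sketch below builds a policy document that allows one user to download objects under a single folder only. The bucket name, folder, account number and user are all hypothetical, and the helper `read_only_policy` is invented for this example; it only produces JSON and does not apply anything to AWS.

```python
import json


def read_only_policy(bucket, prefix, principal_arn):
    """Build a bucket policy that lets one principal download objects under a prefix."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowDownloadUnderPrefix",
                "Effect": "Allow",
                "Principal": {"AWS": principal_arn},
                "Action": ["s3:GetObject"],
                # Object-level resource: only keys under the prefix are covered.
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}*",
            }
        ],
    }


policy = read_only_policy(
    bucket="my-example-bucket",  # hypothetical bucket name
    prefix="reports/",           # hypothetical folder
    principal_arn="arn:aws:iam::123456789012:user/alice",  # hypothetical user
)
policy_json = json.dumps(policy, indent=2)
print(policy_json)
```

Because the resource ends in `reports/*`, the user in this sketch could download files in that folder but nothing else in the bucket.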
You can also create a publicly accessible link using the actions menu for any object in S3, so that it can be shared with non-AWS users.
- All buckets and objects are private by default (except for the bucket owner)
- Objects in a bucket can be between 0 bytes and 5 terabytes in size
- Bucket names must be between 3 and 63 characters long and can only contain lowercase letters, numbers, periods and hyphens
- You can have a maximum of 100 buckets per AWS account by default at the time of writing; this is a soft limit that AWS can raise on request
- Bucket ownership cannot be transferred
- We can control access and user privileges through bucket policies. These are JSON documents.
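The naming rules in the list above are easy to check before you try to create a bucket. The following is a minimal sketch: the function `is_valid_bucket_name` is made up for this example, it also accepts periods (which AWS permits in bucket names) and assumes the additional AWS rule that a name begins and ends with a letter or number. It cannot tell you whether a name has already been taken by another user.

```python
import re

# 3 to 63 characters; lowercase letters, digits, hyphens and periods;
# first and last character must be a letter or digit.
_BUCKET_NAME = re.compile(r"[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]")


def is_valid_bucket_name(name):
    """Check a candidate bucket name against the basic naming rules only."""
    return _BUCKET_NAME.fullmatch(name) is not None


for candidate in ("my-bucket", "My-Bucket", "ab", "-bucket"):
    print(candidate, is_valid_bucket_name(candidate))
```

AWS applies a few further restrictions that this sketch ignores, so treat a passing result as a first check rather than a guarantee.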
Side Note:
For new objects uploaded to S3, AWS supports read-after-write consistency, which means an object is immediately available once uploaded to S3.
For overwrites and deletes of an existing object, S3 originally offered only eventual consistency, which meant there could be a slight delay before the update was visible to all users. Since December 2020, S3 has provided strong read-after-write consistency for these operations as well.
All items within an S3 bucket can be encrypted. This happens in one of two ways:
| Encryption Type | Description |
|---|---|
| Server Side Encryption | Server side encryption is where AWS encrypts the files before saving them to S3 and decrypts them when they’re downloaded. |
| Client Side Encryption | Client side encryption is where you use your own encryption keys to encrypt the file before you upload it to S3 and decrypt it once it’s downloaded. You are responsible in this scenario for looking after your own keys. |
Amazon publishes its pricing on its website and updates it frequently. With the S3 service, you will be charged for the items you store at a cost per gigabyte.
You’ll also be charged for certain types of request, such as those that move data into or out of S3. These include PUT, COPY, POST, LIST and GET requests, in addition to lifecycle transition, data retrieval, data archive and data restore requests.
Ensure you check the AWS website for pricing before utilizing their services.
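To see how storage and request charges combine, here is a small sketch of the arithmetic. The rates in `EXAMPLE_RATES` are placeholders invented for this example and are not real AWS prices, and the function `estimate_monthly_cost` is hypothetical; swap in the published rates for your region before relying on the result.

```python
from decimal import Decimal

# Placeholder rates for illustration only; these are NOT real AWS prices.
EXAMPLE_RATES = {
    "storage_per_gb_month": Decimal("0.02"),
    "per_1000_put_requests": Decimal("0.005"),
    "per_1000_get_requests": Decimal("0.0004"),
}


def estimate_monthly_cost(gb_stored, put_requests, get_requests, rates=EXAMPLE_RATES):
    """Estimate a monthly bill as storage cost plus request cost."""
    storage = Decimal(gb_stored) * rates["storage_per_gb_month"]
    puts = Decimal(put_requests) / 1000 * rates["per_1000_put_requests"]
    gets = Decimal(get_requests) / 1000 * rates["per_1000_get_requests"]
    return storage + puts + gets


cost = estimate_monthly_cost(gb_stored=500, put_requests=20_000, get_requests=1_000_000)
print(f"Estimated monthly cost: ${cost:.2f}")
```

`Decimal` is used instead of floating point so that small per-request rates add up without rounding surprises.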
The table below includes each of the available storage classes, along with their durability, availability and cost. First, let’s define durability and availability:
Durability (fault tolerance) is the chance an object will not be lost in a given year. So, 99.999999999% (eleven nines) durability would mean that there is a 0.000000001% chance that a file will be lost in a year. Or, you could say that if you were storing 10,000 objects, you’d lose one object every 10 million years.
Availability is the percentage of time that a file will be available. So, if you have 99.99% availability, you can expect to have 1 hour where the file is unavailable for every 10,000 hours.
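The two worked figures above can be reproduced with a few lines of arithmetic. This is a minimal sketch with made-up function names; percentages are passed as strings so that `Fraction` can keep them exact instead of rounding them as floating point numbers.

```python
from fractions import Fraction


def annual_loss_probability(durability_percent):
    """Chance that a single object is lost in a year, as an exact fraction."""
    return 1 - Fraction(durability_percent) / 100


def years_per_lost_object(object_count, durability_percent):
    """Average number of years between losses across `object_count` objects."""
    expected_losses_per_year = object_count * annual_loss_probability(durability_percent)
    return 1 / expected_losses_per_year


def unavailable_hours(availability_percent, period_hours):
    """Expected hours of unavailability within a period of `period_hours`."""
    return (1 - Fraction(availability_percent) / 100) * period_hours


years = years_per_lost_object(10_000, "99.999999999")
downtime = unavailable_hours("99.99", 10_000)
print(f"One object lost every {int(years):,} years on average")
print(f"{float(downtime)} hour(s) unavailable per 10,000 hours")
```

With eleven nines and 10,000 objects the expected loss rate works out to one object per 10 million years, and 99.99% availability works out to 1 hour in every 10,000, matching the figures in the text.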
| Storage Class | Use Case | Durability | Availability | Cost |
|---|---|---|---|---|
| Standard | All purpose, default storage | 99.999999999% | 99.99% | Highest |
| RRS | For non-critical, reproducible objects | 99.99% | 99.90% | High |
| Infrequent Access (S3-IA) | For files you don’t access frequently. Immediately available when you do need them. | 99.999999999% | 99.90% | Medium |
| Glacier | Archive storage. Up to 1 day to retrieve files stored | 99.999999999% | NA | Low |
You can switch between standard, reduced redundancy and infrequent access storage at any time. However, to switch to Glacier, you must apply lifecycle rules to your S3 objects. The move to Glacier can take 1 or 2 days to take effect.
Object lifecycles are a set of rules that define what happens to an object in an S3 bucket at certain time intervals.
For example: let’s say you have a file containing the current month’s budget for your company. At the moment, you’re working on that file every day, so you need it to have very high availability. At the end of the month, you’ll be accessing it once per week for your ‘actual’ spend profile for the rest of the year. At the end of the year, you’ll probably never open the file but must retain it for audit purposes.
So, in this scenario, you could set up the following lifecycle policy:
- Standard Storage until day 30
- Infrequent Access storage from day 30 to the end of the year
- Glacier storage until the file needs to be deleted
Object lifecycles help us to keep our cost of storage as low as possible while retaining the accessibility and durability that we require.
These lifecycle policies can be applied to the entire S3 bucket or a specific folder / file within the bucket. You can delete the policy at any time and manually change the storage class back to the class you require.
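The budget example can be written down as a rule and simulated locally. The dictionary below is modelled on the shape of an S3 lifecycle configuration, but the rule ID and the `budgets/` prefix are hypothetical, "end of the year" is assumed to mean day 365, and `storage_class_at` is an invented helper that only walks through the rule; nothing here talks to AWS.

```python
# Lifecycle rule for the budget example: Standard, then Infrequent Access
# at day 30, then Glacier at day 365. The prefix and rule ID are hypothetical.
LIFECYCLE_CONFIGURATION = {
    "Rules": [
        {
            "ID": "budget-files",
            "Filter": {"Prefix": "budgets/"},
            "Status": "Enabled",
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 365, "StorageClass": "GLACIER"},
            ],
        }
    ]
}


def storage_class_at(age_days, rule, initial_class="STANDARD"):
    """Return the storage class an object would be in after `age_days` days."""
    current = initial_class
    for transition in sorted(rule["Transitions"], key=lambda t: t["Days"]):
        if age_days >= transition["Days"]:
            current = transition["StorageClass"]
    return current


rule = LIFECYCLE_CONFIGURATION["Rules"][0]
for age in (10, 30, 200, 400):
    print(age, storage_class_at(age, rule))
```

Stepping through a few ages like this is a quick way to sanity-check a rule before applying it to a real bucket.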
Objects can be versioned in AWS. This is where AWS tracks and stores all versions of your object, so that you can always access older versions of that object.
It’s important to note that versioning is either on or off. It applies to the entire bucket and all objects held within it. Once you’ve turned versioning on, you can’t turn it off; you can only suspend it, which stops new versions from being retained from that point forward. All older versions will remain available on AWS.
Of course, by saving older versions of your objects you will increase your storage usage, which will increase your storage costs. However, versioning can be thought of as a comprehensive backup tool for your business and therefore has inherent value.
To combat the increased cost, you can create lifecycle policies to work hand in hand with versioning to control the number of versions stored in your S3 bucket.
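The interaction between versioning and a version-limiting lifecycle policy can be pictured with a toy model. The class below is purely illustrative and is not the S3 API: `VersionedBucket`, `max_versions` and `versions_back` are names invented for this sketch, and everything is held in memory.

```python
class VersionedBucket:
    """Toy in-memory model of a versioned bucket; not the S3 API."""

    def __init__(self, max_versions=None):
        self.max_versions = max_versions  # None keeps every version
        self._versions = {}               # key -> list of bodies, oldest first

    def put(self, key, body):
        history = self._versions.setdefault(key, [])
        history.append(body)
        if self.max_versions is not None:
            # Stand-in for a lifecycle rule that expires the oldest versions.
            excess = len(history) - self.max_versions
            if excess > 0:
                del history[:excess]

    def get(self, key, versions_back=0):
        """Return the current body, or one `versions_back` steps older."""
        return self._versions[key][-1 - versions_back]

    def version_count(self, key):
        return len(self._versions.get(key, []))


bucket = VersionedBucket(max_versions=2)
for body in ("draft 1", "draft 2", "final"):
    bucket.put("budget.xlsx", body)

print(bucket.get("budget.xlsx"))                   # newest version
print(bucket.get("budget.xlsx", versions_back=1))  # previous version
print(bucket.version_count("budget.xlsx"))
```

Without the `max_versions` cap, every upload would add to the stored history, which is exactly where the extra storage cost comes from.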
S3 event notifications allow you to set up automated communications between S3 and other AWS services when an event happens on S3. These events can include:
- The loss of an object from RRS
- Put, Post, Copy
- Completion of a multi part upload
These events can then trigger an SNS topic, an SQS queue or a Lambda function to carry out a task based on that event. We will discuss each of these services later in the book.
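Whichever service receives the notification, it arrives as a JSON message describing what happened. The sketch below pulls the event name, bucket and object key out of such a message. The sample is hand-written and trimmed to the fields this sketch reads (real messages carry more), the bucket and key are hypothetical, and `describe_s3_event` is an invented helper name.

```python
from urllib.parse import unquote_plus


def describe_s3_event(event):
    """Summarise each record in an S3 event notification message."""
    summaries = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded, so decode them before use.
        key = unquote_plus(record["s3"]["object"]["key"])
        summaries.append((record["eventName"], bucket, key))
    return summaries


# Hand-written sample message; bucket and key are hypothetical.
sample_event = {
    "Records": [
        {
            "eventSource": "aws:s3",
            "eventName": "ObjectCreated:Put",
            "s3": {
                "bucket": {"name": "my-example-bucket"},
                "object": {"key": "Christmas+Photos/tree.jpg", "size": 1024},
            },
        }
    ]
}

summaries = describe_s3_event(sample_event)
print(summaries)
```

A function like this could form the first few lines of a Lambda handler that decides what to do with each uploaded object.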
Static Website Hosting
S3 can also host static websites. This becomes particularly useful when it comes to serving error pages for your web application. Let’s say you have an EC2 instance running your application and the instance goes down. We can use Route 53 to serve static pages from S3 rather than simply showing an error because your server is unreachable.
Getting data into and out of AWS
| Method | Description | Size Limit |
|---|---|---|
| Single Operation Upload | This is a single upload, as you would do through the console in AWS. This can be used for files of up to 5GB, but ideally multi-part upload should be used for all files over 100MB. | Up to 5GB |
| Multi Part Upload | This is where you break your file into many small ‘chunks’ of data. These can then be uploaded to AWS in parallel. If one part of the upload fails, you can re-transmit just that part. | Must be used for files larger than 5GB, up to 5TB |
| AWS Import / Export | This service allows you to physically mail a hard drive full of your data to AWS. Once received, they will upload it to Amazon S3 within 1 business day. If you need the data back (e.g. your on-premises network fails), you can ask AWS to mail it back to you. | Up to 16TB per job |
| Snowball | Snowball is very similar to the AWS Import / Export service, except that instead of sending AWS one of your hard drives, they will send you one of their very high capacity (petabytes) drives. You can then send it back to them and they will upload the data to S3. | Petabytes |
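To picture how a multi-part upload breaks a file into chunks, here is a minimal sketch that plans the byte ranges for each part. It only does the arithmetic and uploads nothing; `plan_parts` is an invented name, the 100MB default part size is an assumption for this example, and the 5MB minimum part size and 10,000 part ceiling reflect S3's multi-part limits as I understand them.

```python
MIB = 1024 * 1024
MIN_PART_SIZE = 5 * MIB   # smallest allowed part, except for the last one
MAX_PARTS = 10_000        # most parts allowed in one multi-part upload


def plan_parts(file_size, part_size=100 * MIB):
    """Split `file_size` bytes into (part_number, offset, length) tuples."""
    if part_size < MIN_PART_SIZE:
        raise ValueError("part_size is below the minimum part size")
    parts = []
    offset = 0
    while offset < file_size:
        length = min(part_size, file_size - offset)
        parts.append((len(parts) + 1, offset, length))
        offset += length
    if len(parts) > MAX_PARTS:
        raise ValueError("too many parts; choose a larger part_size")
    return parts


parts = plan_parts(250 * MIB)
for number, offset, length in parts:
    print(f"part {number}: bytes {offset} to {offset + length - 1}")
```

Each tuple describes an independent chunk, which is why parts can be sent in parallel and why a failed part can be re-sent on its own.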
Storage Gateways – Hybrid Solution
There are two major types of storage gateway available in AWS. They are outlined below:
- Gateway Cached Volumes: This is where all of your data is stored in AWS S3 but frequently accessed files are cached locally for quick access.
- Gateway Stored Volumes: This is where all of your data remains on-premises but AWS takes periodic snapshots to create incremental backups of your data in S3.
Side Note:
We can run analytics against the ELB log files by utilizing S3 to store the log files and EMR to process them.