System Architect, Stephen Jefferson, explains how the creation of lifecycle policies can lessen storage costs whilst improving data management, delivering the first of what will be a series of posts by our crack SysOps team.
Without maintaining and performing housekeeping on S3, you'll quickly find your S3 buckets filling with data. Over time, a large volume of this data will become redundant or low priority (regarding access), which results in unnecessary use of certain storage classes; while the data may no longer be accessed by users, you're paying for it as though it frequently is.
AWS provide ways to retain this data and almost compartmentalise it into different types of S3 storage classes. Before explaining how to perform such housekeeping, it's important first to review the different storage classes available.
S3 Standard is one of the most commonly used storage classes and provides a general purpose storage solution for users who require frequent and rapid access to their data. This is particularly useful for hosting websites, and content for web and mobile applications.
S3 Standard-IA (Infrequent Access) is for users who require infrequent but rapid access to data. It is a more appropriate solution for long-lived data such as backups and older files: storage costs less than in S3 Standard, but each retrieval incurs a per-GB charge.
S3 One Zone-IA (Infrequent Access) is similar to S3 Standard-IA, but stores data in a single availability zone rather than across several. Opting for this choice is rewarded with a 20% reduction in cost compared with S3 Standard-IA, but does reduce availability and resilience.
S3 Glacier is a low-cost storage solution for storing long-term backups or archived data that needs to be retained. This provides comprehensive security and compliance capabilities, but dramatically reduces the accessibility of the data; depending on the retrieval option chosen, getting data back can take anywhere from a few minutes to several hours.
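When scripting against S3 rather than using the console, each of the classes above is referred to by an identifier in the API's StorageClass field. The lookup below is a minimal sketch in Python; the names STORAGE_CLASS_IDS and api_id are illustrative and not part of any AWS SDK.

```python
# Storage class identifiers as accepted by the S3 API's StorageClass field.
# STORAGE_CLASS_IDS and api_id are illustrative names, not part of any AWS SDK.
STORAGE_CLASS_IDS = {
    "S3 Standard": "STANDARD",
    "S3 Standard-IA": "STANDARD_IA",
    "S3 One Zone-IA": "ONEZONE_IA",
    "S3 Glacier": "GLACIER",
}


def api_id(display_name: str) -> str:
    """Return the API identifier for a storage class named in this post."""
    try:
        return STORAGE_CLASS_IDS[display_name]
    except KeyError:
        raise ValueError(f"Unknown storage class: {display_name!r}") from None
```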
The Lifecycle Policy
During the lifecycle of any system, usage patterns will emerge. A typical example of this is recently created data being accessed frequently by users, older data accessed less often and the oldest data not at all. Storing all of this data in the S3 Standard storage class would be wasteful and incur higher costs than other available options.
Conveniently, AWS provides a feature within S3 known as a lifecycle policy. It lets you move your data between storage classes automatically once objects reach a specified age, which addresses the cost problem created by the usage pattern described above.
An example of a lifecycle policy set up to automate data storage by frequency of access.
A lifecycle policy can be set up so that recent data is stored in the S3 Standard storage class and is moved into one of the Infrequent Access storage classes once it reaches the age at which it is typically accessed less often, lowering the total cost of storing the data. Note that lifecycle transitions are triggered by the age of an object, not by measured access, so the intervals you choose should reflect the usage pattern you have observed. When the data is old enough that it is no longer being accessed but still needs to be retained, it can be archived by moving it into the S3 Glacier storage class.
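The same idea can be written down as a lifecycle rule in the shape the S3 API accepts. The following is a sketch only: the function name, rule ID and the 30 and 90 day thresholds are assumptions chosen for illustration, not values taken from this post, and S3 applies its own validation on top of the checks shown here.

```python
import json


def build_transition_rule(rule_id, prefix="", ia_after_days=30, glacier_after_days=90):
    """Build one lifecycle rule that tiers objects by age.

    The day counts are hypothetical defaults; pick values that match
    your own access pattern.
    """
    # S3 does not transition objects to Standard-IA before they are 30 days old.
    if ia_after_days < 30:
        raise ValueError("Standard-IA transition must be at least 30 days")
    if glacier_after_days <= ia_after_days:
        raise ValueError("Glacier transition must come after the Standard-IA one")
    return {
        "ID": rule_id,
        "Status": "Enabled",
        # An empty prefix applies the rule to every object in the bucket.
        "Filter": {"Prefix": prefix},
        "Transitions": [
            {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
            {"Days": glacier_after_days, "StorageClass": "GLACIER"},
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_transition_rule("tier-by-age"), indent=2))
```

Keeping the rule as plain data makes it easy to review or store in version control before it is applied to a bucket.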
Setting up your Lifecycle Policy
To set up a lifecycle policy, go into the S3 section within the AWS console and select your desired S3 bucket. Click on the management tab, and you will be presented with a lifecycle tab. This is where you'll view the lifecycle of your S3 bucket; if you have not set one up before, this will be empty.
Click 'Add lifecycle rule', and a modal will be presented where you can name and add tags to your lifecycle rule (it is good practice to add tags to all created items in AWS).
The next modal screen is where you can set up your S3 storage class rules. You can enable this for both the current and previous versions of an S3 object. To set up the structure as per the previous example, we'd add the following transitions to the lifecycle rule:
The AWS modal screen for creation of storage class rules.
In this lifecycle policy, we have added transitions for both current and previous versions of an object which follow the rules proposed earlier.
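Outside the console, transitions for previous versions are expressed with a separate key on the rule, counted in days since a version became noncurrent. A rough sketch follows, in which the helper name and the day counts are assumptions made for illustration:

```python
def add_noncurrent_transitions(rule, ia_after_days=30, glacier_after_days=90):
    """Return a copy of `rule` that also tiers previous (noncurrent) versions.

    The day counts are hypothetical; they count from the moment a version
    stops being the current one.
    """
    updated = dict(rule)  # shallow copy so the caller's rule is left untouched
    updated["NoncurrentVersionTransitions"] = [
        {"NoncurrentDays": ia_after_days, "StorageClass": "STANDARD_IA"},
        {"NoncurrentDays": glacier_after_days, "StorageClass": "GLACIER"},
    ]
    return updated


# Example: a rule that already tiers current versions (illustrative values).
base_rule = {
    "ID": "tier-by-age",
    "Status": "Enabled",
    "Filter": {"Prefix": ""},
    "Transitions": [
        {"Days": 30, "StorageClass": "STANDARD_IA"},
        {"Days": 90, "StorageClass": "GLACIER"},
    ],
}
versioned_rule = add_noncurrent_transitions(base_rule)
```

Noncurrent transitions only have an effect on buckets with versioning enabled.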
In the next screen, the expiration screen, we can set whether we want to expire the content. As we'd be keeping the data in S3 Glacier, we will not be expiring the content. However, there is one option here which we will enable: “Clean up incomplete multipart uploads”. This lets us set a time limit, in days, after which any multipart upload that has not completed successfully is removed from the bucket.
Once we've finished configuring the lifecycle policy, we can move on to the review screen to check the settings before saving.
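In API terms, the multipart cleanup option corresponds to the AbortIncompleteMultipartUpload action on a rule. A minimal sketch is shown below; the function name, rule ID and the seven day default are assumptions for illustration rather than values from this post.

```python
def multipart_cleanup_rule(days_after_initiation=7):
    """Build a rule that aborts multipart uploads left incomplete too long.

    The 7-day default is a hypothetical value; choose whatever window
    suits your largest expected upload.
    """
    if days_after_initiation < 1:
        raise ValueError("days_after_initiation must be a positive number of days")
    return {
        "ID": "abort-incomplete-multipart-uploads",
        "Status": "Enabled",
        # An empty prefix applies the rule to the whole bucket.
        "Filter": {"Prefix": ""},
        "AbortIncompleteMultipartUpload": {
            "DaysAfterInitiation": days_after_initiation
        },
    }
```

Parts of an abandoned multipart upload are billed as storage until they are removed, which is why this option is worth enabling even when nothing else is expired.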
That’s it! The policy should now be live and available in the lifecycle dashboard for the S3 bucket, and you now have an automated cleanup process for your S3 bucket.
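For anyone who prefers scripting to clicking through the console, the same result can be reached through the S3 API. The sketch below assumes a boto3 S3 client is passed in; apply_lifecycle is a hypothetical helper name and the bucket name in the usage comment is a placeholder.

```python
def apply_lifecycle(client, bucket, rules):
    """Send a list of lifecycle rules to a bucket.

    `client` is expected to behave like boto3.client("s3"). Note that this
    call replaces any lifecycle configuration already on the bucket.
    """
    rules = list(rules)
    if not rules:
        raise ValueError("at least one lifecycle rule is required")
    return client.put_bucket_lifecycle_configuration(
        Bucket=bucket,
        LifecycleConfiguration={"Rules": rules},
    )


# Usage (requires boto3 and AWS credentials; the bucket name is a placeholder):
#   import boto3
#   rule = {
#       "ID": "abort-incomplete-multipart-uploads",
#       "Status": "Enabled",
#       "Filter": {"Prefix": ""},
#       "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
#   }
#   apply_lifecycle(boto3.client("s3"), "my-example-bucket", [rule])
```

Passing the client in as an argument keeps the helper easy to test without touching a real bucket.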
Don’t know your data patterns?
If you don’t know the pattern of your data (for example, if you're asking yourself 'when is my data accessed less frequently?'), there is a dedicated storage class for this situation, offered for a small monthly monitoring and automation charge per object.
This is the recently released Amazon S3 Intelligent-Tiering class, introduced for unpredictable data access patterns. The Intelligent-Tiering class monitors the data and automatically moves it between a frequent access tier and an infrequent access tier depending on whether the S3 object has been accessed. For example, where an object hasn’t been accessed for a set period (30 consecutive days at the time of writing), it will be moved into the infrequent access tier. If the object is then accessed once more, it will be moved back into the frequent access tier.
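To make the behaviour concrete, here is a toy model of that decision in Python. It is purely illustrative and is not how AWS implements the feature; the function name is made up, and the 30 day default threshold is an assumption that can be changed.

```python
from datetime import datetime, timedelta


def intelligent_tier(last_accessed, now, threshold_days=30):
    """Toy model: which tier would an object sit in, given its last access?

    threshold_days is an assumed idle period; this mimics the described
    behaviour only and is not AWS's implementation.
    """
    idle = now - last_accessed
    if idle >= timedelta(days=threshold_days):
        return "infrequent"
    # Any access inside the window keeps (or returns) the object to the frequent tier.
    return "frequent"


# Example with fixed dates so the result does not depend on the current time.
now = datetime(2020, 1, 31)
old_object_tier = intelligent_tier(datetime(2020, 1, 1), now)      # idle for 30 days
recent_object_tier = intelligent_tier(datetime(2020, 1, 30), now)  # idle for 1 day
```

Here the first object has been idle for the full threshold and lands in the infrequent tier, while the second stays in the frequent tier.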
In the next transmission from hedgehog lab's SysOps team, Senior System Architect, Russell Collingham, will walk you through strategies that will help you bounce back following a cloud based disaster.