Despite the rock star status of S3, this chapter still covers all four AWS storage services (drumroll, please):
✓ Simple Storage Service (S3): Provides highly scalable object storage in the form of unstructured collections of bits
✓ Elastic Block Storage (EBS): Provides highly available and reliable data volumes that can be attached to a virtual machine (VM), detached, and then reattached to another VM
✓ Glacier: A data archiving solution; provides low-cost, highly robust archival data storage and retrieval
✓ DynamoDB: Key-value storage; provides highly scalable, high-performance storage based on tables indexed by data values referred to as keys
Put simply, the enormous growth of storage makes traditional approaches
(local storage, network-attached storage, storage-area networks, and the like)
no longer appropriate, for these three reasons:
✓ Scaling: Traditional methods simply can’t scale large enough to handle the volume of data that companies now generate. The amounts of data that companies must manage outstrip the capabilities of almost all storage solutions.
✓ Speed: They can’t move data fast enough to respond to the demands that companies are placing on their storage solutions. To be blunt, most corporate networks cannot handle the level of traffic required to shunt around all the bits that companies store.
✓ Cost: Given the volumes of data being addressed, the established solutions aren’t economically viable — they’re unaffordable at the scale that companies now require.
For these reasons, the issue of storage has long since moved beyond local storage (for example, disk drives located within the server using the data). Over the past couple of decades, two other forms of traditional storage have entered the market — network-attached storage (NAS) and storage-area networks (SAN) — which move storage from the local server to within the network on which the server sits. When the server requires data, rather than search a local disk for it, it seeks it out over the network.
Both types of storage continue to be widely used, but today's much larger data volumes exceed what either NAS or SAN can support. Consequently, newer storage types that provide better functionality have come to the fore.
In particular, two new storage types are now available:
✓ Object: Reliably stores and retrieves unstructured digital objects
✓ Key-value: Manages structured data
Object storage
Object storage provides the ability to store, well, objects — which are essentially collections of digital bits. Those bits may represent a digital photo, an MRI scan, a structured document such as an XML file — or the video of your cousin’s embarrassing attempt to ride a skateboard down the steps at the public library (the one you premiered at his wedding).
Object storage offers the reliable (and highly scalable) storage of collections of bits, but imposes no structure on the bits. The structure is chosen by the user, who needs to know, for example, whether an object is a photo (which can be edited) or an MRI scan (which requires a special application for viewing it). The user has to know both the format and the manipulation methods of the object. The object storage service simply provides reliable storage of the bits.
Object storage differs from file storage, which you may be more familiar with from using a PC. File storage offers update functionality, and object storage does not. For example, suppose you are storing logging output from a program. The program constantly adds new logging entries as events occur; creating a new object each time an additional log record is created would be incredibly inconvenient. By contrast, using file storage allows you to continuously update the file by appending new information to it — in other words, you update the file as the program creates new log records.
Object storage offers no such update ability. You can insert or retrieve an object, but you can’t change it. Instead, you update the object in the local application and then insert the object into the object store. To let the new version retain the same name as the old version, delete the original object before inserting the new object with the same name. The difference may seem minor, but it requires different approaches to managing stored objects.
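The retrieve, modify, delete, and reinsert cycle just described can be sketched with a toy in-memory store. The ObjectStore class and its method names below are purely illustrative inventions, not part of any AWS library:

```python
class ObjectStore:
    """Toy object store: put, get, and delete only -- no in-place update."""

    def __init__(self):
        self._objects = {}          # object name -> stored bytes

    def put(self, name, data):
        self._objects[name] = bytes(data)

    def get(self, name):
        return self._objects[name]

    def delete(self, name):
        del self._objects[name]


store = ObjectStore()
store.put("app.log", b"event 1\n")

# "Updating" the log means: retrieve, modify locally, delete, reinsert.
log = store.get("app.log")
log += b"event 2\n"                 # modify the copy in the application
store.delete("app.log")             # remove the old object...
store.put("app.log", log)           # ...then insert the new one under the same name

print(store.get("app.log"))
```

Note that between the delete and the put, no object with that name exists in the store, which is one reason the approach differs so much from appending to a file.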
Distributed key-value storage
Distributed key-value storage, in contrast to object storage, provides structured storage that is somewhat akin to a database but different in important ways in order to provide additional scalability and performance. Perhaps you’ve already used a relational database management system — a storage product that’s commonly referred to as RDBMS. Its rows of data have one or more keys (hence the name key-value storage) that support manipulation of the data.
Though key-value storage systems vary from one product to another, they share these common characteristics:
✓ Data is structured with a single key that’s used to identify the record in which all remaining data resides. The key is almost always unique — such as a user number, a unique username (title_1795456, for example), or a part number. This ensures that each record has a unique key, which helps facilitate scale and performance.
✓ Retrieval is restricted to the key value. For example, to find all records with a common address (where the address is not the key), every record has to be examined.
✓ No support exists for performing searches across multiple datasets with common data elements. RDBMS systems allow joins: For a given username in a dataset, find all records in a second dataset that have the username in individual records. For example, to find all books that a library patron has checked out, perform a join of the user table (where the user’s last name is used to identify her library ID) and the book checkout table (where each book is listed along with the library IDs of everyone who has checked it out).
You can use the join functionality of an RDBMS system to execute this query; by contrast, because key-value systems don't support joins, the two tables would have to be matched at the application level rather than by the storage system. Using this concept, commonly described as “The intelligence resides in the application,” executing joins requires application “smarts” and lots of additional coding.
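Because the storage system won't perform the join, the matching logic lives in the application. Here is a minimal sketch of the library example; the table contents and the books_checked_out_by function are invented for illustration:

```python
# Hypothetical "users" table keyed by library ID, and a flat list of
# checkout records, each carrying the borrower's library ID.
users = {
    "lib_001": {"last_name": "Nguyen"},
    "lib_002": {"last_name": "Okafor"},
}
checkouts = [
    {"book": "Moby-Dick", "library_id": "lib_001"},
    {"book": "Dune",      "library_id": "lib_002"},
    {"book": "Emma",      "library_id": "lib_001"},
]

def books_checked_out_by(last_name):
    """Application-level 'join': find the user's key, then scan checkouts."""
    # Step 1: last_name is not the key, so every user record must be examined.
    ids = {uid for uid, user in users.items() if user["last_name"] == last_name}
    # Step 2: scan every checkout record for a matching library ID.
    return [c["book"] for c in checkouts if c["library_id"] in ids]

print(books_checked_out_by("Nguyen"))   # both scans happen in the application
```

An RDBMS would execute the equivalent SQL join inside the database engine; here, every line of matching logic is application code, which is exactly the extra work the “intelligence resides in the application” phrase describes.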
The need for storage flexibility is why Amazon offers four types of storage. You may not need all four — many users manage with only one or two. Still, you should understand every option AWS offers, because a new requirement may be better served by a service you haven't yet used than by the one you rely on today.
Storing Items in the Simple Storage Service (S3) Bucket
Simple Storage Service (fondly known as S3) is one of the richest, most flexible, and, certainly, most widely used AWS offerings.
S3 storage basics
Let me get down to brass tacks and talk about how S3 works. S3 objects are treated as web objects — that is, they’re accessed via Internet protocols using a URL identifier.
✓ Every S3 object has a unique URL, in this format: http://s3.amazonaws.com/bucket/key
✓ An actual S3 object using this format looks like this: http://s3-us-west-1.amazonaws.com/aws4dummies/Cat+Photo.JPG
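Notice that the space in the key "Cat Photo.JPG" appears as a + in the URL. A small sketch shows how such a URL can be assembled; the s3_url function is a hypothetical helper, and the default endpoint is illustrative (regional buckets use a region-specific endpoint):

```python
from urllib.parse import quote_plus

def s3_url(bucket, key, endpoint="s3.amazonaws.com"):
    """Build an S3-style URL; spaces in the key are encoded as '+'."""
    return "http://{}/{}/{}".format(endpoint, bucket, quote_plus(key))

print(s3_url("aws4dummies", "Cat Photo.JPG",
             endpoint="s3-us-west-1.amazonaws.com"))
# -> http://s3-us-west-1.amazonaws.com/aws4dummies/Cat+Photo.JPG
```

The encoding step matters: a raw space isn't legal in a URL, so any tool that builds S3 URLs by hand has to encode the key.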
A bucket in AWS is a group of objects. The bucket’s name is associated with an account — for example, the bucket named aws4dummies is associated with my aws4dummies account. The bucket name doesn’t need to be the same as the account name; it can be anything. However, the bucket namespace is completely flat: Every bucket name must be unique among all users of AWS.
(Just so you know, an account is limited to 100 buckets.)
Bucket names have a number of restrictions, as described at http://docs.amazonwebservices.com/AmazonS3/latest/dev/BucketRestrictions.html. My recommendation: Stick with simple names that are easily understood, to simplify using S3 and avoid problems.
A key in AWS is the name of an object, and it acts as an identifier to locate the data associated with that key.
S3 object management
An S3 object isn’t a complicated creature — it’s simply a collection of bytes. The service imposes no restrictions on the object format — it’s up to you. The only limitation is on object size: An S3 object is limited to 5TB. (That’s large.)
Managing objects in S3
Like all AWS offerings, S3 is accessed via an application programming interface, or API, and it supports both SOAP and REST interfaces. Of course, you probably won't use the (not particularly user-friendly) API to post (create), get (retrieve), or delete S3 objects. You may access them via a programming library that encapsulates the API calls and offers higher-level S3 functions that are easier to use. More likely, however, you'll use an even higher-level tool or application that provides a graphical interface to manage S3 objects. You can be sure, however, that somewhere down in the depths of the library or higher-level tool are calls to the S3 API.
In addition to the most obvious and useful actions for objects (such as post, get, and delete), S3 offers a wide range of object management actions — for example, an API call to get the version number of an object. Object storage disallows updating an object in place (unlike a file residing within a file system). S3 works around this issue by allowing versioning of S3 objects — you can modify version 2 of an S3 object, for example, and store the modified version as version 3. Versioning sidesteps the update process outlined earlier: Retrieve the old object, modify it in the application, delete the old object from S3, and then insert the modified object under the original object name.
S3 bucket and object security
AWS offers fine-grained access controls to implement S3 security: You can use these controls to explicitly control who-can-do-what with your S3 objects. The mechanism by which this access control is enforced is, naturally enough, the Access Control List (ACL).
These four types of people can access S3 objects:
✓ Owner: The person who created the object, who can also read or delete it.
✓ Specific users or groups: Particular users, or groups of users, within AWS. (Access may be restricted to other members of the owner’s company.)
✓ Authenticated users: People who have accounts within AWS and have been successfully authenticated.
✓ Everyone: Anyone on the Internet (as you may expect).
The access controls specify who, and the actions specify what — who has the right to do what with a given object. The interaction between the S3 access controls and the object actions gives S3 its fine-grained object management functionality.
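The who-versus-what pairing can be sketched as a simple lookup: each grant names a grantee and the actions it permits. The grantee labels and the is_allowed helper below are invented for illustration and don't mirror S3's actual ACL grammar:

```python
# Hypothetical ACL: each grantee maps to the set of actions it may perform.
acl = {
    "owner:jane":          {"read", "write", "delete"},
    "group:partners":      {"read"},
    "authenticated-users": {"read"},
}

def is_allowed(acl, grantee, action):
    """Permit an action only if some grant for this grantee names it."""
    return action in acl.get(grantee, set())

print(is_allowed(acl, "group:partners", "read"))    # True
print(is_allowed(acl, "group:partners", "delete"))  # False
print(is_allowed(acl, "everyone", "read"))          # False: no grant at all
```

The fine-grained flavor comes from the cross product: every grantee can be given a different subset of actions on every object.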
S3 uses, large and small
Making specific recommendations about what you should do with S3 is difficult because it’s extremely flexible and capable. Individual (rather than corporate) users tend to use S3 as secure, location-independent storage of digital media. Another common personal use for S3 is to back up local files, via either the AWS Management Console or one of the many consumer-oriented backup services.
Companies use S3 for the same reasons as individuals, and for many more use cases. For example, companies store content files used by their partners in S3. Most consumer electronics and appliance manufacturers now offer their user manuals in digital format; many of them store those files in S3.
Many companies place images and videos used in their corporate websites in S3, which reduces their storage management headaches — and ensures that in conditions of heavy web traffic, website performance isn’t hindered by inadequate network bandwidth.
The most common S3 actions revolve, naturally enough, around creating, retrieving, and deleting objects. Here’s the common lifecycle of an S3 object: Create the object in preparation to use it; set permissions to control access to the object; allow applications and people to retrieve the object as part of an application’s functionality; and delete the object when the application that uses the object no longer requires it. Of course, many objects are never removed, because they’re evergreen: They have ongoing purpose over a long time span.
S3 offers encryption of objects stored in the service, securing your data from anyone attempting to access it inappropriately. You can log requests made against S3 objects to audit when objects are accessed and by whom. S3 can even be used to host static websites — ones that don’t dynamically assemble data to create the pages served up as part of the website — removing the need to run a web server.
When an AWS virtual machine (VM) needs to access an S3 object, and the VM and the object reside in the same AWS region, Amazon imposes no charge for the network traffic that carries the object from S3 to EC2. If the VM and the object are in different regions, however (the traffic is carried over the Internet), AWS charges a few cents per gigabyte — which can be costly for very large objects or heavy use.
CloudFront lets you store only one copy of an object and have Amazon make it available in every region. Given S3’s importance to many applications, an obvious question is: How reliable is the service? The answer: It’s reliable. In fact, because AWS designed the service for 99.99-percent availability, it should be unavailable for only approximately 53 minutes per year. A complementary issue to availability is durability — how reliable is S3 at never losing your object? The answer to this question is even more exact: 99.999999999 percent.
How does AWS achieve this high level of availability and durability? In a word, redundancy. Within each region, AWS stores multiple copies of every S3 object, to prevent a hardware failure from making it impossible to access an object or, even worse, from destroying the only copy of it. Even if one copy is unavailable because of hardware failure, another is always available for access. If a hardware failure deletes a copy or makes it unavailable, AWS automatically creates a new, third copy to ensure that the object remains available and durable.
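The availability figure converts directly into yearly downtime, and redundancy explains the durability figure. The arithmetic below uses the 99.99-percent number from the text; the per-copy loss probability p is a made-up illustrative value, not an AWS figure:

```python
# Availability: 99.99 percent uptime implies this much downtime per year.
availability = 0.9999
minutes_per_year = 365 * 24 * 60              # 525,600 minutes
downtime = (1 - availability) * minutes_per_year
print(round(downtime, 2))                     # 52.56 -- "approximately 53 minutes"

# Durability intuition: if each stored copy were lost independently with
# probability p, all three copies would be lost with probability p ** 3,
# so the chance of total loss shrinks exponentially with each extra copy.
p = 1e-4                                      # illustrative assumption only
print(p ** 3)                                 # on the order of 1e-12
```

This is why adding copies is such a cheap way to buy durability: each copy multiplies the loss probability by another small factor.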
S3 cost
S3 has a simple cost structure: You pay per gigabyte of storage used by your objects. You’re also charged for API calls made to S3, at rates that don’t vary by volume. Finally, you pay for the network traffic caused by the delivery of S3 objects.
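The three charge components add up straightforwardly. The sketch below is a hypothetical calculator, and every price in it is a made-up placeholder — consult AWS's current pricing page for real rates:

```python
def monthly_s3_cost(stored_gb, requests, transfer_out_gb,
                    gb_price=0.03, request_price=0.000005,
                    transfer_price=0.09):
    """Sum the three S3 charge components: storage, API calls, and traffic.

    All prices are illustrative placeholders, not actual AWS rates.
    """
    return (stored_gb * gb_price            # per-gigabyte storage charge
            + requests * request_price      # flat per-request API charge
            + transfer_out_gb * transfer_price)  # outbound network traffic

# 100 GB stored, 200,000 API calls, 10 GB delivered to the Internet:
print(monthly_s3_cost(stored_gb=100, requests=200_000, transfer_out_gb=10))
```

Even a toy calculation like this makes one point obvious: for media-heavy workloads, the network-traffic component can rival or exceed the storage charge itself.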