AWS-CloudSearch

Searching with CloudSearch

Search is one of the most useful capabilities on the web, and huge businesses have been built on search. (Ever hear of Google?) However, not all searches need to cover the entire Internet, and some searches shouldn’t be public. For example, you may want to make content on your company’s website searchable, or limit who can see the results of a search.

The challenge for many companies that want to enable search on their websites or other content repositories is that the quality of the typical search tools associated with content management systems is, to put it bluntly, awful. The situation is worse for companies that want to make a content repository (a big collection of documents dropped into a file system, rather than an actual content management system) searchable. These environments have no search mechanism, however flawed, available.

CloudSearch is capable of searching structured content, such as word processing files, and unstructured content, commonly referred to as free text: unstructured collections of text, like web pages or forum posts.

Using CloudSearch is relatively straightforward once you understand how it works. The content you want to search has to be indexed: the data within the content is evaluated so that individual words can be located, and an index is built that maps each word to the documents containing it. For example, if you want to be able to search a large number of documents about zoos, you need to build an index so that in a search for the word elephant, the search software can return every document containing the word elephant.
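To make the idea of an index concrete, here is a minimal sketch of a word-to-documents index in plain Python. It only illustrates the concept and is not how CloudSearch is implemented; the sample documents and the function names build_index and search are made up for this example.

```python
import re
from collections import defaultdict


def build_index(documents):
    """Map each lowercased word to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(doc_id)
    return index


def search(index, word):
    """Return the sorted IDs of the documents that contain the word."""
    return sorted(index.get(word.lower(), set()))


# Hypothetical sample documents about zoos.
zoo_docs = {
    "doc1": "The elephant enclosure opens at nine.",
    "doc2": "Penguins are fed twice a day.",
    "doc3": "An elephant calf was born this spring.",
}

zoo_index = build_index(zoo_docs)
print(search(zoo_index, "elephant"))  # ['doc1', 'doc3']
```

Because the index is built once, up front, each search is a single dictionary lookup rather than a scan of every document.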

You upload the data you want to search into a CloudSearch domain, where the given domain name is the name of a searchable documents database. For data uploads, CloudSearch uses SDF (short for Search Data Format). Though CloudSearch can create SDF on the fly for certain types of data, such as PDF and Word files, for others you have to create the SDF documents yourself in order to upload your data. SDF documents can be formatted in either XML or JSON, two common standards for describing data collections. An SDF record is nothing more than a formatted set of key-value items describing the data you want to be able to search on.
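As a rough illustration of what "a formatted set of key-value items" can look like, the sketch below builds a small JSON batch in Python. The record keys (type, id, version, lang, fields) reflect the JSON layout of SDF as commonly documented for the early CloudSearch API and should be treated as an assumption to check against the documentation for your API version. The player data and the helper name make_sdf_add are hypothetical.

```python
import json


def make_sdf_add(doc_id, fields, version=1, lang="en"):
    """Build one 'add' record as a plain dict (key names are an assumption)."""
    return {
        "type": "add",
        "id": doc_id,
        "version": version,
        "lang": lang,
        "fields": fields,
    }


# Hypothetical records to make searchable.
players = [
    {"name": "Pat Example", "position": "goalkeeper", "games_played": 31},
    {"name": "Sam Sample", "position": "forward", "games_played": 28},
]

# One record per player, with generated IDs player_1, player_2, ...
batch = [make_sdf_add(f"player_{i}", p) for i, p in enumerate(players, start=1)]
sdf_json = json.dumps(batch, indent=2)
print(sdf_json)
```

The resulting JSON text is what you would save to a file and upload as a batch.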

After you upload the SDF documents, CloudSearch analyzes them and creates indexes of all the items you’ve indicated you want to be able to search on. For example, if you create a set of SDF documents outlining all players in a sport for a given year, you may search on the position played or the number of games played in the year. CloudSearch creates indexes on all fields you identify as searchable. Then you can execute searches against your domain on the fields you’ve identified as searchable.

You must also create access policies, which are analogous to EC2 security groups. You define the IP addresses that you want to allow access to CloudSearch, for both search access and domain administrative access. (Typically, you’d allow all IP addresses to search via CloudSearch because the most common use case is allowing visitors to a website to search information on the website, but you may restrict search access to employees of your company or a small number of partners.)
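The logic behind an IP-based access policy can be sketched with Python's standard ipaddress module. This is only a conceptual model of the allow/deny decision, not the actual CloudSearch policy format; the address ranges are documentation-only examples, and the names is_allowed, SEARCH_ALLOWED, and ADMIN_ALLOWED are hypothetical.

```python
import ipaddress


def is_allowed(client_ip, allowed_cidrs):
    """Return True if client_ip falls inside any of the allowed CIDR blocks."""
    address = ipaddress.ip_address(client_ip)
    return any(address in ipaddress.ip_network(cidr) for cidr in allowed_cidrs)


# Hypothetical policies: search is open to every IPv4 address,
# while administration is limited to a single office network.
SEARCH_ALLOWED = ["0.0.0.0/0"]
ADMIN_ALLOWED = ["203.0.113.0/24"]

print(is_allowed("198.51.100.7", SEARCH_ALLOWED))  # True
print(is_allowed("198.51.100.7", ADMIN_ALLOWED))   # False
print(is_allowed("203.0.113.42", ADMIN_ALLOWED))   # True
```

Keeping the search list and the administration list separate mirrors the two kinds of access described above.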

Though you can execute searches from the AWS management console, the most common search is conducted via the CloudSearch API or the CloudSearch CLI (command-line interface). If you’re adding search capabilities to a website, you use the API method to perform searches on your CloudSearch domain.
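From a website's point of view, a search through the API boils down to an HTTP request against the domain's search endpoint. The sketch below only builds such a request URL and sends nothing. The hostname is a placeholder, and the path layout (API version followed by /search) and the q and size parameters are assumptions based on the early CloudSearch API, so check them against the documentation for the version you use.

```python
from urllib.parse import urlencode


def build_search_url(endpoint, query, size=10, api_version="2011-02-01"):
    """Build a search request URL (path and parameter names are assumptions)."""
    params = {"q": query, "size": size}
    return f"http://{endpoint}/{api_version}/search?{urlencode(params)}"


# Placeholder endpoint; a real one is assigned when the domain is created.
SEARCH_ENDPOINT = "search-zoo.example.com"

url = build_search_url(SEARCH_ENDPOINT, "african elephant", size=5)
print(url)
# http://search-zoo.example.com/2011-02-01/search?q=african+elephant&size=5
```

Your web application would send this request with its HTTP client of choice and parse the XML or JSON response.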

CloudSearch resources

CloudSearch maintains a high performance level by keeping all indexes you’ve created within the memory of EC2 instances. Now, the obvious question is exactly how many EC2 instances the CloudSearch domain will require. This number, however, isn’t one that you control; AWS automatically calculates how many instances your search domain requires and their size. CloudSearch supports three instance sizes: Small, Large, and Extra Large. If required, CloudSearch splits your domain indexes across multiple instances in order to retain them in memory and support fast search performance.

CloudSearch scope

CloudSearch is regionally scoped, which affects where you deploy your CloudSearch domain. If the website you’re enabling with CloudSearch is in a particular region, there’s no fee for network traffic if your CloudSearch domain resides in the same region. Of course, given that CloudSearch is accessible via an AWS API, searches can be executed from anywhere on the Internet, as well as within other AWS regions.

CloudSearch cost

Here are the hourly instance prices:

✓ Small search: $.10 per hour

✓ Large search: $.39 per hour

✓ Extra Large search: $.55 per hour

And here are the data transfer prices per month:

✓ First 10TB: $.12 per gigabyte

✓ Next 40TB: $.09 per gigabyte

✓ Next 100TB: $.07 per gigabyte

The issue of traffic prices may not be significant, because search results return text documents (both XML and JSON are text-based), which do not require much network traffic to send, so your traffic charges will probably not be that high. You face incidental charges for batch uploads and re-indexing, which shouldn’t add significantly to your overall CloudSearch bill.
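To see how these numbers combine, here is a small cost estimate using only the prices listed above. It assumes that 1TB equals 1,024 gigabytes and that the transfer tiers are applied cumulatively within a month; it ignores the incidental upload and re-indexing charges. The function names are made up for this sketch.

```python
# Hourly instance prices from the list above, in dollars.
HOURLY_PRICE = {"small": 0.10, "large": 0.39, "extra_large": 0.55}

# (tier size in GB, price per GB); assumes 1 TB = 1,024 GB.
TRANSFER_TIERS = [(10 * 1024, 0.12), (40 * 1024, 0.09), (100 * 1024, 0.07)]


def transfer_cost(gigabytes):
    """Charge each tier in turn until the monthly usage is accounted for."""
    remaining = gigabytes
    total = 0.0
    for tier_size, price in TRANSFER_TIERS:
        used = min(remaining, tier_size)
        total += used * price
        remaining -= used
        if remaining <= 0:
            break
    if remaining > 0:
        raise ValueError("usage exceeds the tiers listed above")
    return total


def monthly_cost(instance_size, instance_count, hours, transfer_gb):
    """Instance hours plus data transfer for one month."""
    instances = HOURLY_PRICE[instance_size] * instance_count * hours
    return instances + transfer_cost(transfer_gb)


# One Small instance for a 30-day month (720 hours) and 50 GB of results.
print(f"${monthly_cost('small', 1, 720, 50):.2f}")  # $78.00
```

In this example the instance accounts for $72 of the $78 total, which matches the point that traffic is usually the smaller part of the bill.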
