We’ve seen how searchers behave and how they interact with search results. We’ve decided what queries we want our sites to be found for. How do search engines compile these lists?
The Evolution of Search Engines
In the early days of the Web, directories were built to help users navigate to various Web sites. Generally, these directories were created by hand: people categorized Web sites so users could browse to what they wanted. As the Web grew larger, this effort became more difficult. "Web spiders" were created to "crawl" Web sites. Web spiders, also known as robots, are computer programs that follow links from known Web sites to other Web sites. These robots access those pages, download their contents (into a storage mechanism generically referred to as an "index"), and add the links found on those pages to their list for later crawling.
While Web crawlers gave the early search engines a larger list of sites than manual collection could, they couldn't perform the other manual tasks: figuring out what the pages were about and ranking the best ones first. The search engines began building computer programs to automate these tasks as well. For instance, a program could catalog all the words on a page to help figure out what that page was about.
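The crawl loop described above can be sketched in a few lines. This is a minimal illustration, not a production crawler: it uses an in-memory "web" (a dictionary mapping each URL to the links found on its page) in place of real HTTP fetches, and the URLs are hypothetical.

```python
from collections import deque

def crawl(seed_urls, fetch_links):
    """Breadth-first crawl: follow links from known pages to discover new ones.

    `fetch_links(url)` stands in for downloading a page and extracting its
    links; the crawler records each visited URL in a simple index and queues
    newly discovered links for later crawling.
    """
    frontier = deque(seed_urls)   # pages queued for later crawling
    index = set()                 # pages already downloaded
    while frontier:
        url = frontier.popleft()
        if url in index:
            continue              # already crawled this page
        index.add(url)
        for link in fetch_links(url):
            if link not in index:
                frontier.append(link)
    return index

# A toy "web": each page maps to the links it contains.
toy_web = {
    "a.com": ["b.com", "c.com"],
    "b.com": ["c.com"],
    "c.com": ["a.com"],
    "d.com": [],   # no known page links here, so it is never discovered
}

crawled = crawl(["a.com"], lambda url: toy_web.get(url, []))
```

Note that `d.com` is never crawled: a page no one links to stays invisible to a link-following crawler, which is why the discovery process discussed later in this chapter matters so much.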
The Introduction of PageRank
Google's "PageRank" algorithm, introduced in 1998, was a big step forward in automatically cataloging and ranking Web sites.1 This algorithm used data from the links on the Web to determine what pages were about and which pages were more popular and useful. Links acted like votes for a site, and the text of those links was used for cataloging.
For instance, consider two Web pages. One is at the address www.myusedcars.com, and the other is at the address www.yourusedcars.com. Both contain text about cars and have the title "Used Cars." Five Web sites link to www.myusedcars.com: three use the text "site about used cars" and two use the text "lots of used Fords" in the links. Ten sites link to www.yourusedcars.com: five use the text "site about used cars" and the other five use the text "lots of used Hondas" in the links (see Figure 5.1).
Google's PageRank algorithm would use the information from the links to determine that both sites were about used cars. But when someone searched for "used cars," Google would show www.yourusedcars.com first because it had twice as many links to it (10 versus 5). In addition, when someone searched for "used Fords," Google would show www.myusedcars.com, and when someone searched for "used Hondas," Google would show www.yourusedcars.com.
While Google patented the specific PageRank algorithm, this general method of categorizing sites became the standard for search engines and the basis of how search engine technology has evolved.
While people still refer to "PageRank" as a major factor in ranking well in search engines, this reference is now simply shorthand for the hundreds of signals that search engines use to compile a list of the most relevant results possible for a query.
The Current State of Search Engines
As the Web has evolved, a vast number of search engines have launched. They generally fall into one of the following categories:
- Human-edited directories—These are either constructed entirely by hand (as Yahoo! was originally) or constructed via a Web crawl or site owner submission and then categorized and ranked by hand. Examples include the Open Directory Project (dmoz.org), About.com, and, more recently, Mahalo.com.
- Automated search engines—These are built algorithmically via Web crawlers and cataloged and ranked algorithmically as well. Since these types of search engines (such as Google) are the kind used by most searchers today, performing well in these is what this book focuses on.
- Meta search engines (aggregators)—These search engines generally use results from other search engines and present them in a different way either by presenting the results from multiple other engines together (such as dogpile.com) or with a different visual look.
How Search Engines Work
The major search engines that account for most market share today are the automated type: Google, Yahoo!, and Microsoft Bing2 (Yahoo! and Microsoft have reached an agreement under which Yahoo! will replace its own search technology with Bing, but this hasn't happened as of late 20093). All three, as well as the smaller search engines that operate their own technology, share a similar overall infrastructure:
- Web crawlers (also known as "spiders" or "robots") that crawl the Web. These crawlers follow links to discover the pages on the Web.
- Extraction processes that gather information from those pages (such as textual content, metadata, and links).
- Index storage that stores the content from Web pages. Content is generally stored using word-based keys, similar to the index in a book. When you look up a word in the index of a book, you learn the page number that word is on. Similarly, with a search engine index, the search engine can look up a word that someone is searching for and find out all the Web pages associated with that word.
- Results scoring that determines which pages are the most relevant for each search. When someone does a search (called a "query") and the search engine checks the index for all the Web pages associated with that search, the search engine needs a way to rank those Web pages in an order that is useful for the searcher. Search engines use a number of factors in scoring, and these factors are adjusted all the time based on new algorithms, tests, and other criteria. Search engines keep the details of these scoring factors secret. Once the search engine compiles and ranks the pages that are relevant for the query, it displays them in a list called the "organic results." The ranking process happens at the time of the query.
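The book-index analogy for index storage can be made concrete with a minimal sketch. This hypothetical example builds an inverted index (word to set of pages) from a handful of made-up pages, then looks up the pages containing every word of a query; real indices also store positions, metadata, and much more.

```python
from collections import defaultdict

def build_index(pages):
    """Build an inverted index: each word maps to the set of pages that
    contain it, like the index at the back of a book mapping words to
    page numbers."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

def lookup(query, index):
    """Return the pages that contain every word of the query."""
    word_sets = [index.get(word, set()) for word in query.lower().split()]
    return set.intersection(*word_sets) if word_sets else set()

# Hypothetical pages and their textual content.
pages = {
    "a.com/fords":  "used fords for sale",
    "b.com/hondas": "used hondas for sale",
    "c.com/news":   "car industry news",
}
index = build_index(pages)
```

With this index, `lookup("used fords", index)` finds only the Fords page, while `lookup("used", index)` finds both car-sales pages; results scoring then has to decide what order to show them in.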
The Difference between Organic and Paid Results
Search engines primarily make money via advertising called "paid search ads." These ads are also called pay per click (PPC) ads, because advertisers generally pay a certain amount each time a searcher clicks on them. For instance, Google's advertising program is called AdWords. When someone does a search, the search engine shows two lists of results: organic and paid (see Figure 5.2).
The organic results are based on the crawling, indexing, and results scoring infrastructure, and search engines strive to have the most comprehensive, relevant results they can to provide an optimal user experience. Most search engines (such as Google) don't accept payment for placement in the organic results and make no guarantees about whether a site is indexed or where it will rank. The paid search ads appear beside the organic results, and while search engines also attempt to show relevant, useful results in their ads, companies pay for placement. Paid search ads are the primary source of income for search engines.
How Search Engines Rank Results
As Google noted in mid-2009:
"Ranking is hard, much harder than most people realize. One reason for this is that languages are inherently ambiguous, and documents do not follow any set of rules. There are really no standards for how to convey information, so we need to be able to understand all Web pages, written by anyone, for any reason. And that's just half of the problem. We also need to understand the queries people pose, which are on average fewer than three words, and map them to our understanding of all documents. Not to mention that different people have different needs. And we have to do all of that in a few milliseconds . . . By some estimates, more than one thousand programmer/scientist years have gone directly into their [search engines'] development, and the rate of innovation has not slowed down."
"The life span of a Google query normally lasts less than half a second, yet involves a number of different steps that must be completed before results can be delivered to a person seeking information."
Search engines use a number of factors to determine how to rank content. At a high level, a search engine associates each piece of content with the set of keywords it determines that content is about. When a searcher performs a query, the search engine retrieves all of the pages associated with that query and orders them by relevance and usefulness, based on signals such as the number of relevant external links pointing to those pages, the anchor text of those links, and calculations about intent (for instance, the search engine will try to show more e-commerce sites if the searcher intends to purchase something). It then ensures the resulting set of pages has sufficient diversity: no duplicates, and not only sites of a single type.
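The retrieve-score-diversify pipeline just described can be sketched with the diversity step made explicit. This is an assumption-laden illustration: the URLs and scores are invented, and "diversity" is reduced here to capping results per domain, one simple stand-in for the richer checks engines actually perform.

```python
def diversify(scored_results, max_per_domain=1):
    """Order results by relevance score, then drop extra results from any one
    domain so the final list isn't dominated by a single site."""
    per_domain = {}   # how many results each domain has contributed so far
    final = []
    for url, score in sorted(scored_results, key=lambda r: r[1], reverse=True):
        domain = url.split("/")[0]
        if per_domain.get(domain, 0) < max_per_domain:
            per_domain[domain] = per_domain.get(domain, 0) + 1
            final.append(url)
    return final

# Hypothetical retrieved pages with relevance scores already computed.
results = [
    ("cars.com/a",    0.9),
    ("cars.com/b",    0.8),  # same domain as the top result, so it is dropped
    ("dealers.com/x", 0.7),
    ("reviews.com/y", 0.6),
]
```

Even though cars.com/b outscores the other sites, the cap keeps the result list from showing the same site twice, mirroring the diversity goal described above.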
How Search Engines Are Using Data About Searcher Intent
Over time, as search engines have gotten more confident about understanding intent, they’ve become more aggressive about displaying what they think the searchers want. Overall, this is a good thing because they’re providing what searchers are looking for even more quickly, without requiring searchers to make extra choices. Most searchers aren’t power users and just want to type the words into a search box and get back the results they’re looking for with as little work as possible (remember bounded rationality?).
Search Engine Suggestions and Prompts
Search engines incorporate many methods for integrating browser navigation into the search experience to enable searchers to drill into what they're really searching for. These search prompts are typically based on past searcher behavior. For instance, Google lists related searches, Bing provides search categories, and Yahoo! includes an "explore related topics" pane.
Paying attention to what the search engines suggest for topics important to your business can both give you insight into what your potential customers are really looking for and help you understand how your company should be represented in search results.
Search engines customize results based on individual searcher behavior. They continue to ramp up their efforts over time to understand not only searcher intent in aggregate, but also intent for specific searchers.
Search engines need to provide relevant results to searchers (so that they’ll keep coming back to that search engine and be an audience for its advertising); and because searchers provide few clues about what they’re really looking for, we’ve seen that search engines use aggregate knowledge to discern meaning. But searchers are individuals, not the sum of their aggregated searches, so extrapolating the intent of one search based on the actions of all the searches that came before can rarely provide results as relevant as those tailored to the individual searcher.
Over the years, search engines have tried various customization options to enable searchers to take an active role in improving their results—everything from sliders that adjust ranking factors to filters for specific file types, date ranges, and types of content.
But even though all of these could indeed improve results, for the most part, searchers have overwhelmingly ignored them. One reason for this goes back, again, to the idea of "bounded rationality."
Usability expert Jakob Nielsen describes this tendency as follows:
"The basic information foraging theory, which is, I think, the one theory that basically explains why the Web is the way it is, says that people want to expend minimal effort to gain their benefits. And this is an evolutionary point that has come about because the people, or the creatures who don't exert themselves, are the ones most likely to survive when there are bad times or a crisis of some kind. So people are inherently lazy and don't want to exert themselves. Picking from a set of choices is one of the least effortful interaction styles, which is why this point and click interaction in general seems to work very well . . . [w]hereas tweaking sliders, operating pull down menus and all that stuff—that is just more work."
So the search engines have begun inferring customizations for searchers, rather than requiring them to explicitly set them. Search engines use a number of methods for personalizing search results. For instance, the engines may take into account what sites you’ve clicked on before and the ones you never click on, no matter how highly they’re ranked. Google also takes the previous search into account. If you search for [Hawaii vacations] and then search for [flights], Google may show you flight information for Hawaii, even though the word ‘‘Hawaii’’ wasn’t in the second search. If you have the Google Toolbar installed, Google will even track the sites you visit independently of those you reached from Google and will use those to shape your search results.
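The two inferred customizations just mentioned—carrying context from the previous query and learning from past clicks—can be sketched as follows. Everything here is hypothetical: the function names, the flat score boost, and the idea that click history maps cleanly to domains are simplifying assumptions, not Google's actual mechanism.

```python
def expand_query(query, prev_query):
    """Carry words from the previous search into the current one, so a search
    for [flights] right after [Hawaii vacations] behaves like a
    Hawaii-related flight search."""
    return sorted(set(query.lower().split()) | set(prev_query.lower().split()))

def personalize(results, clicked_domains, boost=0.2):
    """Re-rank results, nudging up sites this searcher has clicked before.

    `results` is a list of (url, base_relevance_score) pairs."""
    def score(item):
        url, base = item
        domain = url.split("/")[0]
        return base + (boost if domain in clicked_domains else 0.0)
    return [url for url, _ in sorted(results, key=score, reverse=True)]
```

For example, a searcher who has repeatedly clicked results from a favorite shop would see that shop nudged above a slightly higher-scoring site they always skip.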
And while the current influence of personalization at Google is subtle, these efforts continue to advance. Google’s Marissa Mayer noted:
"The actual implementation of personalized search is that as many as two pages of content, that are personalized to you, could be lifted onto the first page and I believe they never displace the first result, because that's a level of relevance that we feel comfortable with. So right now, at least eight of the results on your first page will be generic, vanilla Google results for that query and only up to two of them will be results from the personalized algorithm. I think the other thing to remember is, even when personalization happens and lifts those two results onto the page, for most users it happens one out of every five times.
I think that overall, we really feel that personalized search is something that holds a lot of promise, and we're not exactly sure of the signals that will yield the best results. We know that search history, your clicks and your searches together provide a really rich set of signals, but it's possible that some of the other data that Google gathers could also be useful. It's a matter of understanding how."
Frequent Google spokesperson Matt Cutts has also intimated that personalization is likely to increase:
"The idea of a monolithic set of search results for a generic term will probably start to fade away, and you already see people expect that if I do a search and somebody else does the search, they can get slightly different answers. I expect that over time people will expect that more and more."
In December 2009, Google expanded how it personalizes results, causing search expert Danny Sullivan to comment that "The days of 'normal' search results that everyone sees are now over. Personalized results are the 'new normal,' and the change is going to shift the search world and society in general in unpredictable ways."9 With this change, Google now personalizes everyone's results, not just those of searchers who are logged in. If the searcher isn't logged in, Google uses the search history from that computer to personalize results. Why is this such a radical shift in the search world? Because there's no longer a base set of results. Now, more than ever, rankings reports don't have much meaning.
Google says they won't skew things so much that searching narrows our view of the online world. Google product manager Johanna Wright explained, "We want diversity of results. This is something we talk about a lot internally and believe in. We want there to be a variety of sources and opinions in the Google results. We want them in personalized search to be skewed to the user, but we don't want that to mean the rest of the web is unavailable to them."
Anatomy of a Search Engine Result
We looked at how the search results are displayed and how important it is to ensure your site has a compelling listing. Let’s look a little more closely.
A Search Engine Results Page (SERP) typically contains 10 results for a query. Each search result contains the following components (see Figure 5.3):
- Title: This is generally the same as the title of the page.
- Snippet: This is the description beneath the title. It is your main opportunity to provide a marketing message that entices searchers to click through to your site.
- URL: The URL of the page. Any keywords from the query are bolded.
Search results can include a number of other components as well, such as (see Figure 5.4):
- Navigational links to sections within the site.
- The date the page was published.
- Links to forum content.
- Rating and review information.
The Evolution of Organic Search Results: Beyond Web Pages
Originally, search engines indexed text on Web pages and matched text-based searches to that content. As the Web evolved, search engines began looking at ways to catalog the new types of content on Web pages, such as video and images. They created "vertical" indices that enabled searchers to look specifically for images, for instance. However, search engines found that few searchers noticed these vertical indices and that most searchers looked for everything from the main search page.
In 2007, Google introduced "universal search," a new way of compiling results that blended content from all of its indices, including textual Web content, images, videos, news, and product listings.10 The other major search engines soon implemented their own versions of "blended search," and today it's common to see all of these content types appearing together in search results (see Figure 5.5).
This blending provides opportunity for site owners. Now, in addition to listing your textual content, you can showcase your multimedia content as well.
When searchers see blended results, they view the page in a slightly different way than when the results are text only. Look at this eye tracking heat map from Enquiro (see Figure 5.6).
Most searchers looked at the image and, in fact, more looked to the results below the image (d) than to those above it (b).
The appearance of non-textual content in the search results continues to rise. Marissa Mayer, VP of Search Products and Experience for Google, noted:
"I think there's a ton of challenges, because in my view, search is in its infancy, and we're just getting started. I think the most pressing, immediate need as far as the search interface is to break the paradigm of the expectation of 'You give us a keyword, and we give you 10 URLs.'
We need to look at results pages that aren't just 10 standard URLs that are laid out in a very linear format. Sometimes the best answer is a video, sometimes the best answer will be a photo, and sometimes the best answer will be a set of extracted facts. If I type in general demographic statistics about China, it'd be great if I got a set of facts that had been parsed off of [Web pages] and even aggregated and cross-validated across a result set."
And in fact, Google has begun providing more answers in its search results, in a way very similar to what Marissa Mayer described. In January 2010, Google began using its Google Squared service12 to extract facts from Web pages and display them in the search results.13 For instance, with the query [how high is the space needle], Google no longer simply lists the pages that might contain this information but surfaces the answer directly in the search results (see Figure 5.7).
The next section of this chapter discusses content beyond Web pages that has become standard in search results.
Blended Search: Images
One common type of blended result is images. To see the value of blended image search results in action, let’s return to my trip to Bologna. I had heard that I should visit the church of San Luca, but I knew nothing about it. So, I went over to Google and typed in [San Luca Bologna], hoping to find a description and perhaps a map. But Google (likely based on the clicks of millions who had come before me) knew my intent even better than I did, and showed me not only descriptive details and a map, but also images (see Figure 5.8).
Even though I didn't know I was looking for them, the images caught my attention first (I had been specifically thinking I wanted a map!). I clicked on the image at the far right and came across a Telegraph article about Bologna, which in turn led me to some great information about local hotels, restaurants, and shops (see Figure 5.9).
And so, blended search and PR have joined together to bring more visitors to the businesses of Italy. (Later in this book, you'll find more information on how PR can work together with search for better visibility.)
From there, I clicked ‘‘Back to image results’’ and then to another image found on virtualtourist.com (see Figure 5.10).
I continued to explore "off the beaten path" tips, and just like that, Virtual Tourist gained a user.
Of course, tourism isn’t the only industry that can gain search acquisition from images. One of the bigger industries image searches can lead customers to is e-commerce.
Let’s say, for instance, that I’m looking for blue and white plates. The easiest way to research them may be to simply do a search, which gives me the results in Figure 5.11.
I can start my shopping on NexTag or Amazon, but if I'm not sure yet exactly which blue and white plates I want to buy, the image results seem like a great place to start. If I dive into the full list of images, I can not only see all of the different choices (critical for a purchase such as this one), but I can also choose between auctions, set replacements, and new retail.
As you might imagine, the search conversion workflow becomes critical here. If the site design doesn't account for a visitor arriving from an image search, who may land directly on the page that contains the image, vital information about the site, its navigation, and how to order may not be present. Many visitors may be unsure how to proceed and will go back to the search results.
The Drayton Hall site provides a great experience for a visitor coming from an image search. The site, its value proposition, the overall navigation, and the item price and path to purchase are readily apparent (see Figure 5.12).
On the other hand, the Yoko Webster page provides no information at all (see Figure 5.13).
In addition to blended search opportunities for images, searchers conduct more than a billion image-specific searches a month, indicating that image optimization definitely shouldn't be overlooked.
How can you best take advantage of this? Make sure your site includes images where they’re useful, particularly if you’re selling products. And ensure your Web development team implements images in a search-friendly way.
Blended Search: Video
Video provides another big opportunity for search acquisition. Not only do videos often appear in blended search results, but YouTube is now the second largest search engine (after Google). You can now submit product videos to Google Product Search15 as well, so clearly the video opportunity continues to increase. And views of online video increased 41 percent from August 2008 to August 2009,16 so potential customer behavior continues to evolve from primarily consuming text online to being comfortable consuming video as well. We conduct more than 2.6 billion searches a month on video search engines.17 Eighty-five percent of online Americans view videos online each month, for a total of 26 billion video views.
People love videos. Tutorials, for instance, provide a great opportunity for video as a complement to textual instructions.
Assess your site for video opportunities, create a YouTube channel for your videos, and make sure that as with web pages, you use descriptive text that uses the language of your target audience (see Figure 5.14).
How Do Universal Results Impact Searcher Behavior?
Enquiro’s eye tracking studies found that when an image or video is present in the top half of the search results, the searcher seems to start the page scan there, rather than the top left.19 (See Figure 5.15.)
In a text-only set of results, searchers begin in the top left corner, scan right, and then down, and they tend to chunk the page into result sets of 3–4 results and evaluate those one by one. If searchers don’t find the answer in the first chunk, they move on to the next one.
In results that contain multimedia such as video and images, searchers started with the multimedia result as the first chunk and then scanned above it for the second chunk and below it for the third chunk (see Figure 5.16).
Similarly, when a Local Onebox is on the page, searchers tend to evaluate the local listings as a separate chunk.
Clearly, optimizing for images and video is useful not only because it provides additional ranking opportunities, but also because it enables you to stand out to searchers over the competition.
A notable influence on personalized results is regional location. Location-tailored results can be a substantial part of the relevance calculation. At a basic level, all results are personalized based on location. A searcher in the United States will get more results from U.S./English-language Web sites, whereas a searcher in Italy will get more results from Italian Web sites written in Italian. In addition, certain queries have additional location-based relevance factors. For instance, someone searching for [pizza] in Seattle is likely to see Seattle-based restaurant results, and someone searching for [pizza] in Boston will likely see Boston-based restaurant results.
Regional relevance factors are particularly important to businesses that are local in nature (such as pizza restaurants) and those that serve particular countries (British Telecom would like its site surfaced to British searchers; HP would like its Spanish site served to searchers in Spain). These factors get trickier as the intent becomes more complicated. What about U.S. searchers who are looking for vacation accommodations in Croatia? Or a French speaker on vacation in Italy who is looking for information about trains in Poland? How should a business target the European Union? Or how should a business target researchers in Mexico, rather than in Spain, with Spanish content?
Getting Technical: How It All Comes Together
Now, for the technical part. But don’t worry! I’m not asking you to go make changes to the servers yourself!
Before a search engine can evaluate the content on your site to determine if it’s relevant for a searcher’s query, the engine has to know the page exists and extract that content from the site for analysis.
- Discovering the pages: Search engines find out about pages on the Web generally by following links from other sites on the Web and by following a site’s internal links. The most important thing to remember about the discovery process is that you should build a great site that makes others want to link to it and that you should have a comprehensive site navigation structure. Of course, you’d want both of these things on your site even if search engines didn’t exist.
- Crawling the pages: Once a search engine such as Google learns about pages on the Web, it uses a "bot" to crawl those pages. Your goal is likely to have your entire site crawled, which can be hindered by crawling inefficiencies and by infrastructure issues that make URLs inaccessible to the bots.
- Extracting content: Once a crawler has accessed a page, it has to be able to extract the content from that page and store it. As with crawling, a number of obstacles may keep a search engine from extracting content from your pages. Common issues include all-Flash sites, pages full of multimedia such as videos and images with no textual markup, and sites built on technologies such as AJAX that search engines can have trouble parsing. A good rule of thumb is that if you develop the site using progressive enhancement techniques20 that make it accessible to visitors with disabilities21 (who may be using devices such as screen readers) and on mobile devices, search engines can generally access the content as well.
Once the crawler has accessed the pages and extracted the content, the search engines make a decision about whether to store that content. They’ll generally not store pages if they determine that those pages are mostly empty, are duplicates of pages they’ve already stored, or have little value (for instance, the pages may be an aggregation of content that exists elsewhere on the Web). One common cause of duplicate content is syndication.
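One simple way to picture the duplicate check described above is fingerprinting: normalize a page's text and hash it, so two pages with the same words (such as a verbatim syndicated copy) collide. This is only a sketch under that assumption; real engines use fuzzier techniques that also catch near-duplicates.

```python
import hashlib

def content_fingerprint(text):
    """Hash of the page's normalized text: lowercased, with whitespace
    collapsed. Identical fingerprints flag exact duplicates, such as
    verbatim syndicated copies."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()

def keep_unique(pages):
    """Store only the first page seen for each content fingerprint.

    `pages` is a list of (url, text) pairs in crawl order."""
    seen, kept = set(), []
    for url, text in pages:
        fingerprint = content_fingerprint(text)
        if fingerprint not in seen:
            seen.add(fingerprint)
            kept.append(url)
    return kept

# Hypothetical crawl: the second page is a syndicated copy of the first.
pages = [
    ("original.com/article", "Great article about cars."),
    ("syndicated.com/copy",  "great   ARTICLE about cars."),
    ("other.com/page",       "Something different entirely."),
]
```

Note that `keep_unique` keeps whichever copy it crawled first, which hints at why syndicated content is risky: the engine may keep the syndicating site's copy instead of yours.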
Crawling and indexing issues tend to be technical in nature. It can be vital for Web developers to understand these parts of the process to ensure the site’s technical infrastructure isn’t blocking the search engine bots. Ranking, on the other hand, is primarily dependent on how relevant a page is for the given query. If you build searcher personas and incorporate search data into your product development processes, you are already taking the important steps toward ensuring your pages are as relevant as possible to the searchers you are targeting. As noted above, relevant, authoritative links also help search engines understand the value of your site’s pages, and those come naturally with useful content and a successful marketing strategy.
According to leaked 2008 Google quality guidelines, utility (how helpful the page is for the searcher based on intent) "is the most important aspect of search engine quality."
Once you start to think about personalization, differing results based on searcher location, and blended search results that may cause the searcher’s attention to focus on a result other than the top-ranked site, you start to realize why ranking reports don’t provide much actionable insight.