How to Turn Business Strategy into Tech Speak
If the search engine bots can’t crawl and extract content from a site’s pages, that site has little chance of ranking well in search engines for relevant queries. In addition, changing a site’s content management system or server setup, merging sites after an acquisition, building micro sites and other common activities can greatly impact search acquisition.
Web developers should understand the principles of searchability and build them into the infrastructure, and they should have a set of best practices for modifying that infrastructure and troubleshooting problems. This chapter provides an overview of the technical issues involved with search acquisition. (You can find more detailed discussion at marketingintheageofgoogle.com.) You can give this chapter directly to your Web developers, or, if you're interested in the technical details, use it to better understand the issues so you can have productive conversations with the development team about how best to build search best practices into the development process.
In order for search best practices to be successfully implemented in Web development, the development team needs executive support.
This means the developers need to be given the training to expand their core skill set into understanding how to make Web infrastructure searchable, they need time to test searchability in addition to functionality, and they need to be provided context about why they are being asked to make changes. Web developers know how to make sites functional and might be suspicious if you ask them to make changes that don't seem to improve functionality. But if you explain that while both 301 and 302 redirects transfer the visitor from one page to another, only a 301 tells search engines to index the new page instead of the old one, you'll get the developers on your side much more quickly.
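As a minimal illustration (assuming an Apache server with mod_alias enabled; the paths and domain here are placeholders), the difference is a single status code:

```apache
# Permanent move: search engines index the new URL in place of the old one.
Redirect 301 /old-page.html http://www.example.com/new-page.html

# Temporary move: search engines keep the old URL in the index.
Redirect 302 /holiday-promo.html http://www.example.com/current-promo.html
```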
The crawling process consists of several components. First, search engines have to learn about the pages. Then, they need to have enough time during the period they’ve allocated to the site to crawl those pages. Finally, they have to be able to technically access the pages.
The first step in crawling is discovery. How does the engine find out about pages on the Web? Generally, this happens in one of the following ways:
- By finding links to the pages from other sites on the Web.
- By finding links to the pages from within your site.
- From an XML Site Map.
What's an XML Site Map?1 It's a file in a particular format that contains a list of all the URLs on a site. Google, Yahoo!, and Microsoft Bing use this file to augment their discovery processes. Submitting a Site Map can help search engines know more comprehensively about your site. See sitemaps.org for more information.
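Here's a minimal example of the format (the URLs are placeholders; the protocol also supports optional tags such as <changefreq> and <priority>):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/</loc>
    <lastmod>2010-01-15</lastmod>
  </url>
  <url>
    <loc>http://www.example.com/shoes.php</loc>
  </url>
</urlset>
```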
Search engine bots don't have the resources to crawl every page on the Web, and they are also mindful of not over-crawling a site and causing an undue burden on the server. For these reasons, search engine bots only spend a limited period crawling each site. Factors that influence a more comprehensive crawl include:
- Fast server response times. If the server is slow to respond to requests, the search engine bots may slow down their crawl to ensure they aren’t overloading the server.
- Fast page load times. The faster the pages load, the more of them search engines will likely be able to crawl during their allocated crawl period. You can monitor page load times for Google in Google Webmaster Tools. If you see a spike in page load times when you haven't made significant changes to your site (such as adding substantial multimedia content), there may be a problem with the server (see Figure 7.1).
- Unique content. If much of the content appears to be duplicate or empty (this can happen, for instance, with directory sites that have regions, categories, or entries that don’t yet have content), the search engine bot may stop crawling.
- Accessible URLs. URLs may be inaccessible to crawlers for a number of reasons, such as redirects that inadvertently create an infinite loop (URLs that redirect back and forth to each other) or a long redirect chain.2 You can see a report of any URLs Google's bot couldn't access via Google Webmaster Central.
- Crawl efficiency. If the crawler spends all its time on pages that you don't need to have indexed (such as registration pages and contact forms), less time will be available to crawl the pages you do want indexed. You can block content you don't want indexed using robots.txt (see the sketch after this list).
- Server efficiency. You can reduce the resources the search engine bot consumes per page by enabling compression and If-Modified-Since support on the server. Compression serves pages to the bot in a compressed format, and If-Modified-Since support returns a 304 (Not Modified) response to the bot, rather than the entire contents of the page, when it requests a page that hasn't changed since the last request. (A sample header exchange appears after this list.)
- Bot speed control. You can slow down the crawl of Yahoo!'s and Microsoft's bots by using the crawl-delay setting in robots.txt (also shown in the sketch after this list).6 If either of these bots seems to be crawling your site particularly slowly, check the robots.txt file to see if this entry exists. You can slow down Google's crawl of the site in Google Webmaster Tools. If Google is limiting its crawl because it's not sure if your server can handle a higher load, Google Webmaster Tools will present a "faster" option that you can specify as well (see Figure 7.2).
- Canonicalization. Canonicalization is the process of consolidating all duplicate URLs to one original, or "canonical," version.7 If multiple URLs exist that all lead to the same page, then search engine bots may spend a considerable amount of time crawling the same page over and over via different URLs and not have time to get to unique pages. A number of methods exist for canonicalization of URLs.8 The method to use depends on the canonicalization issue. Common issues include:
- The www and non-www version of the site both resolve. Ideally one version should redirect to the other.9 For instance, when someone types in mysite.com, you can set the server to redirect to www.mysite.com (a sample rewrite rule appears after this list). Without this redirect, an entire second copy of your site exists that search engines will try to crawl.
- The URL structure has changed so content exists on both the old pages and the new pages. In this situation, a redirect from the old pages to the new is generally the best way to go, particularly since it helps visitors to the old pages end up in the right place.
- The URL structure generates infinite URLs. With some URL implementations, any number of URLs may bring up the same page. For instance, on an e-commerce site, a product listing page may be available in different sort orders (ranked by lowest price, highest rated, etc.), but the content in each case is the same (just ordered differently). For instance, mystore.com/shoes.php?sort=lowest, mystore.com/shoes.php?sort=best, and mystore.com/shoes.php?sort=newest might bring up the same list of products. Another common way this happens is if the system is set up so that the marketing department or ad agency can append tracking codes to the URLs to keep track of marketing campaigns. For instance, the marketing department might send information about the new shoes page out in an e-mail newsletter with the URL mystore.com/shoes.php?source=email and might let bloggers know about the new shoes page, hoping they'll blog about it with the URL mystore.com/shoes.php?source=blog. Both of these URLs bring up the same page. Session IDs appended to URLs can also cause infinite URL issues. If possible, store session information in cookies (and make sure those visitors who don't have cookie support—such as search engine bots—can access the site).
When a crawler detects that a page can load from infinite parameters, it may stop crawling that site to avoid being caught in what's known as a "spider trap." Another problem with this type of structure is that since the crawler has limited resources to spend on a site, time spent crawling the same page over and over from different URLs means that less time is available for crawling unique pages. In this situation, the rel="canonical" link element10 or the Google Webmaster Tools parameter handling tool11 are good options (see the canonical example after this list).
- Pages are blocked by the robots exclusion protocol.12 The robots exclusion protocol enables you to block search engines from crawling parts of your site, or all of it, via either a robots.txt file located at the root of your site or a meta tag in the source code of your pages (examples of both follow this list). Many valid uses of this protocol exist, but it's easy to mistakenly block search engine bots from pages you do want to be indexed. If you find that your site isn't indexed as well as you'd like, check that the bots aren't accidentally being blocked out.
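To make the crawl-efficiency and bot-speed-control points above concrete, here's a sketch of a robots.txt file (the blocked paths are hypothetical; adjust them to your own site structure):

```
# robots.txt, placed at the root of the site

# Keep all bots out of pages that don't need to be indexed.
User-agent: *
Disallow: /register/
Disallow: /contact/

# Crawl-delay (seconds between requests) is honored by Yahoo!'s and
# Microsoft's bots, but not by Google's; check for this entry if those
# bots seem to be crawling slowly.
User-agent: Slurp
Crawl-delay: 5
```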
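The server-efficiency techniques look like this on the wire. A bot that supports conditional requests sends an If-Modified-Since header, and the server can answer with a tiny 304 instead of the full page (headers abbreviated; the host and date are illustrative):

```
GET /shoes.php HTTP/1.1
Host: www.mystore.com
Accept-Encoding: gzip
If-Modified-Since: Sat, 06 Feb 2010 10:00:00 GMT

HTTP/1.1 304 Not Modified
```

When the page has changed, the server instead responds with a 200 and, with compression enabled, a Content-Encoding: gzip header so the body transfers in compressed form.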
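One common way to implement the www canonicalization redirect (a sketch assuming Apache with mod_rewrite; mysite.com is the placeholder domain from the example above):

```apache
RewriteEngine On
# Send any request for mysite.com to the same path on www.mysite.com,
# using a 301 so search engines consolidate on one version.
RewriteCond %{HTTP_HOST} ^mysite\.com$ [NC]
RewriteRule ^(.*)$ http://www.mysite.com/$1 [R=301,L]
```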
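The canonical link element is a single line in the <head> of each duplicate URL. Using the sort-order example above, every variant of the shoes page points search engines at one canonical version:

```html
<!-- In the <head> of mystore.com/shoes.php?sort=lowest, ?sort=best, etc. -->
<link rel="canonical" href="http://mystore.com/shoes.php" />
```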
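And the page-level alternative to robots.txt blocking is the robots meta tag; a noindex value tells engines not to store the page:

```html
<!-- In the <head> of a page you don't want in the index: -->
<meta name="robots" content="noindex" />
```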
Once a search engine bot has crawled a page, it attempts to store the contents of that page in the search engine index. Common reasons a search engine may not store the contents of a page include:
- The content is locked behind registration. If you require registration to view your content, search engine bots can’t access it. A number of options exist for balancing search acquisition needs and registration requirements. For instance, you can provide abstracts of content outside of registration or you can participate in Google’s First Click Free program.13 This is discussed further at marketingintheageofgoogle.com.
- The content is hidden in Flash or Silverlight. Search engines have gotten better at crawling Flash pages, but a number of problems remain. You can find a list of resources on making Flash accessible to search engines at marketingintheageofgoogle.com.
- Little extractable text is available. If the pages are full of video and images, search engines have little text with which to understand what those pages are about. At their core, we’re still dealing with text-based search engines that return pages based on a searcher’s text-based query. There are a number of ways you can ensure search engines can have some information about multimedia on your pages, such as images and videos:
- Images14—Major search engines such as Google continue to be, at their cores, text based. They can't understand images without textual content. To ensure your site's images can be properly indexed and ranked by search engines, use descriptive ALT text, make image filenames descriptive (for instance, yellow-goldfish.jpg rather than image123.jpg), use descriptive text around the image (in captions, headings, and titles), and use high-resolution, high-quality images whenever possible (see the markup example after this list). Be cautious about using images for navigation and avoid putting text into images.
- Video—YouTube is the second largest search engine in the United States.15 In August 2009, 40 percent of all online videos viewed in the United States were seen on YouTube. Your best bet for ensuring your video can be found may be simply to host it on YouTube. If you host it elsewhere, ensure the host is feeding Google a Video Site Map (sketched after this list).16 You should also provide transcripts, if possible, and use descriptive headings and descriptions for your videos.
- Multiple sets of content associated with a single URL. If the site is set up so that the URL doesn't change when the content changes (rather than serving different versions of the content at different URLs), search engines will only see one version of that URL. This often happens when the page is dynamically generated based on a visitor's location, for instance. A better strategy is to redirect to a separate URL for that regional content. For instance, local.com has a dynamically generated home page with local business information. Because the URL is the same no matter what content is loaded into the page, Google has indexed this URL with content from Mountain View, CA (where many Google bot computers crawl from) (see Figure 7.3).
Local.com could avoid this issue by redirecting the visitor to a URL such as local.com/mountain-view-ca or mountain-view-ca.local.com and displaying the local content there, as sketched below.
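Pulling the image guidelines above together, the goldfish example might be marked up like this (the caption markup is just one reasonable approach):

```html
<!-- A descriptive filename plus descriptive ALT text gives text-based
     engines something to index; nearby caption text adds context. -->
<img src="/images/yellow-goldfish.jpg"
     alt="Yellow goldfish swimming in a planted aquarium" />
<p class="caption">A yellow goldfish in a planted aquarium</p>
```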
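If you host video yourself, a Video Site Map entry looks roughly like this abbreviated sketch (placeholder URLs; see Google's Video Site Map documentation for the full set of required and optional tags):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:video="http://www.google.com/schemas/sitemap-video/1.1">
  <url>
    <loc>http://www.example.com/videos/how-to-rollerblade</loc>
    <video:video>
      <video:thumbnail_loc>http://www.example.com/thumbs/rollerblade.jpg</video:thumbnail_loc>
      <video:title>How to Rollerblade</video:title>
      <video:description>A beginner's step-by-step guide to rollerblading.</video:description>
      <video:content_loc>http://www.example.com/video/rollerblade.mp4</video:content_loc>
    </video:video>
  </url>
</urlset>
```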
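Here's a hypothetical sketch of that redirect approach (in Python, using Flask; this is not local.com's actual implementation, and lookup_region() stands in for whatever geolocation service a site actually uses):

```python
from flask import Flask, redirect, request

app = Flask(__name__)

def lookup_region(ip_address):
    """Placeholder: map an IP address to a region slug like 'mountain-view-ca'."""
    return None  # a real implementation would query a geolocation service

@app.route("/")
def home():
    region = lookup_region(request.remote_addr)
    if region:
        # Send the visitor to a region-specific URL instead of varying
        # the content served at the single root URL.
        return redirect("/" + region, code=302)
    return "National home page"

@app.route("/<region>")
def regional_home(region):
    # Each region now has a stable URL that search engines can crawl
    # and index with its own local content.
    return "Local business information for " + region
```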
As we’ve seen throughout this book, ranking is based on what pages search engines have calculated to be most relevant to what they have determined the searcher is looking for. You can’t influence how the engines interpret searcher intent, but you can ensure your pages are as relevant as possible for target queries. The bottom line, of course, is to understand your customers and what they are looking for and provide exactly that. But on a more tactical level, relevance is based in large part on your site’s content and on external links.
Content Architecture

Relevance is determined in part by a page's content components. These components include page titles, meta descriptions, and heading tags, as described below. Key to making this work within an organization is ensuring that these components can be easily changed without requiring a code change or a new release. The content management system or other mechanism for building content should enable marketers, content writers, and others to easily change this text.
Page titles—This is what’s contained in the <title> element in the source code and appears in the browser title bar. This text is important because it’s what appears as the default title of a bookmark and generally is the title of the page in the search results. When possible, this tag should be formatted as follows:
Most important keyword in compelling phrase + Brand

For instance:
How to Rollerblade—eHow Videos
In the search results in Figure 7.4 [how to rollerblade], the first two results have titles that do a good job of describing what the page is about and the context (the site that the page is on). The third result is less trustworthy because the branding is unclear, and the result below the videos is both missing branding and doesn't include the main keywords I was searching for. (This doesn't mean that this fourth result, "How to learn to rollerblade in a safe way," is poor phrasing. The site owners may have determined that "how to learn" and "safe" were crucial for attracting their target audience. And in fact, a search for [how to learn to rollerblade] brings them up first.) Each page should have a unique title that focuses on the core topic for that page. The page title is likely the most important element on the page.
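In the page source, that title formula is simply the <title> element in the <head>:

```html
<head>
  <title>How to Rollerblade - eHow Videos</title>
</head>
```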
Meta description—The meta description attribute is contained in the page's source code and is often used as the description below the title in the search results.17 This is your chance to provide a targeted marketing message to engage potential customers. It should be unique to each page and should give context to the content of the page. Since the search results description real estate is limited, be concise, provide a compelling value proposition for the page, and use the primary keywords you've chosen for the page. However, don't make the mistake of repeating the keywords, as this doesn't help ranking and can make the result look spammy. The words from the search query are bolded in the description, which can draw the searcher's attention to your listing.
Search engines don't always use the meta description in the results. Depending on the query, other text from the page may be deemed more relevant and shown instead. Search engines will also pull text from elsewhere on the page if the meta description is too short.
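In the source code, the meta description is a single tag in the <head>; the content text here is purely illustrative:

```html
<meta name="description"
      content="Learn how to rollerblade: balance, stride, and stop safely,
               with step-by-step video demonstrations for beginners." />
```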
Let’s take a closer look at the [how to rollerblade] results.
The first result (from ehow.com) has an informative and compelling description. From the source code, we can see that this text comes directly from the meta description (see Figure 7.5).
The second result (wikihow.com) isn’t quite as succinct or compelling.
It wastes valuable real estate by repeating the brand (which is already evident in the title), it is cluttered by ellipses and unfinished phrases, and it has casing issues. Looking at the source code, the problem is clear. While the meta description tag exists, it's too short for Google to consider it meaningful on its own (see Figure 7.6). Therefore, Google has used it but has added content from other places on the page to construct a search results description. The repeated brand and mixed casing come directly from the site-provided meta description.
The title of the page ("How to Rollerblade") likely was dynamically added to the meta description. This practice can be a scalable way to create unique meta descriptions for a large number of pages, but when implementing a technical solution like this, review the finished product. In this case, the description could be made significantly better for users by simply adding a line of code that lowercases the title when it's inserted into the middle of a sentence.
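A hypothetical sketch of that one-line fix, shown here in Python (the template sentence is invented for illustration):

```python
def build_meta_description(page_title):
    """Insert a page title mid-sentence into a templated meta description,
    lowercasing it so the result reads naturally."""
    return "Learn {0} with step-by-step instructions and video.".format(page_title.lower())

print(build_meta_description("How to Rollerblade"))
# Learn how to rollerblade with step-by-step instructions and video.
```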
Headings—Whenever possible, pages should use semantic markup for clearer meaning for both search engines and devices such as screen readers. Headings are marked up in HTML using <H> tags. Generally a page has one <H1> tag, a few more <H2> tags, and so on, much like numbering an outline. Headings are important for a couple of reasons. Search engines can use the text in them to determine relevance, although how much weight they’re given may vary. More importantly, a descriptive heading can provide context to searchers to let them know they’re on the correct page. As you recall from the page content phase of the searcher workflow, this context is important for anchoring the visitor and ensuring they don’t bounce back to the search results.
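In markup, that outline-like structure looks like this (the heading text is illustrative):

```html
<!-- One <h1> for the page topic, <h2>s for major sections, and so on,
     much like numbering an outline. -->
<h1>How to Rollerblade</h1>
<h2>Choosing Your Skates</h2>
<h2>Learning to Stop Safely</h2>
<h3>The Heel Brake</h3>
```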
Content—Each page should contain enough text to provide valuable information to the visitor. This text should use the language of the visitor (based on keyword research).
There are good reasons for syndicating content. Syndication can bring traffic, exposure, and sales.
If you’re a blogger, you might syndicate your posts to get wider distribution. If your posts are seen by a bigger audience, you might gain some of those readers for yourself. If your site provides authoritative resources, you might have a partnership with other sites that want to include that content. And if you sell products, you might provide affiliates with content feeds, which in turn brings in additional revenue.
But What Should Rank?

But from a search engine perspective, syndication can cause a bit of a conundrum. If what you wrote is a relevant result for a search, the search engine wants to show it to the searcher—but not show it twice (or three times, or maybe even a thousand times in the case of an affiliate feed). And that makes sense. If you're searching for something, you don't want multiple results that all lead to the same content, even if that content is on different sites.
So what’s a search engine to do?
Search engines generally identify duplicate results and filter out all but one. They have lots of ways to decide which version to show. They try to figure out which one is the "original" by looking at things like which version was published first and which has the most links pointing to it.
Your content may also appear on other sites in ways you didn't arrange (such as when your RSS feed has been scraped), and search engines try to account for that too by looking at signals such as which site is more authoritative.
How Can You Make Sure Your Site Ranks First?

So what do I suggest you do if you're syndicating content but want your original version to rank above the syndicated ones?
- Create a different version of the content to syndicate than what you write for your own site. This method works best for things like product affiliate feeds. I don't think it works as well for things like blog posts or other types of articles. Instead, you could do something like write a high-level summary article for syndication and a blog post with details about that topic for your own site.
- Always include absolute links back to your own site in the body of the article. This is particularly helpful when your content is scraped.
- Ask your syndication partners to block their version of your article (via robots.txt, a robots meta tag, or a rel=canonical attribute that points back to your site, as shown below). If you are able to, put together a syndication agreement that states they get your content as a benefit for their readers, not as a way to acquire search traffic for that content; then you can keep control of ranking for what you've written and they can provide a benefit to their audience.
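For instance, a syndication partner could include either of these in the <head> of their copy of your article (the URL is a placeholder):

```html
<!-- Point engines at your original version... -->
<link rel="canonical" href="http://www.yoursite.com/original-article" />

<!-- ...or keep their copy out of the index entirely. -->
<meta name="robots" content="noindex" />
```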
Maintain control. If search is not yet a large acquisition channel for your site, you may not mind if another site ranks for your material, since you may get more traffic from the syndicated site (so make sure you at least have a link back to your site). But as your site starts to stand on its own and search traffic starts growing, you will want more control. So think of your longer-term strategy when you negotiate syndication partnerships, and don't hand others all the control over the content you work so hard to create.
Content That’s Not Unique
Syndication isn’t the only instance in which duplication can cause ranking problems. Another common issue comes up with e-commerce sites. Many companies use manufacturer databases or product feeds to describe products for sale. After all, it may not be reasonable to write unique product descriptions for millions of products. But remember that thousands of other sites may be using those identical product descriptions. How do search engines decide which to rank?
You can try to outrank every site that uses the same product descriptions, but a better strategy may be to add unique value on top of that content. For instance, you can aggregate information to provide comparisons between products or brands. You can also enable user-generated content so visitors can provide reviews, ratings, and comments. This information adds value to the page beyond the boilerplate description.
Search Engine Tools for Webmasters
All three major search engines have tools and educational resources for site owners. These tools provide diagnostic reports, statistics, and ways of providing input to the engines. You can find these tools at:
Google Webmaster Central: google.com/webmasters
Microsoft Bing Webmaster Center: webmaster.bing.com
Yahoo! Site Explorer: siteexplorer.search.yahoo.com