Your web metrics are all wrong, and they'll never be all right!


You have a web site, so what!
Having a web site is no longer a competitive advantage, and you know that. Managers are expected to quantify the impact of their web properties on the bottom line of their businesses, and as a result the web analytics industry is booming (the industry is projected to reach $1 billion by 2006).

Four years and more than 100 site analyses later, I've come to the conclusion that web analytics will never be an exact science, even though its influence in the decision-making process is growing.

There are issues inherent in web data analysis that can make it difficult to get accurate, insightful data. Most people are not aware of many of these issues. This article's purpose is to make you aware of the pitfalls associated with web data analysis, so you can account for their impact or fix them.

This week, we'll look at some of the ways your data can be skewed by the very technology it seeks to leverage. Below, I will list some of the major technology issues that may prevent you from having 100% accurate web traffic data. I will also illustrate how these issues can impact your business.

1. AOL Proxy servers
AOL proxy servers are a killer to traditional web site analytics programs. If you want to learn more about them, go here, but here is a summary (my explanations are italicized) from AOL:

"When a member (AOL user) requests multiple documents for multiple URLs (web pages, PDFs, etc.), each request may come from a different proxy server (A different IP address). Since one proxy server can have multiple members going to one site, webmasters should not make assumptions about the relationship between members and proxy servers when designing their web site."

Implication on your business:
If one AOL user views 10 pages on your site, your web site analytics tool could be misled into thinking that 10 different users came to your site and each viewed only one page.

I know that when I see a lot of single page loads to the home page, I start considering minor changes to get people to click further. Incorrect data can lead you to make the wrong decisions.

The reports that quantify number of visitors, or unique users, have the potential to be highly inaccurate, if you do not account for or fix the AOL proxy server issue. While many corporations and some ISPs use proxy servers, this sort of scenario does not pose a problem because the number of users coming through non-AOL proxy servers to web sites is small. But for AOL it presents a huge problem, as AOL drives up to 50% of the traffic on some sites I analyze.
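To make the proxy problem concrete, here is a minimal Python sketch of what an IP-based tool would report for the scenario above. The log format, the page names, and the IP addresses (taken from a range reserved for documentation) are all made up for illustration; a real tool's logic will be more involved.

```python
from collections import Counter

# Hypothetical log: one AOL member views 10 pages, but every request
# arrives from a different proxy IP address (all values are made up).
requests = [(f"192.0.2.{i}", f"/page{i}.html") for i in range(1, 11)]

def visitors_by_ip(log):
    """Count page views per 'visitor' when a visitor is defined as an IP address."""
    return Counter(ip for ip, _page in log)

per_ip = visitors_by_ip(requests)
print(len(per_ip))           # number of apparent visitors
print(max(per_ip.values()))  # most pages viewed by any apparent visitor
```

One person viewing 10 pages is reported as 10 visitors who each viewed a single page, which is exactly the pattern that could tempt you into redesigning a home page that is working fine.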

2. Random spiders
Most log file analysis tools recognize Googlebot (Google), Scooter (AltaVista), Slurp (Inktomi), and the other major search engine spiders (a spider is an automated program designed to gather information on web pages). When one of them visits a site, the tool knows that the visitor viewing web pages is an automated program and not a legitimate user. Typically, most decent analysis tools filter out data resulting from spiders automatically.

This is good news! Using a tool that automatically recognizes and filters out automated spiders helps you get closer to reporting, analyzing, and making decisions on 100% correct data, which as I mentioned earlier should always be a goal, even though it is a very elusive one.

Here's the bad news…any programmer can create a spider and send it to your site. There are thousands of unknown spiders, and some of them are crawling your site, inflating your web data even as you read this article. These unknown spiders usually get through the average web analytics tool filters because these spiders don't identify themselves as such, but instead appear as regular users.

E-mail harvesters are an example of random spiders. An e-mail harvester is an automated program that is built to traverse the web looking for e-mail addresses to add to its database to SPAM later. Ever wonder how a spammer got your e-mail address? One popular method is through automated spiders like e-mail harvesters.


Implication on your business:
What are the repercussions to your decision-making processes when a spider hits your site 1,000 times in 30 seconds and doesn't get filtered?

Well, think about it. If you just started an ad campaign or new marketing initiative, and suddenly a significant amount of traffic came to your site, you might attribute this success to your new ad campaign or marketing initiative. In actuality, the increases you saw in your data may have been the result of an unknown and unfiltered spider. Even worse, if you assumed that your campaign was a success, you might extend the campaign and spend more of your budget on it, essentially throwing money out of the window.

NOTE: If your web analytics solution does not require reading log files, but instead uses a small piece of JavaScript on each page, then you may not have this issue. Spiders typically can't read JavaScript and will not register in your web site analysis reports.

I recently spoke with a rep at IBM who works with their SurfAid analytics program, and he explained that their software uses logic to automatically filter out automated spiders. If a user loads more than a given number of pages in a certain period of time, then the user can be automatically filtered out. IBM's SurfAid team also keeps track of the growing list of spiders and updates their software to filter them out.

This is the first program I have seen that recognizes the importance of automatically filtering out suspicious activity, as it can lead to highly inaccurate data.
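I don't know how IBM implements this, but the general idea of a rate-based filter can be sketched in a few lines of Python. Everything here is an assumption for illustration: the function name, the input format, and the threshold of 100 hits in 30 seconds are mine, not IBM's.

```python
from collections import defaultdict, deque

def flag_fast_visitors(hits, max_hits=100, window_seconds=30):
    """Return the visitors who made more than max_hits requests inside any
    span shorter than window_seconds.

    hits: iterable of (visitor_id, timestamp_in_seconds), sorted by timestamp.
    """
    recent = defaultdict(deque)  # visitor -> timestamps still inside the window
    flagged = set()
    for visitor, ts in hits:
        window = recent[visitor]
        window.append(ts)
        # Discard requests that have fallen out of the time window.
        while ts - window[0] >= window_seconds:
            window.popleft()
        if len(window) > max_hits:
            flagged.add(visitor)
    return flagged

# A spider making 1,000 requests in 30 seconds, and a person making 5 in 5 minutes.
spider = [("spider", i * 0.03) for i in range(1000)]
human = [("human", i * 60.0) for i in range(5)]
hits = sorted(spider + human, key=lambda h: h[1])

suspects = flag_fast_visitors(hits)
print(suspects)
```

The right threshold depends on your site. Set it too low and you will throw away your most engaged human visitors along with the spiders.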

3. Frames
If your site is developed in frames, take your number of page loads and divide that number by three. That is how many pages may actually have been loaded on your site. A framed site typically loads three pages for every single page a user views on their screen. Frames tackle many problems with site development, but open up a slew of other issues with tracking and marketing your web site.

Implication on your business:
Guess what the above scenario can do to your data…triple it! If your site (or parts of it) is developed in frames, then any data you have reported for the site (or a part of it) may be tripled if you did not account for, or filter out, the additional pages loaded in the frameset. Ouch!
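Here is a rough Python sketch of the two corrections mentioned above: filtering out the extra frame files, or simply dividing by three. The file names are hypothetical, and the factor of three is the rule of thumb from this article, so check how many files your own frameset actually loads.

```python
# Hypothetical frame files that load alongside every content page.
FRAME_FILES = {"/frameset.html", "/nav.html"}

def content_page_views(requested_paths, frame_files=FRAME_FILES):
    """Count only the requests that are not frameset or navigation files."""
    return sum(1 for path in requested_paths if path not in frame_files)

def estimate_pages_viewed(raw_page_loads, loads_per_view=3):
    """Rough correction when the frame files cannot be filtered individually."""
    return raw_page_loads / loads_per_view

# One visitor views two content pages on a three-file framed site.
log = ["/frameset.html", "/nav.html", "/products.html",
       "/frameset.html", "/nav.html", "/contact.html"]

print(len(log))                         # raw page loads in the log
print(content_page_views(log))          # pages the visitor actually saw
print(estimate_pages_viewed(len(log)))  # the divide-by-three estimate
```

Filtering by file name is the more reliable of the two, because it does not depend on every view loading exactly three files.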

4. Flash & Dynamic sites
Flash is becoming increasingly popular as a web site development tool. Take a look at my favorite Flash site of all time: http://www.NeoStream.com. Once you are done being WOWED by the site, you will notice that as you move around it, the URL bar (where you type in web addresses in IE or Netscape) never changes.


Implication on your business:
While beautiful on the surface, developing a site in this manner will devastate your web analytics initiatives, because it will appear to your analytics tool that the entire site consists of only one page.

No matter how many different pages a user views, it will always appear as if they load the home page over and over again. Fortunately, not all Flash and dynamic sites are programmed in this manner, but many still are, and for those that are, true analysis can be difficult and sometimes impossible.

Analyzing conversion metrics, ROI for various marketing campaigns, top entry points into your site, user paths through the site, fall-off rates, and many other essential Internet business metrics is not usually possible without paying for additional programming changes to correct the one-page site dilemma.

5. Sharing secure certificates
When a user leaves the public area of your site and moves to a secure area (maybe where a credit card is processed), sometimes a very different, unique URL is used:

(I don't mean a user that goes from: http://www.mysite.com to https://www.mysite.com, but more like http://www.mysite.com to https://secure057.notmysite.net)

When secure transactions happen on someone else's server, on which you are "sharing" a secure certificate with others, you do not have access to that log data.

Implication on your business:
Once someone starts to buy a product or complete an application / lead and at some point leaves your site to go to the shared secure site, their activity starts getting tracked on that shared secure site. This means they will essentially "disappear" at some point in your site's log, giving you an incomplete view of user activity through one of the most important parts of your site: the conversion.

Getting this data from a shared hosting environment could prove rather costly, and may even be impossible depending on how flexible your web site host is.

The #1 question some of you may have asked after last week's article is:
"Should I have my site redeveloped so I can get more accurate data, which will lead me to making better decisions about my business?"

Here's my answer:
The more directly correlated your web site is to the success of your business, the more you should focus on having data as accurate as possible.

You are already half way there!
Realizing that your web data is going to be flawed, and realizing the technologies and how they can affect your analysis is half the battle.
Here are some specific ways you can "fix" web data problems.

1. Use cookies or unique user logins
Use cookies. Cookies help recognize true repeat visitors, and give marketers a much better handle on where a user's path begins and ends, how many pages a user viewed in a particular session, etc.

Using cookies with your analysis tool may take some time to configure, but stick with it. It has the additional benefit of alleviating the proxy server issue. I feel much more confident in web data when cookie technology is used in conjunction with a web analytics tool.

You can even go one step further and use unique user logins, but unfortunately this isn't always feasible. However if you are analyzing the effectiveness of an Intranet site, where users have to log in to get access, then unique user logins will help you get data that is even more accurate than cookies. Remember: Some cookies will be rejected or deleted occasionally by cookie washers (software designed to clean all cookies off of a computer), but the vast majority won't be, so your data should be pretty solid.
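As a rough illustration of why cookies help, here is a Python sketch that groups page views by a cookie ID and splits them into sessions. The input format, the cookie value, and the 30-minute inactivity cutoff are assumptions made for this example; your analytics tool will have its own settings.

```python
from collections import defaultdict

SESSION_GAP = 30 * 60  # assumed inactivity cutoff, in seconds

def sessions_by_cookie(hits, gap=SESSION_GAP):
    """Group page views into sessions per cookie ID.

    hits: iterable of (cookie_id, timestamp_in_seconds, page).
    Returns {cookie_id: [[page, ...], ...]}, one inner list per session.
    """
    by_visitor = defaultdict(list)
    for cookie_id, ts, page in sorted(hits, key=lambda h: (h[0], h[1])):
        by_visitor[cookie_id].append((ts, page))

    sessions = {}
    for cookie_id, events in by_visitor.items():
        visits = [[events[0][1]]]
        for (prev_ts, _prev_page), (ts, page) in zip(events, events[1:]):
            if ts - prev_ts > gap:
                visits.append([])  # long silence: start a new session
            visits[-1].append(page)
        sessions[cookie_id] = visits
    return sessions

# One visitor (hypothetical cookie "abc") views 10 pages, then returns two
# hours later. The IP address is not used at all, so proxies do not matter.
hits = [("abc", i * 20, f"/page{i}.html") for i in range(10)]
hits.append(("abc", 7200, "/home.html"))

result = sessions_by_cookie(hits)
print(len(result))                        # unique visitors
print([len(s) for s in result["abc"]])    # pages per session
```

Because the visitor is identified by the cookie rather than the IP address, the 10 page views stay attached to one person no matter how many proxy servers the requests passed through.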

2. Educate yourself on the basics of programming
The accuracy of your data hinges on how your site is programmed.

Learn just enough about web programming to be able to articulate to your programmers what the pitfalls are. Being able to "talk the talk" will help your programmers avoid some of the issues that come with dynamic and framed web sites. If your site uses Flash, explain to your designer that each time a link is clicked the URL bar should change. You don't need to understand how this gets done, but you do need to be able to articulate these kinds of changes in a language that your programmers and designers will understand.

3. Analyze your data for anomalies
Sort your visitors by page views in descending order, so the heaviest users appear at the top. If you see a user loading 21,000 pages in 1 hour, you may want to filter out that user, or that user's IP, because it is probably an automated spider. A human being is not likely to be able to load and read 350 pages per minute.
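The arithmetic behind that example (21,000 pages / 60 minutes = 350 pages per minute) is easy to automate. Below is a small Python sketch that ranks visitors by page views and flags the improbable ones. The function name, the input format, the sample IP addresses, and the cutoff of 60 pages per minute are assumptions for illustration only.

```python
from collections import Counter

def heaviest_visitors(page_views, hours, max_pages_per_minute=60):
    """Rank visitors by page views and flag improbable reading speeds.

    page_views: iterable of visitor IDs, one entry per page view.
    hours: length of the reporting period the page views cover.
    Returns a list of (visitor, pages, pages_per_minute, suspicious).
    """
    report = []
    for visitor, pages in Counter(page_views).most_common():  # highest first
        rate = pages / (hours * 60)
        report.append((visitor, pages, rate, rate > max_pages_per_minute))
    return report

# One hour of made-up traffic: a suspected spider and an ordinary visitor.
views = ["198.51.100.7"] * 21000 + ["203.0.113.5"] * 12

ranking = heaviest_visitors(views, hours=1)
for row in ranking:
    print(row)
```

Anything flagged here is a candidate for a closer look, not an automatic deletion. Check a few of the flagged visitors by hand before you filter them out of your reports.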

4. Focus on what matters
Conversions, e-mail newsletter signups, brand exposure, etc. Whatever matters to your business is what you analyze, period. Create a list of priorities and work on getting the most accurate data for the statistics that are most influential to your decision-making process.

5. Audit log file data with custom solutions
If you can get programmers to build an administration function that aggregates the number of times a thank you page was loaded, for example, and sorts that report by day, month, and year, it will be a tremendous tool for you to audit your web conversion stats.

Having a tool to aggregate a basic, but highly important, statistic for auditing purposes is very helpful in ensuring that the data you have is accurate, and the methodologies used to get that data are sound and able to be replicated.
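If it helps the conversation with your programmers, this is roughly what such an audit function could look like in Python. The thank you page path, the input format, and the function name are hypothetical; your site will need its own.

```python
from collections import Counter
from datetime import datetime

THANK_YOU_PATH = "/thank-you.html"  # hypothetical path for illustration

def conversions_by_period(hits, path=THANK_YOU_PATH):
    """Count loads of the thank you page per day, per month, and per year.

    hits: iterable of (datetime, requested_path).
    """
    by_day, by_month, by_year = Counter(), Counter(), Counter()
    for when, requested in hits:
        if requested != path:
            continue
        by_day[when.strftime("%Y-%m-%d")] += 1
        by_month[when.strftime("%Y-%m")] += 1
        by_year[when.strftime("%Y")] += 1
    return by_day, by_month, by_year

# A few made-up requests, three of which reach the thank you page.
hits = [
    (datetime(2003, 5, 1, 9, 30), "/thank-you.html"),
    (datetime(2003, 5, 1, 14, 5), "/thank-you.html"),
    (datetime(2003, 5, 2, 11, 0), "/products.html"),
    (datetime(2003, 6, 10, 16, 45), "/thank-you.html"),
]

by_day, by_month, by_year = conversions_by_period(hits)
for day in sorted(by_day):
    print(day, by_day[day])
```

Compare these counts with the conversion numbers your analytics tool reports for the same period. If the two disagree by much, one of them is counting something the other is not.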

6. Get an experienced Internet metrics analyst
If all this is too much detail for you, but you still see the value in having highly accurate web data, you may need to hire someone whose job is to fully understand your business and determine how to use web metrics to gauge success at achieving business goals. Ideally, the person in charge of analytics needs to be part programmer, part analyst, and part marketer rolled into one:

· The programmer understands what technologies can be implemented to provide accurate data that the business feels comfortable using to make decisions. They fully understand the implications of web site design on web analysis.
· The analyst understands the needs of the business and what data the business needs to prove what's working and what isn't. This person is likely to sit in on the high-level business meetings and will help develop some of the business strategy.
· The marketer is the one who understands the customer and how to entice them to come to the site and, more importantly, how to get them to take the desired course of action.

As someone who has analyzed log files for over four years, I still have to refer to my checklist of things to look out for when performing analysis. The technology is always changing, and every site is programmed differently.

Bonus:
As of this article's publication, I discovered that IE 6 often breaks up single PDF downloads into multiple smaller ones. I am still investigating the truth of this matter, but the data I currently analyze (over 20 sites) shows that users who download PDFs with IE 6 often appear in my analysis program as if they are downloading the same PDF 5-8 times in about a 30-second timeframe, when they are actually only loading it once. To learn what I find out about this and other data analysis issues, visit my site, and sign up for my newsletter.
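Until the cause is confirmed, one possible workaround is to count a burst of requests for the same PDF by the same visitor as a single download. The Python sketch below is only that: the 30-second window comes from my observation above, while the function name, the input format, and the sample file are assumptions.

```python
def collapse_repeat_downloads(hits, window_seconds=30):
    """Keep one request per burst of repeats for the same file by the same visitor.

    hits: iterable of (visitor, timestamp_in_seconds, path), sorted by timestamp.
    A request is dropped if it falls within window_seconds of the first
    request in the current burst.
    """
    burst_start = {}  # (visitor, path) -> timestamp that opened the burst
    kept = []
    for visitor, ts, path in hits:
        key = (visitor, path)
        if key in burst_start and ts - burst_start[key] <= window_seconds:
            continue  # part of the same burst: count it as the same download
        burst_start[key] = ts
        kept.append((visitor, ts, path))
    return kept

# Six requests for the same PDF within 25 seconds, then a real second
# download ten minutes later (all values are made up).
hits = [("visitor1", t, "/brochure.pdf") for t in (0, 5, 10, 15, 20, 25)]
hits.append(("visitor1", 600, "/brochure.pdf"))

downloads = collapse_repeat_downloads(hits)
print(len(hits), len(downloads))
```

This will also hide a person who genuinely downloads the same file twice within 30 seconds, so treat the result as an estimate.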


Sign up for my newsletter. It won't hurt!