Weblogs: Web Accessibility

SiteMorse fails due diligence

Thursday, July 07, 2005

SiteMorse have taken another swipe at the accessibility community. They have rerun their secretive script, and posted the results. From the looks of it, they haven't properly analysed the data before leaping into the press with conclusions like "How can everyone else be expected to achieve website accessibility, if the experts can't?"

SiteMorse learning a bitter lesson

It looks like SiteMorse have made a tiny bit of progress since their previous PR disaster. The last two press releases contained the following climb-down disclaimer:

No one at SiteMorse is saying that automated tests are the Holy Grail, simply if you can not pass the automated tests and for instance, have basic descriptions missing on images, how can you hope to achieve compliance?

What SiteMorse are suggesting is that passing their automated tests should be one milestone along the path to accessibility. The implicit suggestion is that you should use SiteMorse to identify a set of issues, fix those, then use manual checks to cover the guidelines SiteMorse doesn't test. It might sound reasonable, but in the real world, there's a massive flaw.

The weakness of automating human judgements

Automated test tools excel when the decision logic can be boiled down to a simple yes or no answer, and that answer can be determined using machine-accessible data. Is the page valid? There's no in-between: it's either yes or no. Automated tools can handle this, since the information they need to make the decision is unambiguously accessible to them. A missing alt attribute on an image is meat and potatoes for a script. Easy to assess, resulting in a simple yes or no. And the tests can be repeated ad nauseam.
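
To illustrate the distinction, here is a minimal sketch in Python - my own example, not SiteMorse's code - of the kind of binary check a script can make reliably: an image either has an alt attribute or it doesn't.

    # A minimal sketch (not SiteMorse's logic) of a reliable binary check:
    # flag <img> elements that have no alt attribute at all.
    from html.parser import HTMLParser

    class MissingAltChecker(HTMLParser):
        def __init__(self):
            super().__init__()
            self.missing = []  # (line, column) of each <img> without an alt attribute

        def handle_starttag(self, tag, attrs):
            if tag == "img" and "alt" not in dict(attrs):
                self.missing.append(self.getpos())

    checker = MissingAltChecker()
    checker.feed('<p><img src="logo.png"><img src="office.jpg" alt="Our office"></p>')
    print(checker.missing)  # the first image is flagged, the second passes

No judgement is involved: the attribute is either present in the markup or it isn't.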

Where automated tools fail is when human judgement is required. For example, is the alternative content on an image a reasonable equivalent to the image content? This requires two human judgements: evaluating what the content of the image is, and comparing it to the textual equivalent to see whether the two provide the same information.

Of course, if companies like SiteMorse could guarantee that their tools test only the questions with a binary answer based on simple machine-accessible data, then those tools would be almost perfect, yet limited. (The "almost" is a caveat because scripts are written by humans, and humans do make mistakes.)

Yet the SiteMorse tool isn't limited to questions it can reliably answer. From some of the in-depth reports they provide, it's clear the SiteMorse tool makes educated guesses - attempts at human judgement - on certain questions. For instance, it tries to decide whether the alt attribute of an image contains reasonable textual content.

Automated falsehoods

The problem with using an automated tool is that whenever it has to make a decision, or a snap judgement, it is open to making a mistake. Even if it is 95% accurate in its decision making, the other 5% of wrong decisions can seriously undermine the benefit of using an automated tool. There are two types of wrong decisions in automated tools: false positives and false negatives.

A false positive is when something correct is reported as being incorrect. For example, say you have a spam filter for your email. It looks for spam and either moves it out of your inbox into a pending folder or deletes it. A false positive is an email from your best friend that the spam filter decides is spam. Making this mistake has serious repercussions: if you don't regularly scan through your spam folder looking for these false positives, you may end up not reading your friend's email. If you trust the spam filter enough to delete emails it considers spam, you'll never even see the mistake. False positives are a bad thing for spam filters; for automated accessibility checkers they are less dangerous, but far more frustrating.

A false negative is when something incorrect is marked as being correct. In our spam filter example, a false negative is a piece of spam that the filter doesn't consider spam, so it puts it in your inbox and you have to check it manually and delete it yourself. It sounds like a small problem, but if you don't adjust your filter and teach it that this particular email is spam, it will keep making the same wrong decision. For automated accessibility checkers, this is the major problem.

Dealing with false positives

Dealing with false positives is easy. You work your way through the list of errors that SiteMorse have produced. Before you "fix" the problem, you evaluate whether the issue raised is actually a problem. This is absolutely essential on checkpoints where a degree of human judgement is required. Automated scripts can regularly make mistakes when trying to make judgement calls.

Of course, a missing alt attribute is pretty straightforward for a script to detect. Right? Yet, I have a documented example of Bobby reporting missing alt attributes, and they were actually all present and accounted for!

So even on tests where automated tools should excel, we should not take it for granted that these tests have been done correctly. It also means that results on tests requiring human judgement should always be manually checked and confirmed before any corrections are made.

SiteMorse initially slated the Guild of Accessible Web Designers website for failing even the most basic level of accessibility. GAWDS actually scored 99% compliance on SiteMorse's Level A checks. The one "failure" that was reported was an allegedly inappropriate alt attribute on an image. A manual check showed that SiteMorse was in error: the report was a false positive. A GAWDS whitepaper covers this particular issue.

The downside of false positives is that the same "error" will be picked up the next time the automated script is run. What use is an error report that continually flags up items which have already been manually checked and confirmed not to be errors at all?

One solution is to give in to the tool, and change the particular item so that the tool no longer flags it up. But then we are making a change to satisfy a mistaken tool, and not for accessibility reasons. This path leads to SiteMorse-optimised websites.

The Red Ant homepage is a classic example of a SiteMorse-optimised website. It scores 100% in both the Single-A and Double-A automated tests, and the site itself claims Triple-A conformance, yet the homepage alone fails 17 individual checkpoints: three of them Priority 1 issues and ten of them Priority 2 issues. That's a shocking failure for a supposedly perfect site, but a very common characteristic of SiteMorse-optimised websites.
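
The less destructive alternative is to keep your own record of issues that have been manually checked and confirmed as false positives, and filter each new automated report against it. A hypothetical sketch - SiteMorse offers nothing of the sort, as far as I can tell:

    # Hypothetical triage: suppress issues a human reviewer has already
    # confirmed are not errors, so each new report only lists items that
    # still need a manual check.
    confirmed_false_positives = {
        ("/index.html", "img-alt-suspicious", "logo.png"),
    }

    def triage(report):
        needs_review, already_cleared = [], []
        for issue in report:
            key = (issue["page"], issue["check"], issue["element"])
            if key in confirmed_false_positives:
                already_cleared.append(issue)
            else:
                needs_review.append(issue)
        return needs_review, already_cleared

    report = [
        {"page": "/index.html", "check": "img-alt-suspicious", "element": "logo.png"},
        {"page": "/about.html", "check": "img-alt-missing", "element": "team.jpg"},
    ]
    todo, cleared = triage(report)
    print(len(todo), "to review,", len(cleared), "previously cleared")

At least then the manual due diligence only has to be done once per item, instead of after every run.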

Dealing with false negatives

False negatives are more difficult to spot. An automated tool may miss that a particular item in a page is an accessibility problem - even if it is a checkpoint that it should always get right. We cannot assume an automated tool has found all the occurrences of a particular error.

In the Bobby example above, the tool also missed the fact that a form on the page was totally dependent on JavaScript. It should have detected this, but failed. It was only because I manually checked the results that I uncovered Bobby's error.
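
As a rough illustration - my own sketch, not Bobby's actual logic - a heuristic for that kind of check might flag forms with a javascript: action, or forms with no ordinary submit control. Even then, the result is only a candidate for manual review, not a verdict:

    # A rough heuristic (not Bobby's logic) for forms that may depend
    # entirely on JavaScript: a javascript: action, or no ordinary
    # submit control inside the form. A human still has to confirm it.
    from html.parser import HTMLParser

    class ScriptDependentFormChecker(HTMLParser):
        def __init__(self):
            super().__init__()
            self.suspect_forms = []   # (line, column) of forms to review
            self._in_form = False
            self._has_submit = False
            self._form_pos = None

        def handle_starttag(self, tag, attrs):
            attrs = dict(attrs)
            if tag == "form":
                self._in_form = True
                self._has_submit = False
                self._form_pos = self.getpos()
                if attrs.get("action", "").startswith("javascript:"):
                    self.suspect_forms.append(self._form_pos)
            elif self._in_form and tag in ("input", "button"):
                # <button> defaults to type="submit"; <input> does not.
                default = "submit" if tag == "button" else ""
                if attrs.get("type", default) == "submit":
                    self._has_submit = True

        def handle_endtag(self, tag):
            if tag == "form":
                if not self._has_submit and self._form_pos not in self.suspect_forms:
                    self.suspect_forms.append(self._form_pos)
                self._in_form = False

    checker = ScriptDependentFormChecker()
    checker.feed('<form action="javascript:doSearch()">'
                 '<input type="text" name="q">'
                 '<a href="#" onclick="doSearch()">Go</a></form>')
    print(checker.suspect_forms)  # flagged for manual review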

Manual due diligence

From examining the repercussions of false positives and false negatives, it becomes clear that the value of an automated test lies in manual confirmation. That is the only way to validate whether the automated checks are correct, and the only way to determine which of the SiteMorse-flagged errors are actually accessibility problems.

Judging from the limited information in the press releases, there's no evidence that the findings of the SiteMorse tool have been manually checked by accessibility experts before publication. It seems like the people behind SiteMorse rely solely on the tool they have produced. It all suggests that the publication of these reports didn't include a manual due diligence check of the results.

The shadow of SiteMorse

The SiteMorse product isn't publicly visible. There's no mention of pricing on their site. The website itself makes a series of unsubstantiated claims. Also, there's very little sign of the tool being peer-reviewed by accessibility experts.

The SiteMorse product itself looks to be closed source, so it's impossible for experts to analyse the logic to determine how accurate its automated test functions really are. The only public evidence of the developers' expertise is a cluster of superficial questions posted to Usenet groups.

There's also no publicly available engine to evaluate, so there's no way of actually testing SiteMorse itself. All we have to go on are the biased press releases from SiteMorse themselves, and sound bites from clients who show no sign of a deep understanding of accessibility.

Contact with SiteMorse seems to be via email only. You have to provide them with your contact details before they will decide whether to run a 10-page test against a nominated website. That makes it fundamentally impossible to do a proper evaluation of the SiteMorse tool.

So the safest course of action is to avoid the SiteMorse product, at least until SiteMorse decide to be more open about the details of the product, preferably without drowning them in marketing speak.

A more publicly accessible SiteMorse

SiteMorse, on the other hand, could make their product more publicly accessible. They could start with a detailed list of the checkpoints that are tested and how they are tested, and, where these checkpoints require human judgement, an explanation of the logic the script uses to arrive at a decision.

The second improvement is for SiteMorse to manually check the results, and confirm that what's been flagged as an error really is an accessibility error. They should be doing this manual due diligence before releasing a report criticising others.

SiteMorse also need to get over the fact that the DRC will not endorse their product. This pointless bickering is not winning SiteMorse any friends or potential clients.

[Update 7th July 2005] SiteMorse have edited their initial press release. The main change was to remove the reference to GAWDS from the list of accessibility organisations that failed the accessibility checks, replacing it with a spokesperson's comment on how amazing it is to see the GAWDS website score the highest the first time it was tested (with no mention that SiteMorse scored GAWDS at 99% and 96% on the Single-A and Double-A checkpoints, or that GAWDS have proved the one Level-A failure to be a mistake on SiteMorse's part).

But the damage has already been done: the media have already picked up the incorrect first edition of the SiteMorse press release. The SiteMorse automated tool has got it wrong - now there's an appropriate headline.
