Web Filtering and AI Evaluation

By Smoothwall Education
Published 15 August, 2019

These days it’s unusual to find a web filter vendor that doesn’t use machine learning or artificial intelligence somewhere in its products. But how can you compare them?

Artificial intelligence systems are essential for keeping up with user-generated content and the ever-evolving list of filter avoidance tools. These systems are usually effective against types of content that are widespread and look much alike from site to site, such as pornographic material, gambling sites or anonymizer tools.

It’s difficult to compare the underlying technology, however, largely because AI can be used in so many different ways.

For example, a vendor might use closed-loop learning or human-directed learning, with various models underneath, from a simple hidden Markov model (HMM) to something built with TensorFlow. All of these techniques can be applied well or poorly.

The most important question to ask is where your filter applies these AI techniques.

It’s commonly in one of two areas:

In line with the web filtering, in real time

Real-time filtering is either baked into a network appliance or runs as part of a filtering client. You’ll see occasional updates to the rules database, but other than that, the filter makes all the decisions locally.

Out-of-band offline processing 

With out-of-band intelligence, uncategorised URLs are fed back to the filter vendor, and the site is then visited by an automated web crawler or “spider”. The results are then passed through the intelligent system, and a categorisation attached to the URL. The categorisation makes it back to the point of filtering in regular URL list updates. 
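
As a rough illustration of the two placements described above, here is a minimal Python sketch. Everything in it is hypothetical: `classify` is a toy keyword matcher standing in for whatever model a vendor actually uses, and the names `inline_filter`, `OutOfBandFilter` and `crawl` are invented for this example rather than taken from any real product.

```python
from collections import deque

# Hypothetical stand-in for an AI model: flags text containing any listed term.
BLOCKED_TERMS = {"casino", "proxy"}


def classify(page_text: str) -> str:
    """Toy classifier; a real product would call a trained model here."""
    words = set(page_text.lower().split())
    return "blocked" if words & BLOCKED_TERMS else "allowed"


def inline_filter(page_text: str) -> str:
    """Inline: classify the content the user is about to see, at request time."""
    return classify(page_text)


class OutOfBandFilter:
    """Out-of-band: decide from a URL list and queue unknown URLs for later."""

    def __init__(self):
        self.url_list = {}      # url -> category, refreshed by list updates
        self.pending = deque()  # uncategorised URLs fed back to the vendor

    def check(self, url: str) -> str:
        if url in self.url_list:
            return self.url_list[url]
        self.pending.append(url)
        return "uncategorised"

    def run_update(self, crawl) -> None:
        """Vendor side: crawl each queued URL, classify it, publish the result."""
        while self.pending:
            url = self.pending.popleft()
            self.url_list[url] = classify(crawl(url))
```

In this sketch the inline path returns a decision on the first request, while the out-of-band path answers "uncategorised" until `run_update` has been called, which mirrors the delay between a URL being fed back and the next list update.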

	Inline	Out-of-band
Speed of Reaction	Instant. Any filtering decision is applied straight away, leaving no opportunity for harmful content to get by.	Hours. Unknown content is queued until the offline process runs, and filtering catches up at the next regular update.
Effectiveness: Real-time Content	Excellent. Real-time or rapidly changing content is reassessed each time, so the decision is made against up-to-date data.	Poor. The categorisation of a site is generally either permanently fixed or fixed for months, which leaves sites with changing content open to misclassification.
Effectiveness: Context	Weak. Inline filters only see one page at a time and can’t make decisions based on what that page links to.	Strong. With plenty of time to make a decision, an out-of-band filter can download links and images.
Effectiveness: Logged-in Content	Excellent. As these filters work on the data the user sees, even content behind a login, such as a forum or social media, gets scanned.	Useless. The out-of-band filter sees only the login page, which rarely provides any actionable content.
Additional Latency	Low. Adding intelligence usually adds some latency to each request, but a properly designed system limits this so that the user doesn’t notice it.	Zero. As all intelligence is out of band, there’s no additional latency.

Looking at this table, it’s clear that an inline filter is far more effective against today’s web, which is increasingly volatile and often behind a login. It’s also worth noting that an inline approach does not preclude additional out-of-band filtering: if you can find a vendor that combines the two, you will get the best of everything.
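
To make the combined approach a little more concrete, the sketch below shows one way the two signals could be merged. It is only an assumption about how such a filter might behave, not a description of any vendor’s implementation: `combined_decision`, `toy_page_classifier`, the category names and the example URLs are all made up for illustration.

```python
def toy_page_classifier(page_text: str) -> str:
    """Hypothetical inline model: returns a category for the page content."""
    return "anonymizer" if "proxy" in page_text.lower() else "unknown"


def combined_decision(url, page_text, url_categories, blocked_categories,
                      classify_page=toy_page_classifier):
    """Block if either the out-of-band URL list or the inline check says so."""
    # Out-of-band signal: category attached to the URL by a past list update.
    if url_categories.get(url) in blocked_categories:
        return "blocked"
    # Inline signal: classify the content actually being served right now.
    if classify_page(page_text) in blocked_categories:
        return "blocked"
    return "allowed"


# Example data (invented): one URL already categorised out of band.
url_categories = {"http://example.test/bets": "gambling"}
blocked_categories = {"gambling", "anonymizer"}
```

With this arrangement, a URL that is already on the list is blocked without inspecting the page, and a URL the list has never seen still gets an immediate inline decision.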

But how can you choose a vendor? Not only do you want the right technical capabilities, you also need to make sure the vendor can meet the requirements set out in Keeping Children Safe in Education (KCSIE) and the UK Safer Internet Centre guidance.

Additionally, the vendor you choose must be able to deploy your solution in a way that suits your school, college or multi-academy trust (MAT). That’s a lot to consider.

To help you make the right decision, our free whitepaper talks you through the process of choosing a vendor and selecting the right deployment option for you.