Web Content Filtering and Censorship

Traffic filtering might be a legal or compliance requirement for some networks; it might be deployed to prevent the network being used to access illegal content, or to protect against browsers downloading harmful data from particular servers. Conversely, individuals might want to set up something like Privoxy to protect their privacy and gain some control over the data they share with third parties.

Of course, sometimes traffic filtering is done for political or ideological reasons.
For example, research by Harvard Law School, Dynamic Internet Technology and other members of the Global Internet Freedom Consortium (probably outdated by now) suggests the majority of sites inaccessible from within China are blocked mainly for political reasons.

It's important to understand roughly how the different types of traffic filtering work, and how they are often combined in traffic blocking systems.
From the client’s point of view, a given connection is either allowed or disallowed, typically according to some rule-based system acting on certain layers within the TCP/IP packets being sent or received. Commercial systems tend to combine several layers of filtering.

A TCP/IP packet has an Internet Protocol (IP) component, which deals with how the packet is routed - its header includes the source address, destination address and TTL fields. The TCP component deals with the data being communicated, carrying the payload along with header fields for the port numbers, session control flags and so on.
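
As a rough illustration, the following Python sketch pulls those fields out of a raw IPv4/TCP packet using the standard struct module. It assumes an IPv4 header with no options and doesn't verify that the protocol is actually TCP; it's only meant to show which fields a filter has available to act on, not to be a complete parser.

```python
import struct

def parse_headers(raw: bytes) -> dict:
    """Extract the fields a rule-based filter typically inspects from a raw
    IPv4/TCP packet. Assumes IPv4 with no IP options; a sketch, not a full
    parser."""
    # IPv4 header: version/IHL, DSCP/ECN, total length, ID, flags/fragment
    # offset, TTL, protocol, checksum, source address, destination address.
    ver_ihl, _, _, _, _, ttl, proto, _, src, dst = struct.unpack(
        "!BBHHHBBH4s4s", raw[:20])
    ihl = (ver_ihl & 0x0F) * 4
    # TCP header: source port, destination port, sequence number,
    # acknowledgement number, data offset / reserved / flags.
    src_port, dst_port, _, _, off_flags = struct.unpack(
        "!HHIIH", raw[ihl:ihl + 14])
    data_offset = (off_flags >> 12) * 4
    return {
        "src_ip": ".".join(str(b) for b in src),   # IP layer: routing fields
        "dst_ip": ".".join(str(b) for b in dst),
        "ttl": ttl,
        "protocol": proto,                         # 6 means TCP
        "src_port": src_port,                      # TCP layer: session fields
        "dst_port": dst_port,
        "flags": off_flags & 0x01FF,               # SYN, ACK, FIN, etc.
        "payload": raw[ihl + data_offset:],        # the content being sent
    }
```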

Additionally, before the TCP/IP packet is even sent, the client will usually send a DNS request to resolve the hostname in the URL to an IP address. There are methods of preventing communication by blocking these requests for certain domains or URLs.
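
That lookup can be reproduced with Python's standard socket module, as in the sketch below (example.com stands in for the site being visited); a filter that intercepts or blocks this step stops the client ever reaching the web server, without inspecting any web traffic at all.

```python
import socket

# The hostname is resolved before any TCP connection is made; blocking or
# tampering with this lookup prevents the connection from ever being attempted.
hostname = "example.com"
addresses = sorted({info[4][0] for info in
                    socket.getaddrinfo(hostname, 443, proto=socket.IPPROTO_TCP)})
print(hostname, "resolves to", addresses)
```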

A paper called The Great Firewall Revealed (Global Internet Freedom Consortium) discusses the three main methods in detail, but the following are basic descriptions of how each works and how it can be countered individually.

IP Address Filtering

The destination IP address is checked against a blacklist of known addresses, whether they belong to blocked sites or to proxy servers. In larger networks this is more commonly deployed at the gateway level, as TCP inspection can be resource-intensive under high loads. Even in relatively advanced systems, sites must be reviewed and added to the blacklist manually, which makes this method ineffective against newly available proxy services that haven't yet been discovered.
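
A hypothetical gateway-level check might look something like the following Python sketch, with documentation address ranges standing in for a blocked site and a known proxy server.

```python
from ipaddress import ip_address, ip_network

# Manually maintained blacklist; real systems would hold many thousands
# of entries, but the principle is the same.
BLACKLIST = [
    ip_network("203.0.113.0/24"),   # range belonging to a blocked site
    ip_network("198.51.100.7/32"),  # address of a known proxy server
]

def is_blocked(dst: str) -> bool:
    """Return True if the destination address falls inside any blacklisted range."""
    addr = ip_address(dst)
    return any(addr in net for net in BLACKLIST)

print(is_blocked("203.0.113.42"))   # True  - matches a blacklisted range
print(is_blocked("192.0.2.10"))     # False - an undiscovered proxy slips through
```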

TCP Filtering

This works by inspecting the packets being routed to determine whether given keywords exist in the payload. This is how the traffic filtering system decides whether to block or allow a site based on the content of its web pages, or whether the user has submitted certain information or queries. It can usually be defeated by encrypting the payload, typically through SSL/HTTPS by changing the URL from ‘http://’ to ‘https://’, assuming the browser doesn’t have a root certificate installed that allows the organisation doing the filtering to intercept and decrypt the traffic.
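
A naive version of this kind of content check might look like the sketch below; the keyword list and payloads are purely illustrative. Once the payload is TLS ciphertext, there is nothing readable left for the filter to match.

```python
KEYWORDS = [b"proxy", b"anonymizer"]

def payload_blocked(payload: bytes) -> bool:
    """Crude content check of the kind a TCP filter performs."""
    lowered = payload.lower()
    return any(kw in lowered for kw in KEYWORDS)

# Plain HTTP: the request line and page content are visible to the filter.
print(payload_blocked(b"GET /web-proxy/ HTTP/1.1\r\nHost: example.net"))  # True

# HTTPS: the same request arrives as TLS ciphertext, so no keywords match.
print(payload_blocked(bytes.fromhex("170303002a8f3bd1")))                  # False
```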

Domain Redirection

DNS and URL blocking are slightly different - one works on the TCP/IP packets and the other on DNS lookup requests. Here the URL is scanned for keywords, or the domain is compared against a blacklist, and the request is then resolved, redirected or dropped accordingly. Many otherwise decent proxy services become unavailable on certain networks simply because their URLs contain the word ‘proxy’.
Certain domains can also be mapped to incorrect IP addresses in the local DNS, perhaps causing URLs to be redirected to some error/warning page instead of the actual site. This is easily defeated if the IP address itself is known in advance, or if the client can be configured to use a different DNS service.
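
For instance, assuming the site's real address is already known, the poisoned local resolver can be skipped entirely by connecting to the IP directly and supplying the hostname in the Host header, as in the sketch below. The address is a placeholder from the documentation range, and over HTTPS this becomes more awkward because of SNI and certificate checks.

```python
import http.client

# Placeholder address - substitute the site's actual IP, obtained in advance
# or via an unfiltered DNS service.
KNOWN_IP = "203.0.113.80"

# Connect straight to the IP, bypassing the local DNS entirely; the Host
# header tells the web server which site is being requested.
conn = http.client.HTTPConnection(KNOWN_IP, 80, timeout=10)
conn.request("GET", "/", headers={"Host": "example.com"})
response = conn.getresponse()
print(response.status, response.reason)
conn.close()
```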

An Overview of Proxy Servers

A basic proxy server simply relays traffic between the client and the server, effectively enabling the client to access a blocked server through a different IP address and URL that the filtering system doesn’t recognise. The commonly available web proxies are servers running PHP forms into which users enter the URL of whatever site they wish to visit.
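
The relay idea itself is simple. The following is a minimal Python stand-in for the sort of fetch-and-return script such services run (the real ones are usually PHP and do far more, such as rewriting links and handling cookies); it is a sketch of the principle rather than something to deploy.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse
from urllib.request import urlopen

class RelayHandler(BaseHTTPRequestHandler):
    """Fetches the page named in the ?url= query parameter and relays it back,
    so the filtering system only ever sees a connection to this server."""

    def do_GET(self):
        target = parse_qs(urlparse(self.path).query).get("url", [""])[0]
        if not target.startswith(("http://", "https://")):
            self.send_error(400, "Expected ?url=http://... or ?url=https://...")
            return
        try:
            # Fetch the blocked site on the client's behalf.
            with urlopen(target, timeout=10) as upstream:
                body = upstream.read()
                content_type = upstream.headers.get("Content-Type", "text/html")
        except Exception as exc:
            self.send_error(502, f"Could not fetch {target}: {exc}")
            return
        # Relay the response back to the client unchanged.
        self.send_response(200)
        self.send_header("Content-Type", content_type)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    # e.g. visit http://localhost:8080/?url=https://example.com/
    HTTPServer(("0.0.0.0", 8080), RelayHandler).serve_forever()
```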

Most proxies themselves shouldn’t be trusted to relay sensitive data, such as bank account information or login details. And if a law enforcement agency has a compelling reason to determine your identity, it can more than likely obtain your IP address by acquiring the server logs from whoever operates the proxy.

How Web Content Filtering Can Be Defeated

Each of the countermeasures above is effective against one method of filtering, and they can be combined to get around reasonably advanced traffic filtering. The objective here is to find a proxy service that uses SSL/HTTPS, sits at an unrecognised IP address, and has a URL that doesn’t contain any keywords suggesting its function.

  1. The first step is to consult a search engine, such as Google or StartPage, and get a list of currently available proxies. It might be necessary to copy and paste the listed URLs into the address bar and change ‘http://’ to ‘https://’. More often than not, there’ll be warnings about invalid SSL certificates, but they can be ignored here.
  2. Unfortunately, most of the URLs listed will contain the word ‘proxy’ or other keywords that suggest their purpose. What’s needed are the ones without a suggestive URL.
  3. The next problem is that the vast majority of web-based proxies have home pages containing keywords, so TCP filtering will cause this traffic to be blocked. Therefore, the TCP payload must be encrypted to prevent the content being scanned. This can be done by copying and pasting the selected URLs from the proxy list into the browser’s address bar, again replacing ‘http://’ with ‘https://’ (as sketched below). In most cases there’ll also be warnings here about dodgy SSL certificates, but these can be ignored unless there’s a real need to authenticate the server.
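
Steps 2 and 3 can be reproduced outside the browser as well. The sketch below fetches a candidate proxy page over HTTPS with Python's standard library, with certificate verification switched off to mirror clicking through the browser's warning; the URL is a placeholder, not a real service.

```python
import ssl
from urllib.request import urlopen

# A candidate from the proxy list, switched from http:// to https://.
# Placeholder URL - substitute one found in step 1.
candidate = "https://www.example-relay-site.net/"

# Equivalent of ignoring the browser's certificate warning: hostname checking
# and certificate verification are disabled, so the server isn't authenticated.
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

with urlopen(candidate, context=ctx, timeout=15) as resp:
    page = resp.read()

# The request and response now travel as TLS ciphertext, so a TCP filter
# scanning payloads for keywords has nothing to match against.
print(resp.status, len(page), "bytes received")
```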