I have not had time to work on phpSATk for quite a while (there will be some progress soon ... promised). The squid negotiate helper is more or less ready (I have been using it in a test environment for some time now), and a publicly available version will follow soon (the impatient might have a look at the svn version). My PHP CRL patch will probably be included in PHP HEAD as soon as I manage to put together some test cases, and possibly in 5.3 afterwards.
But the project I spent the most time on in the last few weeks was building an ICAP-based web filtering engine. In the past I felt that all existing (open source) web filters had major shortcomings:
- relying completely on URL black-/whitelists totally sucks: the number of false positives and false negatives is extremely high, higher quality blacklists are expensive, and you never know what the political/commercial/whatever interests of the people/institutions putting sites on these blacklists are.
- the only free content filtering solution I know about is dansguardian, which relies on proxy chaining, and that sucks when it comes to authentication. This approach is also not as flexible as I'd like it to be. The licensing terms are imho a bit too restrictive.
- none of the solutions I know are really configurable at run time. In production use I need the possibility to make online changes to the black-/whitelists and/or wordlists without disrupting connections or increasing latency (I'd consider writing to configuration files and/or black-/whitelists from a webapp unacceptable).
So I'm trying to build a solution with the following properties:
- uses the ICAP standard (squid3 is coming...)
- will probably be licensed under the GPL
- will be a hybrid solution combining the results of content analysis and URL filtering
- content analysis should include reliable word/phrase matching as well as parsing of PICS tags.
- will be based on scores (one for each category) which will be used to match a profile of allowed sites
- contains efficient "database engines" for the different datatypes used - each of them manageable in real-time through an RPC (XMLRPC for now) interface.
- should be extremely fast (it already uses threads and asynchronous I/O) and scalable
- ... will integrate with phpSATk as administrative interface
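To make the score/profile idea and the "manageable in real-time through RPC" point a bit more concrete, here is a minimal Python sketch. It is not code from the prototype: the names `ScoreStore`, `set_score`, `remove_host`, `lookup`, `is_allowed` and `serve_admin`, the host-keyed table, and the "deny if any category score exceeds the profile limit" rule are all assumptions made up for illustration.

```python
import threading
from typing import Dict
from xmlrpc.server import SimpleXMLRPCServer

Scores = Dict[str, int]  # category name -> score


class ScoreStore:
    """Hypothetical in-memory table: host -> {category: score}.

    A lock guards every access so the table can be changed while
    filter threads are reading it.
    """

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._scores: Dict[str, Scores] = {}

    def set_score(self, host: str, category: str, score: int) -> bool:
        with self._lock:
            self._scores.setdefault(host, {})[category] = score
        return True  # XML-RPC cannot marshal None unless allow_none is set

    def remove_host(self, host: str) -> bool:
        with self._lock:
            return self._scores.pop(host, None) is not None

    def lookup(self, host: str) -> Scores:
        with self._lock:
            return dict(self._scores.get(host, {}))  # copy, safe outside the lock


def is_allowed(scores: Scores, profile: Scores) -> bool:
    """Allow unless some category score exceeds the profile's limit.

    Categories the profile does not mention are treated as unrestricted
    (an assumption, the post does not say how they are handled).
    """
    return all(score <= profile[cat]
               for cat, score in scores.items() if cat in profile)


def serve_admin(store: ScoreStore, host: str = "127.0.0.1", port: int = 8000) -> None:
    """Expose the store's public methods over XML-RPC (blocks forever)."""
    server = SimpleXMLRPCServer((host, port), logRequests=False)
    server.register_instance(store)
    server.serve_forever()
```

With `serve_admin` running in its own thread, an administrative frontend could change scores without a restart, for example through `xmlrpc.client.ServerProxy("http://127.0.0.1:8000").set_score("example.org", "adult", 80)`.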
So far I have a working prototype which can already deny/allow access based on server IP (this is really useful for some popular sites which have hundreds of aliases), host/domain name and regular expressions. Parsing PICS tags and reliable, fast phrase/word matching are both very hard to implement, so it may take some time until I can show something working.
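The three checks the prototype already does could look roughly like the following Python sketch. Again this is only an illustration under assumptions, not the actual implementation: `RuleSet`, `domain_matches` and `is_denied` are hypothetical names, the addresses come from the documentation ranges, and the rule that a blocked domain also covers its subdomains is my own choice here.

```python
import ipaddress
import re
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class RuleSet:
    """Hypothetical container for the three kinds of deny rules."""
    blocked_networks: list = field(default_factory=list)  # ipaddress network objects
    blocked_domains: Set[str] = field(default_factory=set)
    blocked_patterns: List[re.Pattern] = field(default_factory=list)


def domain_matches(host: str, domains: Set[str]) -> bool:
    """True if host equals a blocked domain or is a subdomain of one."""
    labels = host.lower().rstrip(".").split(".")
    # Try "www.example.net", then "example.net", then "net".
    return any(".".join(labels[i:]) in domains for i in range(len(labels)))


def is_denied(rules: RuleSet, server_ip: str, host: str, url: str) -> bool:
    """Check server IP first, then host/domain name, then URL regexes."""
    ip = ipaddress.ip_address(server_ip)
    if any(ip in net for net in rules.blocked_networks):
        return True  # one rule catches every alias served from this address
    if domain_matches(host, rules.blocked_domains):
        return True
    return any(p.search(url) for p in rules.blocked_patterns)


# Example rules (all values are made up for illustration).
rules = RuleSet(
    blocked_networks=[ipaddress.ip_network("203.0.113.0/24")],
    blocked_domains={"example.net"},
    blocked_patterns=[re.compile(r"/ads?/")],
)
```

The IP check is what makes sites with hundreds of aliases manageable: whatever host name the client asked for, a single network entry matches as long as the request ends up at that server.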
I'll announce this thingy (and the negotiate/GSS helper) to the squid users/dev soon so maybe somebody volunteers to contribute.
Update: squid-3.0-stable1 has been released ... it's time to get this thingy working ... the ICAP protocol implementation is more or less feature complete now (previews and persistent connections are implemented)