The Great Firewall of Santa Cruz

|The Volokh Conspiracy |

A funny assignment from the Univerity of California at Santa Cruz's Computer Science 13H course (Introduction to Programming/Data Structures), taught by Professor Darrell Long; see the PDF original if some of the subscripts, superscripts, and other features don't come through well here:

Hashing and Bloom Filters

You have been selected through thoroughly democratic processes (and the machinations of your friend Ernst Blofeld) to be the Dear and Beloved Leader of the Glorious People's Republic of Santa Cruz following the failure of the shorted-lived anarcho-syndicalist commune, where each person in turn acted as a sort of executive officer for the week. In order to promote virtue and prevent vice, and to preserve social cohesion and discourage unrest, you have decided that the Internet content must be filtered so that your beloved children are not corrupted through the use of unfortunate, hurtful, offensive, and far too descriptive language.

Bloom Filters

The Internet is very large, very fast, and filled with degeneracy. The proliferation of foreign persons who not only do not speak English, they use it in corrupt and deviant ways. The untermenschen spend their days sending each other cat videos. Your hero, B.B. decided that in his country of Oceania that a more neutral newspeak was required to keep the people content, pure, and from thinking too much. But how do you process so many words as they flow into your little country at 10 Gbits/second? The answer that comes to your brilliant and pure mind is that you use a Bloom Filter.

A Bloom Filter is a hash table, where the entries in the hash table are simply single bits. Consider hb(text) = k, then if Bk = 0 then the entry is definitely missing; if Bk = 1 then the entry may be present. The later is called a false positive, and the false positive rate depends on the size of the Bloom Filter and the number of texts that hash to the same position (hash collisions).

You decide to use bit fields in order to save space, with each unsigned char capable of holding 8 such bits. You know that you can reference bit Bk as a array in C and decide to provide your minions with these macros:

# define SETBIT(A, k) { A[k >> 3] |= (01 << (k & 07)); }

# define CLRBIT(A, k) { A[k >> 3] &= ~(01 << (k & 07)); }

# define GETBIT(A, k) (A[k >> 3] & (01 << (k & 07))) >> (k & 07)

You assign your programming minions to take every word that you can find in an English dictionary (which conveniently was posted to Piazza), and hash them into the Bloom Filter. That is, for each word the corresponding bit in the filter is set to 1.

When presented with a stream of text, it is first passed to the Bloom Filter. If the Bloom Filter rejects any words then the person responsible is guilty of a thoughtcrime, and that is very ungood indeed and so the Miniluv will caution him. If he does it again, you will send him off to joycamp. If all the words pass the Bloom Filter, then they are passed on to the next phase of processing in order to replace offensive, insensitive, and otherwise dangerous words with new approved words. The advantage is that your government can augment this list at any time via the Minitrue. Minitrue can also remove words from the Bloom Filter that you, in your wisdom, are not wholesome for your people to use.

Hash Table

You decide that the easiest way to make word replacements is through a translation table, with entries of the form: struct word { char *oldspeak char; *newspeak } that you will place into a hash table.

Once a word passes the Bloom Filter and thus is likely to be in your pure form of English, Ingspeak, you will use the hash table to locate that word, and if it is found then the system helpfully provide the appropriate new improved word in its place.

Do What?

  1. Build a Bloom Filter,
  2. Set the bits for all the words in the English list in the Bloom Filter,
  3. Build a hash table,
  4. Insert all of the words and the translations into the hash table, and finally
  5. When presented with a text, indicate whether
    • A throughtcrime has occurred, and in either case,
    • Provide helpful translations into newspeak that for words that where they exist.