
The set of English puns is finite and enumerable

…so I decided to enumerate a subset of them. The particular form of pun I’m interested in is:

What’s the difference between a _____ and a _____? One is a xA B; the other is a A xB.

For example, what’s the difference between a skinny Spaniard and a skinny Russian? One is a slight Iberian; the other is a light Siberian.

To generate these, I took a word list of about 40,000 English words. It’s straightforward to find pairs of words matching the (xA, B, A, xB) template, as long as you keep the words in a hash set so that each "is this a word?" check takes constant time. But then we need to narrow those pairs down so that

  • xA and A are adjectives; xB and B are nouns, or
  • xA and A are nouns; xB and B are adjectives.
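The first step, finding the raw (xA, B, A, xB) pairs before any part-of-speech filtering, can be sketched as follows. This is only one possible arrangement, not necessarily the fastest: `find_pairs` and the eight-word list are made up for illustration, and skipping the degenerate case A = B is my own choice.

```python
from collections import defaultdict

def find_pairs(words):
    """Yield (xA, B, A, xB) tuples in which all four strings are in `words`."""
    wordset = set(words)
    # For each first letter x, collect the tails T such that both T and x+T are words.
    tails_by_letter = defaultdict(list)
    for w in wordset:
        if len(w) > 1 and w[1:] in wordset:
            tails_by_letter[w[0]].append(w[1:])
    # Any two distinct tails sharing the same letter x fit the template.
    for x, tails in tails_by_letter.items():
        for A in tails:
            for B in tails:
                if A != B:
                    yield (x + A, B, A, x + B)

words = ["slight", "light", "siberian", "iberian", "rain", "drain", "dinky", "inky"]
pairs = sorted(find_pairs(words))
for p in pairs:
    print(p)
```

Grouping by first letter means the inner loops only ever touch words that can actually take part in a pair, rather than scanning the whole list for every candidate.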

For this second task, I used WordNet, a lexical database from Princeton University. The Natural Language Toolkit (NLTK) provides helpful Python bindings to the necessary WordNet functions. Here is my Python script:

from nltk.corpus import wordnet as wn

f = open('wlist_match10.txt')
allpuns = open('all_puns.txt', 'w')

wordlist = []
wordset = set()
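for line in f:
  # strip() removes the newline; slicing off the last character would
  # truncate the final word if the file does not end with a newline
  word = line.strip()
  if not word:
    continue  # skip blank lines, which would break the xA[0] lookup below
  wordlist.append(word)
  wordset.add(word)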
N = len(wordlist)
# 22,282 words
# 113,135 permutation pairs with xA, B compatibility
# means 0.02% of permutations have xA,B compatibility
# fewer have part-of-speech compatibility as well

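# A word counts as an adjective (noun) if WordNet has at least one
# adjective (noun) synset for it.
def isAdj(w):
  return len(wn.synsets(w, pos=wn.ADJ)) > 0
def isNoun(w):
  return len(wn.synsets(w, pos=wn.NOUN)) > 0

# Given words xA and B where A and xB are also known to be words,
# write out the pun if the parts of speech line up as adjective + noun.
def candidate(xA, B):
  x = xA[0]
  A = xA[1:]
  xB = x + B
  # Case 1: xA and A are adjectives; B and xB are nouns.
  if (isNoun(B) and isNoun(xB) and isAdj(A) and isAdj(xA)):
    pun = xA + ' ' + B + ', ' + A + ' ' + xB
    allpuns.write(pun+'\n')
  # Case 2: xB and B are adjectives; A and xA are nouns.
  if (isAdj(B) and isAdj(xB) and isNoun(A) and isNoun(xA)):
    pun = xB + ' ' + A + ', ' + B + ' ' + xA
    allpuns.write(pun+'\n')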

for j in range(N):
  xA = wordlist[j]
  x = xA[0]
  A = xA[1:]
  if (A not in wordset):
    continue
  for k in range(N):
    B = wordlist[k]
    if (x + B in wordset):
      candidate(xA, B)

f.close()
allpuns.close()

# 6,696 results!

I didn’t bother to generate the full version of each pun using synonyms of the four words. It shouldn’t be too hard to do so, however. The script outputs a text file containing all 6,696 possible puns of the desired form.
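For the curious, here is a rough sketch of what generating the full version might look like. The `TOY_SYNONYMS` table and the `full_pun` function are hypothetical stand-ins written for this example; a real version would presumably pull synonyms from WordNet rather than from a hand-written dictionary.

```python
# Hypothetical hand-written thesaurus; a real run would query WordNet instead.
TOY_SYNONYMS = {
    "slight": ["skinny"],
    "light": ["skinny"],
    "Iberian": ["Spaniard"],
    "Siberian": ["Russian"],
}

def full_pun(xA, B, A, xB, synonyms):
    """Spell out one pun, describing each half with the first synonym of each word."""
    first = synonyms[xA][0] + " " + synonyms[B][0]
    second = synonyms[A][0] + " " + synonyms[xB][0]
    return ("What's the difference between a {} and a {}? "
            "One is a {} {}; the other is a {} {}.").format(first, second, xA, B, A, xB)

joke = full_pun("slight", "Iberian", "light", "Siberian", TOY_SYNONYMS)
print(joke)
```

Taking the first synonym is the laziest possible choice; picking synonyms that actually read well is the hard part and is not attempted here.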

Here are some highlights:

  • residential preparations, presidential reparations
  • dinky rain, inky drain
  • revolutionary ages, evolutionary rages

Simple text encryption in Python

Here’s a picture:

[Figure: Encryption Diagram]

If we had data consisting entirely of the lowercase letters a, …, f, we could encrypt it using this diagram, provided we were given a key. No information is lost in the process: as long as we remember the net shift that has been applied, we can always recover the original data.
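As a small sketch of the six-letter case, assuming the diagram shows the letters a through f arranged in a circle and the key is the number of steps to move around it (the names `ALPHABET` and `shift_letters` are invented for this example):

```python
ALPHABET = "abcdef"

def shift_letters(text, key):
    """Move each letter `key` steps around the circle a-b-c-d-e-f-a-..."""
    n = len(ALPHABET)
    return "".join(ALPHABET[(ALPHABET.index(c) + key) % n] for c in text)

secret = shift_letters("facade", 2)    # encrypt with key 2
original = shift_letters(secret, -2)   # undo it with the opposite key
print(secret, original)
```

Because the letters wrap around the circle, every shift can be undone by shifting the same number of steps in the other direction.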

To apply this to text files consisting of more characters than the six in the diagram above, we can use decimal ASCII character codes, which are integers ranging from 0 to 127. For example, the code for “Q” is 81, that for “q” is 113, and that for “$” is 36.
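A quick way to check these codes, and to see the wrap-around at 128, is with Python's built-in `ord` and `chr`. The shift of 20 below is an arbitrary value chosen for illustration.

```python
# Look up the decimal ASCII codes mentioned above.
codes = {ch: ord(ch) for ch in ["Q", "q", "$"]}
print(codes)

# Shifting "q" (113) by 20 overshoots 127 and wraps around to 5.
wrapped = (ord("q") + 20) % 128
print(wrapped, repr(chr(wrapped)))
```

Note that a wrapped code can land on a non-printing control character, so encrypted files will generally not look like ordinary text.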

Okay, that’s about it for the conceptual stuff. Now for the nuts and bolts:

# En/decrypts a text file (Python 3)

fname = input('Filename: ')
shift = int(input('Key: '))

# newline='' turns off line-ending translation, which would otherwise
# alter shifted characters that happen to land on \r or \n
f = open(fname, 'r+', newline='')
text = list(f.read())  # split file contents into list of characters

for k in range(len(text)):  # loop through every character in file, including the last
    # use decimal ASCII codes (0-127)
    # shift character code along circle of ...-126-127-0-1-... and convert back to character
    text[k] = chr((ord(text[k]) + shift) % 128)

text = ''.join(text)  # convert en/decrypted list of characters into string
f.seek(0)  # go to start of file
f.write(text)  # overwrite in place; the length is unchanged
f.close()

This script prompts the user for the filename of the text file and for a key, which is the number of positions by which each character code is shifted. Suppose I encrypt a human-readable file with key 53, then encrypt the resulting file with key -21. The net shift is 53 - 21 = 32, so all I need to do to recover the original data is to run the script again, this time with key -32.
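The arithmetic can be checked on an in-memory string instead of a file. This is a minimal sketch; `shift_text` is a helper written for this example, not part of the script above.

```python
def shift_text(text, key):
    """Shift every character code by `key`, wrapping around within 0-127."""
    return "".join(chr((ord(c) + key) % 128) for c in text)

original = "Attack at dawn!\n"
twice = shift_text(shift_text(original, 53), -21)  # encrypt with 53, then with -21
recovered = shift_text(twice, -(53 - 21))          # a single pass with key -32 undoes both
print(recovered == original)
```

Since the codes live on a circle of 128 values, key 96 works just as well as key -32 for the final step.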

Protip: for extra sneakiness, you can give the script and data deceptive file extensions like .mp3 or .jpg.