The set of English puns is finite and enumerable

…so I decided to enumerate a subset of them. The particular form of pun I’m interested in is:

What’s the difference between a _____ and a _____? One is a xA B; the other is a A xB.

For example, what’s the difference between a skinny Spaniard and a skinny Russian? One is a slight Iberian; the other is a light Siberian.

To generate these, I took a word list of about 40,000 English words. It’s straightforward to find pairs of words matching the (xA, B, A, xB) template, as long as you choose the right data structures and pay attention to efficiency. But then we need to narrow those pairs down so that

  • xA and A are adjectives; xB and B are nouns, or
  • xA and A are nouns; xB and B are adjectives.
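The template-matching step on its own can be sketched without a scan over all pairs of words. This is a minimal illustration, not the script I ran: the function name `find_template_pairs` and the tiny word list are made up for the example, and skipping the trivial case B = A is an extra assumption. The idea is to index every word whose tail is also a word by its first letter, so that each xA is compared only with tails that share its leading letter.

```python
from collections import defaultdict

def find_template_pairs(words):
    """Return sorted (xA, B) pairs such that xA, A, B and x+B are all words."""
    wordset = set(words)

    # tails_by_letter[x] lists every tail T where both x+T and T are words.
    tails_by_letter = defaultdict(list)
    for w in wordset:
        if len(w) > 1 and w[1:] in wordset:
            tails_by_letter[w[0]].append(w[1:])

    pairs = []
    for xA in wordset:
        if len(xA) < 2:
            continue
        x, A = xA[0], xA[1:]
        if A not in wordset:
            continue
        for B in tails_by_letter[x]:
            if B != A:  # assumption: skip the trivial pairing of a word with its own tail
                pairs.append((xA, B))
    return sorted(pairs)

# Toy word list, for illustration only
print(find_template_pairs(["slight", "light", "iberian", "siberian", "cat"]))
```

Each matching set of four words shows up twice here, once starting from each of the two x-words, so the part-of-speech filter that follows sees both orderings.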

For this second task, I used WordNet, a lexical database from Princeton University. The Natural Language Toolkit (NLTK) provides helpful Python bindings to the necessary WordNet functions. Here is my Python script:

from nltk.corpus import wordnet as wn

f = open('wlist_match10.txt')
allpuns = open('all_puns.txt', 'w')

wordlist = []
wordset = set()
for line in f:
  word = line.strip()  # drop the trailing newline
  if word:
    wordlist.append(word)
    wordset.add(word)
f.close()
N = len(wordlist)
# 22,282 words
# 113,135 permutation pairs with xA, B compatibility
# means 0.02% of permutations have xA,B compatibility
# fewer have part-of-speech compatibility as well

def isAdj(w):
  return len(wn.synsets(w, pos=wn.ADJ)) > 0
def isNoun(w):
  return len(wn.synsets(w, pos=wn.NOUN)) > 0
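def candidate(xA, B):
  x = xA[0]
  A = xA[1:]
  xB = x + B
  # adjective + noun on both sides, e.g. slight Iberian, light Siberian
  if (isNoun(B) and isNoun(xB) and isAdj(A) and isAdj(xA)):
    pun = xA + ' ' + B + ', ' + A + ' ' + xB
    allpuns.write(pun + '\n')
  # same template with the roles swapped: B and xB are the adjectives
  if (isAdj(B) and isAdj(xB) and isNoun(A) and isNoun(xA)):
    pun = xB + ' ' + A + ', ' + B + ' ' + xA
    allpuns.write(pun + '\n')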

for j in range(N):
  xA = wordlist[j]
  x = xA[0]
  A = xA[1:]
  # skip words whose tail is not itself a word
  if (A not in wordset):
    continue
  for k in range(N):
    B = wordlist[k]
    if (x + B in wordset):
      candidate(xA, B)

allpuns.close()


# 6,696 results!

I didn’t bother to generate the full version of each pun using synonyms of the four words. It shouldn’t be too hard to do so, however. The script outputs a text file containing all 6,696 possible puns of the desired form.
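For the curious, here is a rough sketch of what that synonym step might look like. It is not part of my script: the function name `expand_setups` and the hand-written synonym table are hypothetical, and in practice the table could be filled from WordNet (for instance from the lemma names of each word's synsets). It also ignores the a/an distinction.

```python
from itertools import product

def expand_setups(pun, synonyms):
    """Yield one question per combination of synonyms for the four pun words.

    pun is (adj1, noun1, adj2, noun2); synonyms maps a word to a list of
    alternatives. Words missing from the table are used unchanged.
    """
    options = [synonyms.get(word, [word]) for word in pun]
    for adj1, noun1, adj2, noun2 in product(*options):
        yield f"What's the difference between a {adj1} {noun1} and a {adj2} {noun2}?"

# Hypothetical hand-written table; the entries come from the example pun above.
table = {
    "slight": ["skinny"],
    "light": ["skinny"],
    "Iberian": ["Spaniard"],
    "Siberian": ["Russian"],
}
for question in expand_setups(("slight", "Iberian", "light", "Siberian"), table):
    print(question)
```

With several synonyms per word the number of questions is the product of the list lengths, so most of the real work would be picking the combinations that actually sound natural.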

Here are some highlights:

  • residential preparations, presidential reparations
  • dinky rain, inky drain
  • revolutionary ages, evolutionary rages
