Category Archives: Linguistics

A joke which requires lack of knowledge to understand

Usually “getting” a joke requires some subject knowledge. For example, you have to know who Michael Jackson is in order to understand Michael Jackson jokes. I came up with a joke which requires partial knowledge as well as partial ignorance. The joke is:

What emoji is missing on Chinese phones?

Answer: 少

The character is “shǎo,” meaning “to lack.” To me it also looks like a smiley face. But to someone who knows the language, it doesn’t look like a smiley face—it looks like shǎo.

I think you can see the predicament. In order to get the joke, you have to know the character, but not well enough to sight-read it. This reminds me of stories about catching Soviet spies with the Stroop effect. Supposedly, suspects were shown Russian color words, each printed in a color that didn’t match the word’s meaning, and were asked to go through the list and name the color each word was printed in. If you can’t read Russian, it’s very easy. But if you can read the words, their meanings interfere and you slow down or stumble. In this way, you can test somebody for a lack of knowledge.

The set of English puns is finite and enumerable

…so I decided to enumerate a subset of them. The particular form of pun I’m interested in is:

What’s the difference between a _____ and a _____? One is a xA B; the other is a A xB.

For example, what’s the difference between a skinny Spaniard and a skinny Russian? One is a slight Iberian; the other is a light Siberian.

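To generate these, I took a word list of about 40,000 English words. It’s straightforward to find pairs of words matching the (xA, B, A, xB) template: for each word xA, strip the first letter x to get A, then look for words B such that xB is also a word. Keeping the words in a set makes each “is this a word?” check fast, which matters because there are hundreds of millions of pairs to try. But then we need to narrow those pairs down so that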

  • xA and A are adjectives; xB and B are nouns, or
  • xA and A are nouns; xB and B are adjectives.
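
The pair-finding step on its own, before any part-of-speech filtering, can be sketched roughly as follows. This is an illustration and not the script used for the results: the function name find_template_pairs is made up for the example, and skipping the trivial case B == A is an assumption of the sketch. It groups candidate B values by first letter so that each xA is only paired with a B for which xB is already known to be a word.

```python
from collections import defaultdict

def find_template_pairs(words):
    """Return sorted (xA, B, A, xB) tuples in which all four strings are words."""
    wordset = set(words)
    # For each first letter x, collect every B such that both B and x + B are words.
    tails_by_letter = defaultdict(list)
    for w in wordset:
        if len(w) > 1 and w[1:] in wordset:
            tails_by_letter[w[0]].append(w[1:])
    pairs = []
    for xA in wordset:
        if len(xA) < 2:
            continue
        x, A = xA[0], xA[1:]
        if A not in wordset:
            continue
        for B in tails_by_letter[x]:
            if B != A:  # assumption: skip the trivial pair where B and A coincide
                pairs.append((xA, B, A, x + B))
    return sorted(pairs)
```

Called on the four words slight, light, siberian and iberian, this returns the two orderings of the Iberian/Siberian example above.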

For this second task, I used WordNet, a lexical database from Princeton University. The Natural Language Toolkit (NLTK) provides helpful Python bindings to the necessary WordNet functions. Here is my Python script:

from nltk.corpus import wordnet as wn

allpuns = open('all_puns.txt', 'w')

# Read one word per line. The list is for iteration;
# the set is for fast membership tests.
wordlist = []
wordset = set()
with open('wlist_match10.txt') as f:
  for line in f:
    word = line.strip()
    if word:
      wordlist.append(word)
      wordset.add(word)
N = len(wordlist)
# 22,282 words
# 113,135 permutation pairs with xA, B compatibility
# means 0.02% of permutations have xA,B compatibility
# fewer have part-of-speech compatibility as well

def isAdj(w):
  return len(wn.synsets(w, pos=wn.ADJ)) > 0
def isNoun(w):
  return len(wn.synsets(w, pos=wn.NOUN)) > 0
def candidate(xA, B):
  # Write the pun to the output file if the parts of speech line up.
  x = xA[0]
  A = xA[1:]
  xB = x + B
  if (isNoun(B) and isNoun(xB) and isAdj(A) and isAdj(xA)):
    pun = xA + ' ' + B + ', ' + A + ' ' + xB
    allpuns.write(pun + '\n')
  if (isAdj(B) and isAdj(xB) and isNoun(A) and isNoun(xA)):
    pun = xB + ' ' + A + ', ' + B + ' ' + xA
    allpuns.write(pun + '\n')

for j in range(N):
  xA = wordlist[j]
  x = xA[0]
  A = xA[1:]
  if (A not in wordset):
    continue  # xA minus its first letter must itself be a word
  for k in range(N):
    B = wordlist[k]
    if (x + B in wordset):
      candidate(xA, B)

allpuns.close()

# 6,696 results!

I didn’t bother to generate the full version of each pun using synonyms of the four words. It shouldn’t be too hard to do so, however. The script outputs a text file containing all 6,696 possible puns of the desired form.
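
For what it’s worth, here is a rough sketch of how that synonym step might look. None of this is in the script above: the helper names wordnet_synonyms, setups and full_pun are hypothetical, and the sketch assumes NLTK’s WordNet corpus is installed. It looks up words that share a WordNet synset with each of the four words, combines them into candidate setups, and fills in the joke template. Picking setups that are actually funny would still be a manual job.

```python
from itertools import product

def wordnet_synonyms(word, pos):
    """Lemma names sharing a WordNet synset with `word`; pos is 'n' or 'a'."""
    # Imported lazily so the helpers below also work without NLTK installed.
    from nltk.corpus import wordnet as wn
    names = set()
    for synset in wn.synsets(word, pos=pos):
        for name in synset.lemma_names():
            names.add(name.replace('_', ' '))
    names.discard(word)
    return sorted(names)

def setups(adjective_synonyms, noun_synonyms):
    """Every 'adjective noun' phrase that could stand in for one half of the pun."""
    return [a + ' ' + n for a, n in product(adjective_synonyms, noun_synonyms)]

def full_pun(setup_one, setup_two, punchline_one, punchline_two):
    """Fill the 'What's the difference between...' template."""
    return ("What's the difference between a {} and a {}? "
            "One is a {}; the other is a {}.").format(
                setup_one, setup_two, punchline_one, punchline_two)
```

For the example pun, full_pun would be called with “skinny Spaniard” and “skinny Russian” as the setups, where a setup is one of the phrases from setups(wordnet_synonyms(adjective, 'a'), wordnet_synonyms(noun, 'n')).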

Here are some highlights:

  • residential preparations, presidential reparations
  • dinky rain, inky drain
  • revolutionary ages, evolutionary rages