Too many anigrams

Question: In the game of Anigrams, you unscramble successively larger, nested collections of letters to create a valid “chain” of six English words between four and nine letters in length.

For example, a chain of five words (sadly, less than the six needed for a valid game of Anigrams) can be constructed using the following sequence, with each term after the first including one additional letter than the previous term:

DEIR (which unscrambles to make the words DIRE, IRED or RIDE) DEIRD (DRIED or REDID) DEIRDL (DIRLED, DREIDL or RIDDLE) DEIRDLR (RIDDLER) DEIRDLRS (RIDDLERS) What is the longest chain of such nested anagrams you can create, starting with four letters?

For specificity, all valid words must come from Peter Norvig’s word list (a list we’ve used previously here at The Riddler).

Extra credit: How many possible games of Anigrams games are there? That is, how many valid sets are there of four initial letters, and then five more letters added one at a time in an ordered sequence, that result in a sequence of valid anagrams? (Note: Swapping the order of the first four letters does not result in a distinct game.)

(FiveThirtyEight)

Solution

The sketch shows a winning sequence of plays for a game of anigrams on the left, along with the underlying game on the right. For any given round, there may be multiple words that can be formed from the letter set.

To build a valid game of anigrams, we can pick a set of letters from a $4$-letter word, then add letters one at a time such that, at every stage, one or more real words can be formed from the letters.

Algorithm

This suggests an efficient way to build anigram games.

turn all $4$-letter words into barcodes and add them to a set good_barcodes.
go through each $5$-letter word, remove one letter at a time to form the barcodes that could have preceded it, and check if any of them are already in the set of good_barcodes. If at least one is, barcode the $5$-letter word and add it to the list of good_barcodes.
repeat the process for all $6$-letter words, $7$-letter words, and so on.

This ensures that all barcodes in the set have an ancestor in the set and therefore could be in a game of anigrams.

At the end of this process, the longest barcode(s) in good_barcodes will be the anigram game(s) we’re looking for.

But knowing that the game exists is one thing — we’d also like to efficiently recreate it. To do this, we’ll log a precursor word for each of a word’s valid precursor-barcodes in the map precursors.

Coding

To start, we need to write a function to turn any word into the corresponding barcode

def word_to_barcode(word):
  return "".join(sorted(word))

and initialize the set good_barcodes with the barcodes

# list of words of length 4
initial_words = [word for word in words if len(word) == 4]

# keep track of which words have sub-barcodes
good_barcodes = set(word_to_barcode(word) for word in initial_words)

To convert from barcodes back to words, we make a map from barcodes to words

from typing_extensions import DefaultDict
barcode_to_word = DefaultDict()

for word in words:
  barcode_to_word[word_to_barcode(word)] = word

Now, we build the set good_barcodes with a linear pass over words

# store precursor words
precursors = DefaultDict(lambda: set())

for word in words:
  
  # make potential precursor barcodes for each word by removing each letter
  precursor_barcodes = [word_to_barcode(word[0:i-1] + word[i:]) for i in range(1, len(word)+1)]
  
  # loop over the potential precursor-barcodes to see if any already exist in good_barcodes
  # if so, add current barcode to good_barcodes, log a sample precursor-word in precursors
  for precursor_barcode in precursor_barcodes:
  
    if precursor_barcode in good_barcodes:
      good_barcodes.add(word_to_barcode(word))
      precursors[word].add(barcode_to_word[precursor_barcode])

With this in hand, we can find the barcodes of maximum length

longest = max(good_barcodes, key=lambda x: len(x))
longest_barcodes = [bc for bc in good_barcodes if len(bc) == len(longest)]
longest_words = [w for w in words if word_to_barcode(w) in longest_barcodes]
list(zip(longest_barcodes, longest_words))

which produces

[('adeeiiimnnnorstt', 'indeterminations'),
 ('adeeiimnnorssttu', 'underestimations')]

So, there are two longest anigram games, terminating in the $16$-letter words indeterminations and underestimations.

We can walk backwards through precursors to resurrect sample gameplay.

For indeterminations we find

['indeterminations',
 'intermediations',
 'determinations',
 'antimodernist',
 'terminations',
 'nitrosamine',
 'antinomies',
 'nominates',
 'sonatine',
 'stanine',
 'anenst',
 'anent',
 'neat']

and for underestimations

 ['underestimations',
 'underestimation',
 'determinations',
 'antimodernist',
 'terminations',
 'nitrosamine',
 'antinomies',
 'nominates',
 'sonatine',
 'stanine',
 'anenst',
 'anent',
 'neat']

The games are identical up until the second to last word, where adding a u or an i is the determining step. At the outer limits of the game, we should expect exceptional sequences of mutations. Perhaps it’s not so surprising that the two longest games have the same origin.

Extra credit

We can use the precursors map to count the number of valid anigram games starting from a given terminal word. We just recurse through precursors and add up the number of possible paths, returning 1 when we reach a $4$-letter word. This is the proper counting because precursors only stores one representative precursor word for each precursor barcode (e.g. it doesn’t log both care and race).

from functools import lru_cache

@lru_cache(maxsize=1_000_000)
def count_anigrams(word):
  # count the number of anigram paths for a given word
  if len(word) == 4:
    return 1
  else:
    return sum(count_anigrams(p) for p in precursors[word])

Summing over all possible terminal words at each stage, there are $4,510,515$ possible regulation games of anigrams, and the number of distinct games hits a max around terminal words of length $11$ or $12$.

For the two longest games (ending in indeterminations and underestimations), the possibilities split up into $1,071$ possibilities apiece.

\[\begin{array}{l|l} \text{Terminus length} & \text{Distinct games} \\ \hline 4 & 2674 \\ 5 & 15915 \\ 6 & 82932 \\ 7 & 403416 \\ 8 & 1603310 \\ 9 & 4510515 \\ 10 & 8660949 \\ 11 & 11437610 \\ 12 & 10826426 \\ 13 & 6614053 \\ 14 & 2394474 \\ 15 & 623854 \\ 16 & 2142 \\ 17 & 0 \end{array}\]

Since there are no games of length $17,$ there can’t be games of length greater than $17,$ and $16$ is the maximum.