2023-01-15

spwnn - fun with letter pairs

Below is a pointer to a PDF of the neuron sizes.  The "neuron size" of a letter pair is the number of words in the dictionary that contain that pair.

(The leftmost column gives the first letter; scan across the row to find the second letter.)

https://s3.amazonaws.com/above-the-garage.com/spwnnStuff/neuronSizes-49K-2023-01-14f.pdf

The print is small so it fits on a page - you can zoom in with your browser.

Lots of analysis can be done with this simple document (the source is in Excel).
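If you'd rather compute those counts yourself from a plain word list, here's a minimal sketch in Python (assuming a lowercase, one-word-per-line file called words.txt as a stand-in for the 49K dictionary):

from collections import Counter

def pair_counts(words):
    # For each adjacent letter pair, count how many words contain it.
    # Each word contributes at most once per pair, matching the
    # "number of words that contain that letter pair" definition above.
    counts = Counter()
    for word in words:
        counts.update({word[i:i + 2] for i in range(len(word) - 1)})
    return counts

# words.txt is a placeholder: one lowercase word per line.
with open("words.txt") as f:
    words = [line.strip().lower() for line in f if line.strip()]

counts = pair_counts(words)
print(counts.most_common(10))   # the ten biggest "neurons"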

For instance - what letter pairs are never used in my 49K dictionary?

The highlighted cells have no words assigned to them.

One can identify "unpopular" letter pairs. I don't see an overall pattern, except that some letters are unpopular no matter what other letter they pair with.
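Here's a quick sketch of pulling those out programmatically, using the same pair_counts idea as above (words.txt is again a stand-in for the real dictionary):

import string
from collections import Counter

def pair_counts(words):
    counts = Counter()
    for word in words:
        counts.update({word[i:i + 2] for i in range(len(word) - 1)})
    return counts

with open("words.txt") as f:   # placeholder word list
    words = [line.strip().lower() for line in f if line.strip()]

counts = pair_counts(words)
letters = string.ascii_lowercase

# Letter pairs that appear in no word at all - the highlighted cells.
unused = [a + b for a in letters for b in letters if counts[a + b] == 0]
print("never used:", unused)

# The twenty least popular pairs that do appear at least once.
rare = sorted(counts, key=counts.get)[:20]
print("rarest:", [(p, counts[p]) for p in rare])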

For fun, suppose we are authoring a book and we want to invent some names that clearly signal to the reader they belong to an alien dialect.  (Like multiverse shows, where the alternate universe always has floating airships.)

Some infrequently (or never) occurring letter pairs could be used to indicate alien words.

"jl" has no words in my dictionary.  Let's make up a character name using that fact: "Bojlus".  Clearly alien.

The letter 'q' is fantastically unpopular in my dataset, but let's not use that: it's too obvious, and it's picking up other usages in contemporary life.  How about ... "vf"?  Another character name: "Savfin".

It works, right?
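If you wanted to automate the gag, here's a tiny sketch; the never-used pairs are the two called out above, and the name templates and random choice are my own invention, not anything spwnn does:

import random

unused_pairs = ["jl", "vf"]                           # pairs with zero words in the dictionary
templates = ["Bo{}us", "Sa{}in", "Ka{}or", "Me{}a"]   # "{}" is where the alien pair goes

def alien_name():
    return random.choice(templates).format(random.choice(unused_pairs))

print(alien_name())   # e.g. "Bojlus" or "Savfin"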

How about coloring by overall popularity?

In that chart, each cell is shaded by the number of words attached to it.  Most cells are so dark you can't read the number - but a few pairs are super popular, by a couple of orders of magnitude.

The most popular letter pair is "s_" or "s at the end of a word."  Next up in popularity is "in".
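To reproduce that ranking, the counting sketch above just needs the end-of-word marker added; here I'm assuming "_" marks the end of a word, the same way the chart does:

from collections import Counter

with open("words.txt") as f:   # placeholder word list
    words = [line.strip().lower() for line in f if line.strip()]

counts = Counter()
for word in words:
    marked = word + "_"        # "_" marks the end of the word
    counts.update({marked[i:i + 2] for i in range(len(marked) - 1)})

# On an English word list like this, "s_" and "in" should land at or near the top.
print(counts.most_common(10))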

Interestingly, GPT-3, the model behind the currently famous ChatGPT, rarely uses individual letters.  spwnn tokenizes into letter pairs; GPT tokenizes into chunks of varying length - single letters, pairs, triplets, and quadruplets.  You can see how it tokenizes the alphabet with their online tool:

Link to tokenizer doc: https://beta.openai.com/tokenizer

Their documentation, linked above (and liable to disappear in the future, so I've copied the key bit), says:

"A helpful rule of thumb is that one token generally corresponds to ~4 characters of text for common English text. This translates to roughly ¾ of a word (so 100 tokens ~= 75 words).

If you need a programmatic interface for tokenizing text, check out the transformers package for python or the gpt-3-encoder package for node.js."
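Here's a hedged sketch of checking that from Python with the transformers package mentioned in the quote; as far as I know the GPT-2 tokenizer uses the same BPE vocabulary as the original GPT-3 models, so it's a reasonable stand-in for the online tool:

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# The alphabet comes back as a handful of short multi-letter chunks,
# not 26 single-letter tokens.
alphabet = "abcdefghijklmnopqrstuvwxyz"
print(tokenizer.tokenize(alphabet))

# Sanity-check the "~4 characters per token" rule of thumb on a sample sentence.
sample = "A helpful rule of thumb is that one token generally corresponds to four characters."
ids = tokenizer.encode(sample)
print(len(sample) / len(ids), "characters per token")

Both spwnn's letter pairs and GPT's BPE chunks are just different ways of cutting words into sub-word pieces.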

