- is a phonetic alphabet a better representation for text understanding?
- reduces dimensionality compared to characters
- stemming is implied by shared phonetics
practicalities
- network for ingesting hybrid input
- perf on characters
- perf on hybrid characters only
- perf on hybrid input
- get raw text
- function to go from raw text to representation
- tokenizer
- spaces between alphabetic words
- dict improvements
- add numbers [spell out]
- add years [spell out]
- add alphabetic words []
- add accented words
- tokenization (apostrophes, hyphens, abbreviations)
- odd tokens [br]
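
A minimal sketch of the raw-text-to-representation step listed above, under several assumptions that are not in these notes: "hybrid" is taken to mean phonemes for words found in a pronunciation dictionary and plain characters for everything else; `TOY_DICT` is a hypothetical handful of entries in ARPAbet-like notation standing in for a real dictionary; and `spell_number`, `spell_year`, `expand`, `to_hybrid`, and the `<sp>` space symbol are illustrative names. Phonemes are kept uppercase and fallback characters lowercase so the two symbol sets do not collide.

```python
import re

# Hypothetical toy pronunciation dictionary (ARPAbet-like, stress dropped).
# A real run would load a full pronunciation dictionary instead.
TOY_DICT = {
    "the": ["DH", "AH"],
    "cat": ["K", "AE", "T"],
    "two": ["T", "UW"],
    "too": ["T", "UW"],
    "forty": ["F", "AO", "R", "T", "IY"],
}

ONES = ("zero one two three four five six seven eight nine ten eleven twelve "
        "thirteen fourteen fifteen sixteen seventeen eighteen nineteen").split()
TENS = "_ _ twenty thirty forty fifty sixty seventy eighty ninety".split()

# Alphabetic words (inner apostrophes/hyphens kept), digit runs,
# or any other single non-space symbol.
TOKEN_RE = re.compile(r"[A-Za-z]+(?:['-][A-Za-z]+)*|\d+|\S")


def spell_number(n):
    """Spell out 0..99 as a list of words."""
    if n < 20:
        return [ONES[n]]
    words = [TENS[n // 10]]
    if n % 10:
        words.append(ONES[n % 10])
    return words


def spell_year(n):
    """Spell a four-digit number as two halves, e.g. 1984 -> nineteen eighty four."""
    hi, lo = divmod(n, 100)
    if lo == 0:
        return spell_number(hi) + ["hundred"]
    if lo < 10:
        return spell_number(hi) + ["oh", ONES[lo]]
    return spell_number(hi) + spell_number(lo)


def expand(token):
    """Turn digit tokens into words; leave everything else unchanged.

    Assumption: every four-digit number is read as a year; 100..999 and
    longer digit runs are left as-is and fall back to characters.
    """
    if token.isdecimal():
        n = int(token)
        if n < 100:
            return spell_number(n)
        if 1000 <= n <= 9999:
            return spell_year(n)
    return [token]


def to_hybrid(text):
    """Map raw text to a flat symbol list: phonemes for dictionary words,
    lowercase characters otherwise, "<sp>" where the source had whitespace."""
    symbols = []
    prev_end = None
    for match in TOKEN_RE.finditer(text):
        # A gap between consecutive matches can only be whitespace.
        if prev_end is not None and match.start() > prev_end:
            symbols.append("<sp>")
        prev_end = match.end()
        for i, word in enumerate(expand(match.group())):
            if i:
                symbols.append("<sp>")  # separate spelled-out number words
            key = word.lower()
            symbols.extend(TOY_DICT.get(key, list(key)))
    return symbols


print(to_hybrid("The cat, 42!"))
```

In this sketch "two" and "too" map to the same phoneme sequence, while a token such as "don't" that is missing from the toy dictionary falls back to characters. Swapping the `TOY_DICT.get(...)` line for `list(key)` gives a characters-only encoding of the same tokens for comparison.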