Checklist: Best Practices for Handling String Data (SBP)

SBP1

When reading text, specify the correct text encoding to use.

The Python chardet library can help infer encodings.
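A minimal sketch of why the encoding matters on read: the same bytes decode to different text under different encodings, and the wrong choice often raises no error. The sample text and the two encodings compared are illustrative assumptions.

```python
# Illustrative only: the sample text and encodings are assumptions.
raw = "café".encode("utf-8")     # bytes as they might arrive from a file

right = raw.decode("utf-8")      # correct encoding recovers the text
wrong = raw.decode("latin-1")    # wrong encoding: mojibake, but no exception

print(right)   # café
print(wrong)   # cafÃ©
```

With files, the same choice is made through the encoding parameter of open(); if it is omitted, Python may fall back to a platform-dependent default.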

SBP2

When writing text, specify the text encoding actually used.

When in doubt, use UTF-8.
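As a sketch, writing with an explicit encoding looks like this. The file name and sample text are hypothetical; the point is only that the encoding is stated at the write and matched at the read.

```python
import os
import tempfile

text = "naïve café ☕"

# Hypothetical file in a temporary directory, for illustration only.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.txt")

    # State the encoding explicitly rather than relying on a default.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

    # Read back with the same encoding that was written.
    with open(path, encoding="utf-8") as f:
        round_tripped = f.read()

print(round_tripped == text)  # True
```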

SBP3

Encode and decode as close to I/O as possible.

See ☞ Unicode sandwich.
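One way to picture this: bytes are decoded at the input edge, all processing happens on str, and encoding happens only at the output edge. The function name shout and the choice of upper-casing as the "work" are hypothetical, chosen only to keep the sketch short.

```python
def shout(data: bytes, encoding: str = "utf-8") -> bytes:
    """Hypothetical example of bytes in, str inside, bytes out."""
    text = data.decode(encoding)      # decode once, at the input edge
    result = text.upper()             # all processing is done on str
    return result.encode(encoding)    # encode once, at the output edge


print(shout("héllo".encode("utf-8")).decode("utf-8"))  # HÉLLO
```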

SBP4

Consider normalizing text on input (probably to NFC or NFKC).

Handle problematic code points (Bray & Hoffman, 2025).
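A minimal sketch of both steps using the standard unicodedata module. The helper names are hypothetical, and the classes treated as problematic here (surrogates, control characters other than tab, LF and CR, and noncharacters) are an assumption about what the cited reference covers, not a restatement of it.

```python
import unicodedata


def normalize_input(text: str, form: str = "NFC") -> str:
    """Normalize text to the given Unicode normalization form."""
    return unicodedata.normalize(form, text)


def problematic_code_points(text: str):
    """Return (index, 'U+XXXX') pairs for code points flagged as problematic.

    Assumption: 'problematic' means surrogates, control characters other
    than tab/LF/CR, and noncharacters.
    """
    found = []
    for i, ch in enumerate(text):
        cp = ord(ch)
        cat = unicodedata.category(ch)
        is_surrogate = cat == "Cs"
        is_control = cat == "Cc" and ch not in "\t\n\r"
        # Noncharacters: U+FDD0..U+FDEF and the last two code points of each plane.
        is_nonchar = 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE
        if is_surrogate or is_control or is_nonchar:
            found.append((i, f"U+{cp:04X}"))
    return found


print(normalize_input("e\u0301") == "\u00e9")     # True: NFC composes e + accent
print(normalize_input("\ufb01", "NFKC"))          # fi: NFKC expands the ligature
print(problematic_code_points("a\x00b\ufffe\n"))  # [(1, 'U+0000'), (3, 'U+FFFE')]
```

Whether flagged code points should be rejected, replaced or merely logged depends on the application; the sketch only reports them.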

SBP5

If necessary, consider normalizing text before output (probably to NFC).

SBP6

Consider further string normalizations on read if useful (e.g. to ☞ NFTK).

SBP7

Consider normalizing when performing string comparisons.
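A sketch of why this matters: two strings that render identically can differ code point by code point, so a plain == comparison reports them as different. The helper same_text and its parameters are hypothetical.

```python
import unicodedata


def same_text(a: str, b: str, form: str = "NFC", ignore_case: bool = False) -> bool:
    """Compare two strings after Unicode normalization (hypothetical helper)."""
    a_n = unicodedata.normalize(form, a)
    b_n = unicodedata.normalize(form, b)
    if ignore_case:
        # casefold() can undo normalization, so normalize again afterwards.
        a_n = unicodedata.normalize(form, a_n.casefold())
        b_n = unicodedata.normalize(form, b_n.casefold())
    return a_n == b_n


composed = "caf\u00e9"      # é as a single code point
decomposed = "cafe\u0301"   # e followed by a combining acute accent

print(composed == decomposed)            # False
print(same_text(composed, decomposed))   # True
```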

SBP8

Note: string length and glyph count are not always the same.

e.g. len('👩‍❤️‍💋‍👨') = 10; tdda.utils.number_of_glyphs('👩‍❤️‍💋‍👨') = 1.
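The difference can be seen with the standard library alone. Python's len counts code points, so the result depends on exactly which code points the sequence contains: the basic kiss sequence has 8, and the same sequence with a skin-tone modifier on each person has 10. The counter below is a rough approximation written for this sketch (it is not the tdda function and not full Unicode text segmentation): it skips combining marks, variation selectors and skin-tone modifiers, and treats anything joined by a zero-width joiner as part of the preceding glyph.

```python
import unicodedata

ZWJ = "\u200d"  # zero-width joiner


def approx_glyph_count(s: str) -> int:
    """Rough glyph count (an approximation, assumed adequate for simple cases)."""
    count = 0
    joined = False  # True when the previous code point was a ZWJ
    for ch in s:
        if ch == ZWJ:
            joined = True
            continue
        extends_previous = (
            unicodedata.category(ch) in ("Mn", "Me", "Mc")  # marks, variation selectors
            or "\U0001F3FB" <= ch <= "\U0001F3FF"           # skin-tone modifiers
        )
        if not extends_previous and not joined:
            count += 1
        joined = False
    return count


# woman, ZWJ, heart, VS16, ZWJ, kiss mark, ZWJ, man
basic = "\U0001F469\u200D\u2764\uFE0F\u200D\U0001F48B\u200D\U0001F468"
# the same sequence with a skin-tone modifier after each person
toned = "\U0001F469\U0001F3FD\u200D\u2764\uFE0F\u200D\U0001F48B\u200D\U0001F468\U0001F3FD"

print(len(basic), approx_glyph_count(basic))   # 8 1
print(len(toned), approx_glyph_count(toned))   # 10 1
```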