Checklist: Best Practices for Handling String Data (SBP)
SBP1
When reading text, specify the correct text encoding to use.
The Python chardet library can help infer encodings.
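A minimal sketch of SBP1, assuming a small Latin-1 sample file created just for the demonstration. The helper names read_text and guess_encoding are hypothetical, and chardet is treated as optional: its result is a best guess, not a guarantee.

```python
import tempfile
from pathlib import Path


def read_text(path, encoding):
    """Read a text file with an explicitly stated encoding."""
    with open(path, encoding=encoding) as f:
        return f.read()


def guess_encoding(raw):
    """Return chardet's best guess for raw bytes, or None if chardet is not installed."""
    try:
        import chardet  # third-party and optional in this sketch
    except ImportError:
        return None
    return chardet.detect(raw)["encoding"]


# Illustrative file: "café" stored as Latin-1, so é is the single byte 0xE9.
sample = Path(tempfile.mkdtemp()) / "latin1.txt"
sample.write_bytes("café".encode("latin-1"))

# Correct encoding: the text round-trips cleanly.
text = read_text(sample, "latin-1")

# Wrong encoding: a lone 0xE9 byte is not valid UTF-8, so decoding fails loudly.
try:
    read_text(sample, "utf-8")
    wrong_encoding_failed = False
except UnicodeDecodeError:
    wrong_encoding_failed = True
```

Not every wrong encoding fails this visibly: reading UTF-8 bytes as Latin-1 succeeds silently and produces mojibake, which is one reason to state the encoding rather than rely on a default.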
SBP2
When writing text, specify the text encoding actually used.
When in doubt, use UTF-8.
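A short sketch of SBP2: pass the encoding explicitly when opening a file for writing instead of relying on the platform default. The helper name write_text and the sample string are hypothetical.

```python
import tempfile
from pathlib import Path


def write_text(path, text, encoding="utf-8"):
    """Write text with an explicit encoding; newline="" avoids newline translation."""
    with open(path, "w", encoding=encoding, newline="") as f:
        f.write(text)


out = Path(tempfile.mkdtemp()) / "out.txt"
write_text(out, "naïve €5\n")

# Inspect the bytes actually written to confirm they are UTF-8.
raw = out.read_bytes()
round_trip = raw.decode("utf-8")
```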
SBP3
Encode and decode as close to I/O as possible: see ☞ Unicode sandwich.
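One way to picture SBP3 is a function that decodes bytes on the way in, works only with str in the middle, and encodes on the way out. The function name shout and its upper-casing step are hypothetical stand-ins for real processing.

```python
def shout(raw: bytes, in_encoding: str, out_encoding: str = "utf-8") -> bytes:
    """Decode at the input edge, process as str, encode at the output edge."""
    text = raw.decode(in_encoding)       # bytes -> str, as close to input as possible
    result = text.upper()                # all processing happens on str
    return result.encode(out_encoding)   # str -> bytes, as close to output as possible


# Latin-1 bytes in, UTF-8 bytes out.
converted = shout(b"caf\xe9", "latin-1")
```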
SBP4
Consider normalizing text on input (probably to NFC or NFKC).
Handle problematic code points (Bray & Hoffman, 2025).
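A sketch of SBP4 using the standard unicodedata module. The helper names are hypothetical, and what counts as "problematic" below (surrogates, noncharacters, and control characters other than tab, LF, and CR) is an assumption made for illustration, not a summary of the cited reference.

```python
import unicodedata


def normalize_input(text, form="NFC"):
    """Apply a Unicode normalization form (NFC, NFD, NFKC, or NFKD)."""
    return unicodedata.normalize(form, text)


def has_problematic(text):
    """Assumed definition: surrogates, noncharacters, or controls except tab/LF/CR."""
    for ch in text:
        cp = ord(ch)
        if 0xD800 <= cp <= 0xDFFF:                               # surrogate
            return True
        if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:    # noncharacter
            return True
        if unicodedata.category(ch) == "Cc" and ch not in "\t\n\r":  # control
            return True
    return False


composed = normalize_input("e\u0301")          # e + combining acute -> single é
compat = normalize_input("\ufb01", "NFKC")     # fi ligature -> "fi"
```

NFKC is lossy (the ligature does not come back), so it suits search keys and matching better than data that must be stored unchanged.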
SBP5
If necessary, consider normalizing text before output (probably to NFC).
SBP6
Consider further string normalizations on read if useful (e.g. to ☞ NFTK).
SBP7
Consider normalizing when performing string comparisons.
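To illustrate SBP7: two strings that render identically can compare unequal until both are normalized. The helper names comparison_key and same_text are hypothetical, and the optional caseless mode is a simple approach rather than a full implementation of Unicode caseless matching.

```python
import unicodedata


def comparison_key(text, form="NFC", caseless=False):
    """Build a key for comparison: normalize, optionally case-fold, re-normalize."""
    key = unicodedata.normalize(form, text)
    if caseless:
        key = unicodedata.normalize(form, key.casefold())
    return key


def same_text(a, b, form="NFC", caseless=False):
    """Compare two strings after applying the same normalization to both."""
    return comparison_key(a, form, caseless) == comparison_key(b, form, caseless)


precomposed = "caf\u00e9"     # é as one code point
decomposed = "cafe\u0301"     # e followed by a combining acute accent

raw_equal = precomposed == decomposed              # compares code points directly
normalized_equal = same_text(precomposed, decomposed)
```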
SBP8
Note: string length and glyph count are not always the same,
e.g. len('👩❤️💋👨') = 10; tdda.utils.number_of_glyphs('👩❤️💋👨') = 1.
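A small sketch of why the two differ, using a different example from the one above: Python's len counts code points, so a single visible glyph built from several code points reports a length greater than one. The strings are written as escapes so the code points are unambiguous; counting glyphs themselves needs grapheme-cluster segmentation, which this sketch does not attempt.

```python
import unicodedata

# Family emoji: four people joined by three zero-width joiners (U+200D).
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"

code_points = len(family)                           # what len() reports
utf8_bytes = len(family.encode("utf-8"))            # size when stored as UTF-8
utf16_units = len(family.encode("utf-16-le")) // 2  # UTF-16 code units

# A simpler case: "e" plus a combining accent is two code points, one glyph.
e_acute = "e\u0301"
decomposed_len = len(e_acute)
composed_len = len(unicodedata.normalize("NFC", e_acute))
```

Normalization closes the gap for the accented letter but not for the emoji sequence, which has no single-code-point form.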