UnicodeDB
This library aims to bring the Unicode database to Nim. The main goals are O(1) access for every API and a small footprint.
Note: this library doesn't provide Unicode Common Locale Data Repository (CLDR) data.
Install
nimble install unicodedb
Compatibility
Nim 1.0.0 or later
Usage
Properties
import unicode
import unicodedb/properties
assert Rune('A'.ord).unicodeCategory() == ctgLu # 'L'etter, 'u'ppercase
assert Rune('A'.ord).unicodeCategory() in ctgLm+ctgLo+ctgLu+ctgLl+ctgLt
assert Rune('A'.ord).unicodeCategory() in ctgL
echo Rune(0x0660).bidirectional() # 'A'rabic, 'N'umber
# "AN"
echo Rune(0x860).combining()
# 0
echo nfcQcNo in Rune(0x0374).quickCheck()
# true
Names
import unicode
import unicodedb/names
echo lookupStrict("LEFT CURLY BRACKET") # '{'
# Rune(0x007B)
echo "/".runeAt(0).name()
# "SOLIDUS"
Compositions
import unicode
import unicodedb/compositions
echo composition(Rune(108), Rune(803))
# Rune(7735)
Decompositions
import unicode
import unicodedb/decompositions
echo Rune(0x0F9D).decomposition()
# @[Rune(0x0F9C), Rune(0x0FB7)]
Types
import unicode
import unicodedb/types
assert utmDecimal in Rune(0x0030).unicodeTypes()
assert utmDigit in Rune(0x00B2).unicodeTypes()
assert utmNumeric in Rune(0x2CFD).unicodeTypes()
assert utmLowercase in Rune(0x1E69).unicodeTypes()
assert utmUppercase in Rune(0x0041).unicodeTypes()
assert utmCased in Rune(0x0041).unicodeTypes()
assert utmWhiteSpace in Rune(0x0009).unicodeTypes()
assert utmWord in Rune(0x1E69).unicodeTypes()
const alphaNumeric = utmLowercase + utmUppercase + utmNumeric
assert alphaNumeric in Rune(0x2CFD).unicodeTypes()
assert alphaNumeric in Rune(0x1E69).unicodeTypes()
assert alphaNumeric in Rune(0x0041).unicodeTypes()
Widths
import unicode
import unicodedb/widths
assert "🕺".runeAt(0).unicodeWidth() == uwdtWide
Scripts
import unicode
import unicodedb/scripts
assert "諸".runeAt(0).unicodeScript() == sptHan
Casing
import sequtils
import unicode
import unicodedb/casing
assert toSeq("Ⓗ".runeAt(0).lowerCase) == @["ⓗ".runeAt(0)]
assert toSeq("İ".runeAt(0).lowerCase) == @[0x0069.Rune, 0x0307.Rune]
assert toSeq("ⓗ".runeAt(0).upperCase) == @["Ⓗ".runeAt(0)]
assert toSeq("ffi".runeAt(0).upperCase) == @['F'.ord.Rune, 'F'.ord.Rune, 'I'.ord.Rune]
assert toSeq("ß".runeAt(0).titleCase) == @['S'.ord.Rune, 's'.ord.Rune]
assert toSeq("ᾈ".runeAt(0).caseFold) == @["ἀ".runeAt(0), "ι".runeAt(0)]
Segmentation
import unicode
import unicodedb/segmentation
assert 0x000B.Rune.wordBreakProp == sgwNewline
Related libraries
Storage
Storage is based on multi-stage tables and minimal perfect hashing data-structures.
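To illustrate the multi-stage table idea, here is a minimal two-stage sketch. It is not the library's actual layout: the block size, the `Tables` type, and the toy property (1 for ASCII letters, 0 otherwise) are all assumptions made for the example. The code space is cut into fixed-size blocks, each distinct block is stored once, and a lookup is two array reads regardless of the code point.

```nim
# Hypothetical two-stage table. Names, block size and the toy property
# are illustrative assumptions, not unicodedb's actual layout.
const
  blockSize = 128        # code points per block (assumed)
  cpCount = 0x110000     # size of the Unicode code space

type Tables = object
  stage1: seq[uint16]    # block number -> index of a distinct block
  stage2: seq[uint8]     # distinct blocks, stored back to back

proc toyProp(cp: int): uint8 =
  ## Stand-in for real UCD data: 1 for ASCII letters, 0 otherwise.
  let isUpper = cp >= ord('A') and cp <= ord('Z')
  let isLower = cp >= ord('a') and cp <= ord('z')
  if isUpper or isLower: 1'u8 else: 0'u8

proc buildTables(): Tables =
  var seen: seq[seq[uint8]]          # distinct blocks found so far
  for start in countup(0, cpCount - 1, blockSize):
    var blk = newSeq[uint8](blockSize)
    for i in 0 ..< blockSize:
      blk[i] = toyProp(start + i)
    var idx = seen.find(blk)
    if idx < 0:                      # first time this block appears
      idx = seen.len
      seen.add blk
      result.stage2.add blk
    result.stage1.add uint16(idx)

proc lookup(t: Tables, cp: int): uint8 =
  ## Two array reads, independent of the code point: O(1).
  let blk = int(t.stage1[cp div blockSize])
  t.stage2[blk * blockSize + cp mod blockSize]

let tables = buildTables()
echo lookup(tables, ord('A'))    # 1
echo lookup(tables, 0x1F57A)     # 0
echo tables.stage2.len           # 256
```

With this toy data only two distinct blocks exist, so stage 2 holds 256 bytes instead of one byte per code point. Real property data has more distinct blocks, but most of the code space still shares a few of them.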
Sizes
These are the current collection sizes:
- properties is 40KB. Used by properties(1), category(1), bidirectional(1), combining(1) and quickCheck(1)
- compositions is 12KB. Used by composition(1)
- decompositions is 89KB. Used by decomposition(1) and canonicalDecomposition(1)
- names is 578KB. Used by name(1) and lookupStrict(1)
- names (lookup) is 241KB. Used by lookupStrict(1)
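The Storage section mentions minimal perfect hashing. Which collections rely on it is not stated here, so treat the following as a rough sketch of the general technique applied to a name-to-code-point lookup, not as the library's implementation. The four-entry key set, `slotOf`, `findSeed` and `lookupSketch` are hypothetical names, and the seed is found by brute force, which only works for tiny key sets.

```nim
# Hypothetical sketch: key set, hash function and seed search are
# illustrative assumptions, not unicodedb's actual implementation.
const
  names = ["SOLIDUS", "LEFT CURLY BRACKET", "DIGIT ZERO", "SPACE"]
  codePoints = [0x002F, 0x007B, 0x0030, 0x0020]

proc slotOf(name: string, seed: uint32, n: int): int =
  ## Seeded FNV-1a style hash, reduced to a slot in 0 ..< n.
  var h = 2166136261'u32 xor seed
  for c in name:
    h = (h xor uint32(ord(c))) * 16777619'u32
  h = h xor (h shr 15)               # fold higher bits into the low ones
  int(h mod uint32(n))

proc findSeed(keys: openArray[string]): uint32 =
  ## Brute-force a seed that gives every key its own slot.
  for seed in 1'u32 .. 1_000_000'u32:
    var used = newSeq[bool](keys.len)
    var collision = false
    for k in keys:
      let s = slotOf(k, seed, keys.len)
      if used[s]:
        collision = true
        break
      used[s] = true
    if not collision:
      return seed
  raise newException(ValueError, "no collision-free seed found")

let seed = findSeed(names)
var table = newSeq[int](names.len)   # slot -> index into names/codePoints
for i, n in names:
  table[slotOf(n, seed, names.len)] = i

proc lookupSketch(name: string): int =
  ## One hash, one slot read, one string compare; -1 for unknown names.
  let i = table[slotOf(name, seed, names.len)]
  if names[i] == name: codePoints[i] else: -1

echo lookupSketch("SOLIDUS")             # 47
echo lookupSketch("LEFT CURLY BRACKET")  # 123
echo lookupSketch("NO SUCH NAME")        # -1
```

Because the hash is a bijection onto the slots, the table has no empty entries and no collision chains. The final string compare is still needed to reject names that are not in the key set.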
Missing APIs
New APIs will be added from time to time. If you need something that's missing, please open an issue or PR and mention the use case.
Upgrading Unicode version
Note: PRs upgrading the Unicode version won't get merged; open an issue instead!
- Run nimble gen to check there are no changes to ./src/*_data.nim. If there are, try an older Nim version and fix the generators accordingly.
- Run nimble gen_tests to update all test data to the current Unicode version. The tests for a new Unicode version run against the previous Unicode version.
- Run tests and fix all failing tests. This should require just temporarily commenting out all checks for missing Unicode points.
- Overwrite ./gen/UCD data with the latest Unicode UCD.
- Run nimble gen to generate the new data.
- Run tests. Add the checks for missing Unicode points back. A handful of Unicode points may have changed their data; check the Unicode changelog page, make sure they are correct, and skip them.
Tests
Initial tests were run against [a dump of] Python's unicodedata module to ensure correctness.
Also, the related libraries have their own custom tests (some of the test data is provided by the Unicode Consortium).
nimble test
Contributing
I plan to work on most of the missing related libraries (case folding, etc.). If you would like to work on one of those, please let me know and I'll add it to the list. If you find that required database data is missing, open either an issue or a PR.
LICENSE
MIT