Modeling your Data
How to model your data with CanDB
A) The Entity
The Entity is the equivalent of the base data record or item that is stored in CanDB. It consists of:
- Partition Key (PK) - A string key referring to a distinct partitioned space, made up of a single canister or an auto-scaled group of canisters, where the entity will reside.
  NOTE: In CanDB, creating a new PK creates a new canister, and auto-scaling refers to creating new canisters that have that same PK.
- Sort Key (SK) - A string key that can be used to provide direct or range-sorted access to a single entity or a range of entities.
- Attributes - A series of key-value pairs holding attribute metadata.
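As a rough illustration, an entity can be pictured as a record with these three fields. The sketch below is written in Python rather than against CanDB's Motoko API, and the `Entity` class, its field names, and the sample values are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class Entity:
    """Hypothetical stand-in for a CanDB entity (not the library's actual type)."""
    pk: str  # Partition Key: selects the partition (canister or canister group)
    sk: str  # Sort Key: orders the entity within its partition
    attributes: Dict[str, Any] = field(default_factory=dict)  # key-value metadata


# Sample values are made up for illustration.
comment = Entity(
    pk="subreddit#motoko",
    sk="post#1#comment#7",
    attributes={"author": "alice", "body": "Nice post!"},
)
```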
1) Partition Keys
Partition keys are used to partition the data stored throughout your application.
In CanDB, creating a new PK will create a new canister with that PK - this can also be referred to as creating a partition. When a canister inside a partition hits its storage limits, it will auto-scale, which means that it will create a new canister within that partition that has the same PK as all other canisters in that partition.
2) Sort Keys
a) Sort Key Best practices
i) Represent Hierarchical data with compound sort keys
Many times data is hierarchical and defines one-to-many relationships, where each item can be linked together by a parent-child relationship in an overall tree structure. You can think of this in terms of social media applications like Reddit, which has subreddits, which have posts, comments on those posts, and replies to each comment.
You can use compound sort keys to represent this data, separating various fields with a “#” or a delimiter of your choice.
This way, the following SK can be used to iterate through all comments within a post, or all comments within a subreddit:
subreddits#<subreddit_id>#posts#<post_id>#comment#<comment_id>
The sort key below (notice the plural “comments”) can be used to iterate through all replies to a comment:
subreddits#<subreddit_id>#posts#<post_id>#comments#<comment_id>#reply#<reply_id>
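The two key shapes above can be tried out with plain sorted strings. This is a minimal sketch in Python, not CanDB code: the helper names (`comment_sk`, `reply_sk`, `with_prefix`) and the ids are made up, and a real scan would use sort key bounds rather than filtering a list.

```python
from typing import List

DELIM = "#"  # delimiter of your choice


def comment_sk(subreddit_id: str, post_id: str, comment_id: str) -> str:
    # Singular "comment": one key per comment on a post.
    return DELIM.join(["subreddits", subreddit_id, "posts", post_id, "comment", comment_id])


def reply_sk(subreddit_id: str, post_id: str, comment_id: str, reply_id: str) -> str:
    # Plural "comments": keeps reply keys out of the comment key range.
    return DELIM.join(
        ["subreddits", subreddit_id, "posts", post_id, "comments", comment_id, "reply", reply_id]
    )


def with_prefix(sorted_keys: List[str], prefix: str) -> List[str]:
    # Stand-in for a range scan over all keys sharing a prefix.
    return [k for k in sorted_keys if k.startswith(prefix)]


keys = sorted([
    comment_sk("s1", "p1", "c1"),
    comment_sk("s1", "p1", "c2"),
    comment_sk("s1", "p2", "c3"),
    reply_sk("s1", "p1", "c1", "r1"),
    reply_sk("s1", "p1", "c1", "r2"),
])

comments_in_post = with_prefix(keys, "subreddits#s1#posts#p1#comment#")
replies_to_comment = with_prefix(keys, "subreddits#s1#posts#p1#comments#c1#reply#")
```

Because the comment prefix ends in `comment#` and the reply keys continue with `comments#`, the first query returns only the two comments on post `p1` and never picks up replies.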
Here’s another example from AWS’s DynamoDB showing how you can use sort keys to model geographic hierarchical data, and version your data.
ii) Lexicographic integer encodings
Because all sort keys are strings, they are sorted in lexicographical, or “alphabetical”, order.
Words work well in sort keys, but numbers and timestamps do not sort in numeric order as you might intend.
For example, if you store the following sort keys in CanDB,
- user#20
- user#5
- user#307
They will sort themselves alphabetically in the following order, instead of numerically as you may desire (say, if you want to order data by upvote score):
[ “user#20”, “user#307”, “user#5” ]
Converting each integer with a lexicographic integer encoding (using something like the Motoko or JavaScript implementations) preserves the numeric ordering, with the resulting ids ending up like this:
- 5 -> 05
- 20 -> 14
- 307 -> fb38
Each of the string encodings of the numeric ids is then sortable as intended within a string:
[ “user#05”, “user#14”, “user#fb38” ]
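To make the idea concrete, here is a simplified Python sketch of one such encoding. It reproduces the three examples above (5, 20, 307), but it is an assumption-laden illustration rather than a drop-in replacement for the Motoko or JavaScript libraries: the function name `encode_lex_int` is made up, and the handling of values above 506 is one straightforward way to extend the scheme, limited here to payloads of up to four bytes.

```python
def encode_lex_int(n: int) -> str:
    """Encode a non-negative integer as lowercase hex that sorts in numeric order."""
    if n < 0:
        raise ValueError("only non-negative integers are supported")
    if n < 251:
        # Small values fit in a single byte.
        return f"{n:02x}"
    x = n - 251
    for payload_len in range(1, 5):
        if x < 256 ** payload_len:
            # A prefix byte (251..254) records the payload length, so longer
            # (larger) numbers always sort after shorter (smaller) ones.
            prefix = bytes([250 + payload_len])
            return (prefix + x.to_bytes(payload_len, "big")).hex()
    raise ValueError("value too large for this sketch")


sort_keys = sorted(f"user#{encode_lex_int(n)}" for n in (20, 5, 307))
```

Each byte is always written as exactly two lowercase hex characters, which is what lets plain string comparison agree with numeric comparison.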
iii) Use ULIDs instead of timestamps
Commonly, you’ll want to allow access to the latest n entries by timestamp (getLatestComments(), getLatestPurchases()), in which case you may want to add a timestamp to your sort key, for example
getLatestComments() commentTimestamp#<comment_timestamp>#comment#<comment_id>
or
getLatestPurchases() purchaseTimestamp#<purchase_timestamp>#purchase#<purchase_id>
As timestamps are integers, sorting timestamps runs into the same issue as sorting integers. Timestamps can also collide when two separate events occur at the same instant. Instead of tacking a UUID onto the timestamp, you may wish to use a ULID (Universally Unique Lexicographically Sortable Identifier), which can generate up to 1.21e+24 unique identifiers per millisecond.
Because a ULID carries its own timestamp, this allows the same example from above to be reduced to
getLatestComments() comment#<comment_ulid>
or
getLatestPurchases() purchase#<purchase_ulid>
Because a ULID embeds its creation time, comparing ULIDs tells you the relative creation order of the entities inserted into CanDB.
And good news: Motoko already has a ready-to-use ULID library.
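For intuition about why ULIDs sort by creation time, here is a minimal Python sketch of the ULID layout: a 48-bit millisecond timestamp followed by 80 random bits, written as 26 Crockford Base32 characters. It is not the Motoko library mentioned above, the function name `new_ulid` is made up, and unlike some implementations it makes no ordering guarantee between two ULIDs created in the same millisecond.

```python
import os
import time
from typing import Optional

# Crockford Base32 alphabet (no I, L, O, U); its characters are in ASCII order,
# so comparing encoded strings matches comparing the underlying numbers.
CROCKFORD32 = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"


def new_ulid(timestamp_ms: Optional[int] = None) -> str:
    """Build a ULID-shaped string: 10 timestamp chars + 16 randomness chars."""
    if timestamp_ms is None:
        timestamp_ms = time.time_ns() // 1_000_000
    if not 0 <= timestamp_ms < 2 ** 48:
        raise ValueError("timestamp must fit in 48 bits")
    randomness = int.from_bytes(os.urandom(10), "big")  # 80 random bits
    value = (timestamp_ms << 80) | randomness
    chars = []
    for _ in range(26):  # 26 chars * 5 bits covers the 128-bit value
        chars.append(CROCKFORD32[value & 0x1F])
        value >>= 5
    return "".join(reversed(chars))


earlier = new_ulid(1_700_000_000_000)
later = new_ulid(1_700_000_000_001)
sort_key = f"comment#{later}"
```

Since the timestamp occupies the leading characters, `earlier < later` holds as a plain string comparison, which is exactly the property a sort key needs.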
B) Example: Designing a basic OpenChat application using CanDB
NOTE:
Every application has different requirements, and this is just one approach to designing a data model for a chat application.
OpenChat is a multi-canister chat application on the IC that works very much like current web2 messaging apps, allowing DM and group conversations.
OpenChat solves the issue of scalability by assigning a unique canister for each user.
NOTE:
Keep in mind that for other applications, it may not make sense to assign a partition key for each user, but instead to assign a partition key for every 100 users, 1000 users, or all users. This all depends on your data requirements and how quickly you expect your canister memory to scale and grow.
Let’s look at how we can design a simple CanDB NoSQL data schema that will support the following APIs:
- createChat() - allows a user to create a new chat (group or DM)
- sendMessage() - sends the message to the other user(s) in the chat
- getMessagesInChat() - gets the latest n messages in a specific user’s chat
- getChatsForUser() - gets all chats for a specific user
To replicate the canister-per-user approach that OpenChat has, each user should be assigned a unique PK. We can do this by setting the PK to user#<user_id>.
Also, in order to build a messaging application, we first need to be able to create chats and have a location where all chat metadata is stored, such as an allowlist of users and the owner/creator of the chat. For this application example, we’ll keep the allowlist/metadata functionality separate from the user canisters by creating a chats PK where all chat metadata will be stored.
Within each partition, we can then use the SK to quickly provide different data access patterns for our chat application. We’ll now go through how we can choose the right SK to support the necessary access patterns for each of our APIs:
1. getChatsForUser()
PK = user#<user_id> SK (bounded), from chat# to chat#~
Any data that is hierarchical (i.e. one-to-many) works very well with CanDB. For example, a user can have many chats, so we can represent each chat with an SK of chat#<chat_id>. For Attributes, chat metadata is stored for each chatId, such as the chatName (display name) and the chatType (group/dm).
This allows us to use the user#<user_id> PK and CanDB’s scan functionality to paginate in ascending or descending order through all of the chats. To do this, we use the power of sort keys and set a skLowerBound of chat#, and an skUpperBound of chat#~.
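The bounded scan can be mimicked over a sorted list of sort keys. The Python sketch below is a toy model only: the `scan` function and its parameters are hypothetical stand-ins, the sample keys are invented, and real pagination through CanDB would also need to carry a cursor between pages.

```python
from bisect import bisect_left, bisect_right
from typing import List


def scan(sorted_sks: List[str], sk_lower_bound: str, sk_upper_bound: str,
         limit: int, ascending: bool = True) -> List[str]:
    """Return up to `limit` sort keys within the inclusive bounds."""
    lo = bisect_left(sorted_sks, sk_lower_bound)
    hi = bisect_right(sorted_sks, sk_upper_bound)
    in_range = sorted_sks[lo:hi]
    if not ascending:
        in_range.reverse()
    return in_range[:limit]


# Sort keys living in one user's partition (user#<user_id>).
user_partition = sorted([
    "chat#a1", "chat#b2", "chat#c3",
    "chats#a1#message#01",
    "profile",
])

first_page = scan(user_partition, "chat#", "chat#~", limit=2)
newest_first = scan(user_partition, "chat#", "chat#~", limit=2, ascending=False)
```

The `~` character sorts after ASCII digits and letters, so `chat#~` works as an upper bound for every `chat#<chat_id>` key made of those characters.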
2. getMessagesInChat()
PK = user#<user_id> SK (bounded), from chats#<chat_id>#message# to chats#<chat_id>#message#~
Messages within a chat are also a hierarchical relationship (a chat can have many messages), so we can represent each message with an SK of chats#<chat_id>#message#<message_id>. Notice how the beginning of this prefix is plural (chats#...), and not the singular (chat#...) version we use to paginate through a user’s chats in the getChatsForUser() method. Because chats#... keys sort after chat#~, the two ranges never overlap.
This allows us to use the user#<user_id> PK and CanDB’s scan functionality to paginate through all of the messages in a chat by setting SK bounds, with an skLowerBound of chats#<chat_id>#message# and an skUpperBound of chats#<chat_id>#message#~.
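The reason the singular and plural prefixes stay separate comes down to ordinary string comparison, which a few lines of Python can check. The helper `in_bounds` and the sample ids are made up for illustration.

```python
def in_bounds(sk: str, lower: str, upper: str) -> bool:
    # Inclusive range check, the same comparison a bounded scan relies on.
    return lower <= sk <= upper


chat_entry = "chat#42"
message_entry = "chats#42#message#0001"
other_chat_message = "chats#43#message#0001"

chat_bounds = ("chat#", "chat#~")
message_bounds = ("chats#42#message#", "chats#42#message#~")

# "chat#..." and "chats#..." first differ at the fifth character: "#" versus "s".
# Since "#" sorts before "s", every chats#... key falls after chat#~.
chat_scan_sees_message = in_bounds(message_entry, *chat_bounds)      # False
message_scan_sees_chat = in_bounds(chat_entry, *message_bounds)      # False
message_scan_sees_other = in_bounds(other_chat_message, *message_bounds)  # False
```

The same check also shows that messages from chat 43 stay outside the bounds for chat 42, so each chat's messages can be paginated independently.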
3. createChat()
PK = chats SK = chat#<chat_id> Attributes = chatID, chatType, chatAllowList, chatOwner
CanDB.put() can be used every time a chat is created or updated, providing the chats PK and the chat#<chat_id> SK. Chat metadata Attributes, such as the chatID, chatType, chatAllowList, and chatOwner, can be provided with the chat creation or update. Likewise, queries can use CanDB.get() to access this same chat metadata through the same PK and SK.
Once a chat is created, a secondary action can be for each of the user ids in the chatAllowList to use CanDB.put() to create a chat#<chat_id> SK in each of their user partitions with the user#<user_id> PK, or at least to add a “pending request” invitation to this chat.
4. sendMessage()
PK = user#<user_id> SK = chats#<chat_id>#message#<message_id> Attributes = messageId, messageBody, messageMetadata
First, in order to send a message to all members of a chat, the chat’s metadata entity (chats PK, chat#<chat_id> SK) can be queried with CanDB.get() to retrieve the chatAllowList, which is that specific chat’s user list.
Then, for each of those users an update call can be made to CanDB.put() to insert the message using the user’s user#<user_id> PK and chats#<chat_id>#message#<message_id> SK with Attributes of messageId, messageBody, and messageMetadata.
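The createChat() and sendMessage() flows can be sketched end to end with an in-memory stand-in for the partitions. Everything here is an assumption made for illustration: the nested dictionary replaces real canisters, `put` and `get` only imitate the role of CanDB.put() and CanDB.get(), and in a real deployment each write to another user's partition would be a separate inter-canister call.

```python
from typing import Any, Dict, List, Optional

Attributes = Dict[str, Any]
# Toy store: partition key -> {sort key -> attributes}.
Store = Dict[str, Dict[str, Attributes]]


def put(store: Store, pk: str, sk: str, attributes: Attributes) -> None:
    store.setdefault(pk, {})[sk] = attributes


def get(store: Store, pk: str, sk: str) -> Optional[Attributes]:
    return store.get(pk, {}).get(sk)


def create_chat(store: Store, chat_id: str, chat_type: str,
                owner: str, allow_list: List[str]) -> None:
    # 1) Chat metadata lives in the shared "chats" partition.
    put(store, "chats", f"chat#{chat_id}", {
        "chatID": chat_id, "chatType": chat_type,
        "chatAllowList": allow_list, "chatOwner": owner,
    })
    # 2) Secondary action: register the chat in each member's user partition.
    for user_id in allow_list:
        put(store, f"user#{user_id}", f"chat#{chat_id}", {"chatType": chat_type})


def send_message(store: Store, chat_id: str, message_id: str, body: str) -> None:
    chat = get(store, "chats", f"chat#{chat_id}")
    if chat is None:
        raise KeyError(f"unknown chat {chat_id}")
    # Fan the message out to every user on the chat's allowlist.
    for user_id in chat["chatAllowList"]:
        put(store, f"user#{user_id}", f"chats#{chat_id}#message#{message_id}",
            {"messageId": message_id, "messageBody": body, "messageMetadata": {}})


store: Store = {}
create_chat(store, "c1", "group", owner="alice", allow_list=["alice", "bob"])
send_message(store, "c1", "0001", "hello")
```

Each message is written once per member, trading extra storage for reads that only ever touch the reader's own partition.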
C) Example Schema Diagram
(built using NoSQL Workbench)
Note: In this example, numerical SK identifiers are used for readability. In practice, it’s recommended that you use a lexicographic integer encoding of numeric IDs inside sort keys.
D) Resources
- Video: Data Modeling with DynamoDB
- Tool to Visualize your Data Model: NoSQL Workbench