How to store your books
Published on Saturday, 13. June 2020Think of your PC's working memory as a giant storage facility for books. Each book is stored in a box that as a fixed place in the facility. All boxes are numbered consecutively. This way, you can refer to each stored book with a single number.
Most people that use this facility use it to share access to their books. Usually, they don't want to share a single book, but many. To make sharing as easy as possible, they want to use only a single number for their complete collection, refer to only one box.
To achieve this, different people use different approaches, that fit their needs best.
One of the users of the facility is Arnold. He uses it to save and share his ten favorite books of all time. He found a simple way to do this. When he first started using the facility, he looked for 10 consecutive unused boxes. Then he put each of his books in these.
Now all he has to remember is the number for the first box. Let's assume it to be 146. If Arnold wants to look at the third book in his collection, he immediately knows, that he will find it in box number 148. This system works perfectly for Arnold, and he recommends it to Liza.
Liza is a big fan of an ongoing series of Sci-Fi novels. For this series, about 2-3 novels are published each year and Liza wants to keep all of them in the facility. Because of this, Arnold's system doesn't work for her. The reason why Arnold can simply rent a certain space is that the number of books he has is limited. If he wanted to store another book, he would remove one for it.
But Liza's collection is constantly growing. She can't just put a new novel in the box directly after her collection, because it might already be rented. She would have to look for a free space that fits her collection including the new novel and move all books there. She doesn't want to do that.
Instead, she decides to rent a new box for each novel. But instead of memorizing all boxes, she adds a note in each box, that says where the next book is stored. So now, whenever a new novel is released, she can simply store the book in a new box. Then she finds the previous box by looking at her first box to find the second to find the third, and so on until she found the last in which she adds a note with the number of the new box. Doing it like this involves a lot of walking inside the facility. To make this faster, the minds behind the facility have developed a transportation system (which may be the topic of another post).
But even with this transportation system, Liza has to look at each box to store a new book. She also can't just go to the third book by calculating an offset. This isn't a problem for her. After all, she collects a series of novels. And if she ever decides to reread them again, she will start with the first book again.
Marian is a friend of Liza. She is a bookseller, which means she handles a lot of books. Most of them she doesn't keep for long, though. This means she doesn't have a problem with space limitations. The books move so fast through her hands, that she never handles more than 1000 books at a time. What is important to her is finding a specific book as fast as possible. Because of the restrictions of the facility, implementing a system to this is slightly more challenging.
To have an overview of her books in a usual book store she might have a different section for each genre and then sort the books by the author and title. When adding a new book, she would simply push the other books around until she had a gap big enough and put it between them. This doesn't work in this book storing facility. To put a book in box 183, she would have to move the book from box 183 to box 184, and the book from 184 to 185, and so on until she ends up at an empty box. If Marian would buy Alice's Adventures in Wonderland (by Lewis Carroll), she might need to move over 100 books, which would take too much of her time.
Her ideal system looks similar to Arnold's in that she would like a possibility to calculate the position of a book directly. Just as Arnold knows the box of his third favorite book, Marian would want to find a book without looking at any other box.
The approach she starts with is simple. To find a number for each book, she decides on a numerical representation for each letter (e.g. by using this table). Then she adds up the numerical representations for each letter of the book title to come up with a number for each book. Doing this means that to get the same number for each book, she always has to write it in the exact same way. It doesn't matter what conventions she follows, but it is important that she defines a set of conventions. In her case, she uses the last name of the author and the title of the book but ignores the subtitle. Furthermore, she writes all words in lowercase, separates the title and the name with a colon and the words of the title with a hyphen. For example, The art of learning - An inner Journey to optimal performance becomes "waitzkin:the-art-of-learning". The sum for this book is 2783.
Next, she divides the sum of all these numbers by the number of total boxes she uses. The remainder of this division becomes the box position for the book. For the Art of learning this means that the book will be placed in box 783.
This approach is a great start, but it is not enough. Marian wants to be able to store her best sellers more than once. Also, there are a lot more books out there than she uses boxes. This means that, for example, the books with the sum 712342 and 1342 would both be placed in the same box. So instead of using the sum as the final box position, she uses it as a starting place to find an empty box. If she wants to store the Art of Learning, but there's already a book in box 783, she puts it in box 784. Using this approach, she doesn't have to look through her whole collection to check if she has a book in stock. She only has to calculate the starting position and look from there until she finds either the book or an empty box. If she finds an empty box, this means she doesn't have the book in stock.
The effectiveness of this approach depends on the frequency with which different books are assigned to the same position. Think about what would happen if all books are assigned the same position. Marian would have to look through all books in her collection, instead of finding each book immediately. The goal is that these collisions happen as little as possible. Unfortunately, simply adding up the numerical representations does a terrible job at this.
After some experimentation, Marian finds a better way to calculate the sum. She goes through the letters one by one and before adding the number for the current letter, she multiplies her previous result by 33. Using the table linked above, the word "fly" becomes (102*33 + 108) * 33 + 121 = 114763. (f has the numerical value 102, l translates to 108, and y to 121)
What you've just read were three use-cases to demonstrate how to use storage with the same properties in different ways. The idea behind data structures is to use it in a way that best fits your needs.
This will always imply a trade-off. For example, Marian's approach is great if you want to store, remove, and find books fast. But if your goal is to find the book with the oldest publication date, it will take longer than for the other approaches. In all approaches, you have to look at all the books, but here, you also have a lot of empty boxes between the books.