Data Structures

Overview

Teaching: 15 min
Exercises: 15 min
Questions
  • How can data structures simplify codes?

Objectives
  • Understand why different data structures exist

  • Recall common data structures

  • Recognize where and when to use them

What is a data structure?

data structure: represents information and how that information can be accessed

Choosing the right representation for your data can vastly simplify your code and increase performance.

Choosing the wrong representation can melt your brain. Slowly.

Examples:

The number 2 is represented with the integer 2

>>> type(2)
int

Acceptable behaviors for integers include +, -, *, /

>>> 1 + 2
3
>>> 1 - 2
-1

On the other hand, text is represented by a string

>>> type("text")
str

It does not accept all the same behaviours as an integer:

>>> "a" + "b"
"ab"
>>> "a" * "b"
TypeError: can't multiply sequence by non-int of type 'str'

Integers can be represented as strings, but their behaviours would be unexpected:

>>> "1" + "2"
"12"

With integers, + is an addition, for strings it’s a concatenation.

Health impact of choosing the wrong data structure

Stay healthy. Stop and choose the right data structure for you!

Basic data structures

Lists

Lists are containers of other data:

# List of integers
[1, 2, 3]
# List of strings
["a", "b", "b"]
# List of lists of integers
[[1, 2], [2, 3]]

Lists make sense when:

Beware! The following might indicate a list is the wrong data structure:

Other languages

  • C++:
    • std::vector, fast read-write to element i and fast iteration operations. Slow insert operation.
    • std::list. No direct access to element i. Fast insert, append, splice operations.
  • R: list
  • Julia: Array, also equivalent to numpy arrays.
  • Fortran: array, closer to numpy arrays than python lists

Tuples

Tuples are short and immutable containers of other data.

(1, 2)
("a", b")

Immutable means once that once created, elements cannot be added, removed or replaced:

>>> shape = 2, 4
>>> shape
(2, 4)
>>> shape[1] = 4
TypeError: 'tuple' object does not support item assignment

Modifying a tuple vs modifying the element of a tuple

>>> something = ["a", "b"], 4
>>> something[0].append("c")
>>> something
(['a', 'b', 'c'], 4)
>>> something[0] = ["a", "b", "c"]
TypeError: 'tuple' object does not support item assignment

The tuple itself cannot be modified, but its elements can be if they themselves are mutable. The container is immutable, but the contents might not be.

Tuple make sense when:

Beware! The following might indicate a tuple is the wrong data structure:

Other languages

Sets

Sets are containers where each element is unique:

>>> set([1, 2, 2, 3, 3])
{1, 2, 3}

They make sense when:

Other languages

Dictionaries

Dictionaries are mappings between a key and a value (e.g. a word and its definition).

# mapping of animals to legs
{"horse": 4, "kangaroo": 2, "millipede": 1000}

They make sense when:

Beware! The following might indicate a dict is the wrong data structure:

Other languages

Advanced data structures

Custom data structures: Data classes

Python (>= 3.7) makes it easy to create custom data structures.

>>> from typing import List, Text
>>> from dataclasses import dataclass
>>> @dataclass
... class MyData:
...   a_list: List[int]
...   b_string: Text = "something something"
>>> data = MyData([1, 2])
>>> data
MyData(a_list=[1, 2], b_string='something something')
>>> data.a_list
[1, 2]

Data classes make sense when:

Beware! The following might indicate a dataclass is the wrong data structure:

Exploring data structures

In your own time, find out all the thing data structures can do for you:

As well as other standard data structures:

All modern languages will have equivalents, outside of “Modern Fortran”.

Don’t reinvent the square wheel.

Digital Oxford Dictionary, the wrong way and the right way

  1. Implement an oxford dictionary with two lists, one for words, one for definitions:

     barf: (verb) eject the contents of the stomach through the mouth
     morph: (verb) change shape as via computer animation
     scarf: (noun) a garment worn around the head or neck or shoulders for warmth
       or decoration
     snarf: (verb) make off with belongings of others
     sound: |
       (verb) emit or cause to emit sound.
       (noun) vibrations that travel through the air or another medium
     surf: |
       (verb) switch channels, on television
       (noun) waves breaking on the shore
    
  2. Given a word, find and modify its definition
  3. Do the same with a dict
  4. Create a subset dictionary (including definitions) of words rhyming with “arf” using either the two-list or the dict implementation
  5. If now we want to also encode “noun” and “verb”, what data structure could we use?
  6. What about when there are multiple meanings for a verb or a noun?

Dictionary implemented with lists

from typing import List, Text, Tuple


def modify_definition(
    word: Text, newdef: Text, words: List[Text], definitions: List[Text]
) -> List[Text]:
    from copy import copy

    index = words.index(word)
    definitions = copy(definitions)
    definitions[index] = newdef
    return definitions


def find_rhymes(
    rhyme: Text, words: List[Text], definitions: List[Text]
) -> Tuple[List[Text], List[Text]]:
    result_words = []
    result_definitions = []
    for word, definition in zip(words, definitions):
        if word.endswith(rhyme):
            result_words.append(word)
            result_definitions.append(definition)
    return result_words, result_definitions


def testme():

    words = ["barf", "morph", "scarf", "snarf", "sound", "surf"]
    definitions = [
        "(verb) eject the contents of the stomach through the mouth",
        "(verb) change shape as via computer animation",
        (
            "(noun) a garment worn around the head or neck or shoulders for"
            "warmth or decoration"
        ),
        "(verb) make off with belongings of others",
        (
            "(verb) emit or cause to emit sound."
            "(noun) vibrations that travel through the air or another medium"
        ),
        (
            "(verb) switch channels, on television"
            "(noun) waves breaking on the shore"
        ),
    ]

    newdefs = modify_definition("morph", "aaa", words, definitions)
    assert newdefs[1] == "aaa"

    rhymers = find_rhymes("arf", words, definitions)
    assert set(rhymers[0]) == {"barf", "scarf", "snarf"}
    assert rhymers[1][0] == definitions[0]
    assert rhymers[1][1] == definitions[2]
    assert rhymers[1][2] == definitions[3]


if __name__ == "__main__":

    # this is one way to include tests.
    # the second session will introduce a better way.
    testme()

Dictionary implemented with a dictionary

from typing import List, Text, Tuple, Mapping


def modify_definition(
    word: Text, newdef: Text, dictionary: Mapping[Text, Text]
) -> List[Text]:
    from copy import copy

    result = copy(dictionary)
    result[word] = newdef
    return result


def find_rhymes(
    rhyme: Text, dictionary: Mapping[Text, Text]
) -> Tuple[List[Text], List[Text]]:
    return {
        word: definition
        for word, definition in dictionary.items()
        if word.endswith(rhyme)
    }


def testme():

    dictionary = {
        "barf": "(verb) eject the contents of the stomach through the mouth",
        "morph": "(verb) change shape as via computer animation",
        "scarf": (
            "(noun) a garment worn around the head or neck or shoulders for"
            "warmth or decoration"
        ),
        "snarf": "(verb) make off with belongings of others",
        "sound": (
            "(verb) emit or cause to emit sound."
            "(noun) vibrations that travel through the air or another medium"
        ),
        "surf": (
            "(verb) switch channels, on television"
            "(noun) waves breaking on the shore"
        ),
    }

    newdict = modify_definition("morph", "aaa", dictionary)
    assert newdict["morph"] == "aaa"

    rhymers = find_rhymes("arf", dictionary)
    assert set(rhymers) == {"barf", "scarf", "snarf"}
    for word in {"barf", "scarf", "snarf"}:
        assert rhymers[word] == dictionary[word]


if __name__ == "__main__":

    testme()

More complex data structures for more complex dictionary

There can be more than one good answer. It will depend on how the dictionary is meant to be used later throughout the code.

Below we show three possibilities. The first is more deeply nested. It groups all definitions together for a given word, whether that word is a noun or a verb. If more often than not, it does not matter so much what a word is, then it might be a good solution. The second example flatten the dictionary by making “surf” the noun different from “surf” the verb. As a result, it is easier to access a word with a specific semantic category, but more difficult to access all definitions irrespective of their semantic category.

One pleasing aspect of the second example is that together things that are unlikely to change one one side (word and semantic category), and a less precise datum on the other (definitions are more likely to be refined).

The third possibility is a pandas DataFrame with three columns. It’s best suited to big-data problems where individual words (rows) are seldom accessed one at a time. Instead, most operations are carried out over subsets of the dictionary.

from typing import Text
from enum import Enum, auto
from dataclasses import dataclass
from pandas import DataFrame, Series


class Category(Enum):
    NOUN = auto
    VERB = auto


@dataclass
class Definition:
    category: Text
    text: Text


first_example = {
    "barf": [
        Definition(
            Category.VERB, "eject the contents of the stomach through the mouth"
        )
    ],
    "morph": [
        Definition(
            Category.VERB, "(verb) change shape as via computer animation"
        )
    ],
    "scarf": [
        Definition(
            Category.NOUN,
            "a garment worn around the head or neck or shoulders for"
            "warmth or decoration",
        )
    ],
    "snarf": Definition(Category.VERB, "make off with belongings of others"),
    "sound": [
        Definition(Category.VERB, "emit or cause to emit sound."),
        Definition(
            Category.NOUN,
            "vibrations that travel through the air or another medium",
        ),
    ],
    "surf": [
        Definition(Category.VERB, "switch channels, on television"),
        Definition(Category.NOUN, "waves breaking on the shore"),
    ],
}


# frozen makes Word immutable (the same way a tuple is immutable)
# One practical consequence is that dataclass will make Word work as a
# dictionary key: Word is hashable
@dataclass(frozen=True)
class Word:
    word: Text
    category: Text


second_example = {
    Word(
        "barf", Category.VERB
    ): "eject the contents of the stomach through the mouth",
    Word("morph", Category.VERB): "change shape as via computer animation",
    Word("scarf", Category.NOUN): (
        "a garment worn around the head or neck or shoulders for"
        "warmth or decoration"
    ),
    Word("snarf", Category.VERB): "make off with belongings of others",
    Word("sound", Category.VERB): "emit or cause to emit sound.",
    Word(
        "sound", Category.NOUN
    ): "vibrations that travel through the air or another medium",
    Word("surf", Category.VERB): "switch channels, on television",
    Word("surf", Category.NOUN): "waves breaking on the shore",
}

# Do conda install pandas first!!!
import pandas as pd

third_example = pd.DataFrame(
    {
        "words": [
            "barf",
            "morph",
            "scarf",
            "snarf",
            "sound",
            "sound",
            "surf",
            "surf",
        ],
        "definitions": [
            "eject the contents of the stomach through the mouth",
            "change shape as via computer animation",
            (
                "a garment worn around the head or neck or shoulders for"
                "warmth or decoration"
            ),
            "make off with belongings of others",
            "emit or cause to emit sound.",
            "vibrations that travel through the air or another medium",
            "switch channels, on television",
            "waves breaking on the shore",
        ],
        "category": pd.Series(
            ["verb", "verb", "noun", "verb", "verb", "noun", "verb", "noun"],
            dtype="category",
        ),
    }
)

Key Points

  • Data structures are the representation of information the same way algorithms represent logic and workflows

  • Using the right data structure can vastly simplify code

  • Basic data structures include numbers, strings, tuples, lists, dictionaries and sets.

  • Advanced data structures include numpy array and panda dataframes

  • Classes are custom data structures