Data Aggregates Part 1

Introduction

In this module we will cover advanced operations with strings and lists, namely:

Advanced Strings - immutability, encodings, escape sequences, string comparisons, multi-line strings, slicing, copying, cloning, and other common string methods and functions
Advanced Lists - indexing, slicing, iterating, list comprehension, copying, cloning, and other common list methods and functions
Matrices and Cubes - lists within lists

The source code for this module may be found in the public GitHub repository for this course. Where code snippets are provided in this module, you are strongly encouraged to type and execute these Python statements in your own Jupyter notebook instance.

1. Advanced Strings

As introduced in Control and Evaluations Part 1, string literals are sequences of characters enclosed within either single or double quotes. Represented as a data structure, strings are sequences of Unicode characters. Since they are sequences, they can be accessed and operated upon in a similar way to lists. However, unlike lists, strings in Python are immutable, meaning that once created they cannot be changed. Any operation that appears to modify a string instead creates a new string object. In Control and Evaluations Part 2, we explored basic operations with strings, including indexing and slicing, common string methods and escape characters. In this module, we will explore more advanced operations with strings including further string methods, encodings, string comparisons and common string functions.

1.1. Character Encodings

When we type characters on our keyboards, most of us do not give a moment's thought as to how computers and software applications translate the typing of a key to a representation on the screen that we can understand. Different characters represent different linguistic entities across different global human languages - for example they may represent alphanumeric characters, punctuation and other language-specific symbols - and are stored by our computers as bytes.

Character sets represent a specific grouping of characters. Common character sets include ISO-8859-1 (Latin 1) for Western European characters, ISO-8859-6 for Arabic characters, ASCII (American Standard Code for Information Interchange) that contains 128 characters (the numbers 0 - 9, upper and lower case English letters from A - Z, and special characters including common punctuation), and Unicode (designed as a universal character set defining all characters for the vast majority of living languages).

In the Unicode standard, every character is assigned its own unique code point, which is an integer from 0 to 1,114,111 (or 0x10FFFF in the hexadecimal number system). In the Unicode standard, a code point is written with the notation U+1F600 which maps to the character with value 0x1f600 in this case (or 128,512 using the decimal system, which incidentally is the code point for the 😀 grinning face emoticon). A Unicode string is therefore a sequence of code points stored in memory as 8-bit bytes, where a suitable encoding such as UTF-8 provides the mapping between the Unicode string and the sequence of bytes.
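
As a brief illustration of the relationship between a character, its code point and its encoded bytes, consider the following sketch. It relies only on Python built-ins (ord(), hex() and the encode() string method, which are covered later in this module); the variable names are illustrative and are not part of the course source code.

```python
# Relate a character to its code point and to its UTF-8 bytes
grinning_face = '\U0001F600'

# ord() returns the code point as an integer
code_point = ord(grinning_face)
print(code_point)               # 128512
print(hex(code_point))          # 0x1f600
print(f'U+{code_point:04X}')    # U+1F600

# UTF-8 encodes this single code point as four 8-bit bytes
utf8_bytes = grinning_face.encode('utf-8')
print(utf8_bytes)               # b'\xf0\x9f\x98\x80'
print(len(grinning_face), len(utf8_bytes))   # 1 4
```

Note that the length of the string is measured in code points (1), whereas the length of the encoded bytes object is measured in bytes (4).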

Character encodings provide a means for software applications, including everything from office software and websites to global instant messaging platforms and integrated development environments, to map between the bytes stored by your computer and the characters in a character set, thereby enabling us to understand the representation on our screens. UTF-8 is the name of an encoding for all characters in the Unicode standard, and is the default character encoding for modern websites and online applications using HTML5.

ASCII was one of the earliest character sets used for communication between computers and on the Internet, and is the basis for both ISO-8859-1 and UTF-8, whose first 128 characters match those of ASCII. To learn more about character encodings, please visit W3C.

In Python 2 the default source code encoding was ASCII. As of Python 3, the default source code encoding is UTF-8, and every string created in Python 3 is stored as Unicode. The impact of this is that when creating a string in Python, you can include any Unicode character in the string literal (as of the Unicode 13.0.0 specification, there are 143,859 characters). You can even use Unicode characters when naming identifiers in Python 3, as demonstrated in the following examples.

# Include a Unicode character in a Python 3 string literal
hospital_in_french = 'hôpital'
hospital_in_japanese = '病院'
hospital_in_arabic = 'مستشفى'
print(f'The word for Hospital in French is: {hospital_in_french}')
print(f'The word for Hospital in Japanese is: {hospital_in_japanese}')
print(f'The word for Hospital in Arabic is: {hospital_in_arabic}')

# Include a Unicode character in a Python 3 identifier
日本の人口 = 126_500_000
print(f'The population of Japan is: {日本の人口}')

The Python encode() string method encodes a given string to a target encoding. If no target encoding is provided, the method defaults to UTF-8. Furthermore, an optional errors parameter may be provided that instructs Python how to deal with characters that cannot be encoded using the target encoding, as follows:

errors=backslashreplace - replaces characters that cannot be encoded with a backslashed escape sequence
errors=ignore - omits characters that cannot be encoded
errors=namereplace - replaces characters that cannot be encoded with a \N{...} escape sequence containing the character's Unicode name
errors=strict - raises a UnicodeEncodeError exception (default)
errors=replace - replaces characters that cannot be encoded with a question mark
errors=xmlcharrefreplace - replaces characters that cannot be encoded with an XML character reference

The encode() string method is demonstrated in the following example:

# Usage of the encode() string method
directions = 'The name of the hospital is Charité - Universitätsmedizin Berlin'
print(directions.encode(encoding="ascii", errors="backslashreplace"))
print(directions.encode(encoding="ascii", errors="ignore"))
print(directions.encode(encoding="ascii", errors="namereplace"))
print(directions.encode(encoding="ascii", errors="replace"))
print(directions.encode(encoding="ascii", errors="xmlcharrefreplace"))

The Python ascii() function can be used to return a readable ASCII-only representation of any Python object, including strings and data collection structures such as lists, tuples and dictionaries. This is useful when you want a printable representation of an object, where non-ASCII characters are replaced with an escape sequence, as follows:

# Usage of the ascii() function
names = ['Quddus', 'Gößmann', 'José']
print(ascii(names))

Finally, Python provides a convenient function called the ord() function that returns the integer Unicode code point value for any given character, as follows:

# Usage of the ord() function
dragon = "竜"
print(f'The Unicode code point value for {dragon} is {ord(dragon)}')

1.2. Escape Sequences

You may have noticed that the encode() string method and the ascii() function introduced above can replace characters with something called escape sequences. In this sub-section, we will discuss escape sequences in further detail, including what they are and a list of escape sequences available in Python 3.

As introduced in Control and Evaluations Part 2, in some cases the inclusion of specific characters in a string literal will cause Python to raise a SyntaxError exception - for example the existence of a double quote character in a string that itself is defined and enclosed by double quote characters. To overcome this and insert characters that would otherwise be illegal in a string, we may 'escape' those illegal characters by prefixing them with the \ backslash character as follows.

# Try to create a string containing illegal characters
illegal_string = "My name is "Jillur""

# Escape the illegal characters using the backslash character
legal_string = "My name is \"Jillur\""
print(legal_string)

Other escape characters in Python include:

\' - single quote
\" - double quote
\\ - backslash
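\ followed by a newline - the backslash and the newline are ignored, allowing a string literal to continue on the next line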
\a - ASCII bell (BEL)
\b - ASCII backspace (BS)
\f - ASCII formfeed (FF)
\n - ASCII linefeed (LF)
\r - ASCII carriage return (CR)
\t - ASCII horizontal tab (TAB)
\v - ASCII vertical tab (VT)

In addition to the escape characters above, the following escape sequences are also available in Python 3 and refer to characters by their octal or hexadecimal values, or by their Unicode names.

\ooo - character with an octal value of ooo (up to three octal digits)
\xNN - character with a hexadecimal value of NN
\N{name} - character named {name} in the Unicode standard
\uNNNN - character with the 16-bit hexadecimal value NNNN

These escape characters and sequences are demonstrated in the following examples:

Note that Jupyter Notebook may not render every escape sequence (for example, control characters such as \a and \b) in the same way as a terminal. The examples for this sub-section are therefore best run using the Python interpreter, invoked by simply entering and executing the python command in your command line shell.
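
The escape characters and sequences listed above can be sketched as follows. This is a minimal illustration rather than the original course example; the string values and variable names are chosen purely for demonstration.

```python
# Escape characters: quotes, backslash, tab and newline
print('It\'s a \"quoted\" word')
print('C:\\Users\\guest')
print('Column 1\tColumn 2\nRow 2')

# Escape sequences that identify a character by its value or by its name
octal_a = '\101'                          # octal 101 = decimal 65
hex_a = '\x41'                            # hexadecimal 41 = decimal 65
unicode_a = '\u0041'                      # 16-bit hexadecimal value
named_a = '\N{LATIN CAPITAL LETTER A}'    # Unicode character name
print(octal_a, hex_a, unicode_a, named_a) # A A A A
```

All four sequences in the second group resolve to the same character, since they all refer to the code point with the decimal value 65.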

1.3. Immutable Strings

Now that we have an understanding of the Unicode standard and Unicode code points, we can update our definition of strings in Python. Officially, strings are immutable sequences of Unicode code points, meaning that once created strings cannot be changed. If we try to update a character in a string, a TypeError exception will be raised, as follows:

# Try to change a character in a string
test_string = 'Hello World'
print(test_string[6])
test_string[6] = "Q"

We can further explore the immutability of strings by using the Python id() function. This function returns an integer identifier that is assigned to every Python object when it is created, and which is unique for the lifetime of that object. In the CPython interpreter the identifier is the object's address in memory, so it will generally change each time a Python program is executed (and should therefore not be used as a persistent key). In the example below, we create two variables with the same literal value - in this case, both variables will point to the same string in memory, as evidenced by the value returned by the id() function.

This example works because Python may reuse a single shared object for identical immutable values, such as string literals, rather than creating a new object each time. This process is known as string interning, but is not guaranteed by Python and is generally applied to small strings only.

# Apply the id() function to strings
my_first_string = 'abracadabra'
my_second_string = 'abracadabra'
print(f'id(my_first_string) = {id(my_first_string)}')
print(f'id(my_second_string) = {id(my_second_string)}')
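
Because automatic interning is not guaranteed, comparing identifiers is not a reliable way to test whether two strings have the same value. The following is a minimal sketch of the difference between equality and identity, and of how the standard library function sys.intern() can be used to request interning explicitly. The variable names are illustrative, and the use of the is identity operator is assumed to be familiar from earlier modules.

```python
import sys

# Build two equal strings at run time so that they are not necessarily shared
part = 'abra'
first = part + 'cadabra'
second = ''.join(['abra', 'cadabra'])

# Equality (==) compares values and is always reliable
print(first == second)    # True

# Identity (is) compares objects; the result here depends on the interpreter
print(first is second)

# sys.intern() returns the single shared copy of an equal string
first_interned = sys.intern(first)
second_interned = sys.intern(second)
print(first_interned is second_interned)    # True
```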

Furthermore, indexing a string returns a one-character string, which is itself an object with its own identifier. If we iterate over all the characters in a string, you will note that the id() function applied to the same character returns the same value. This is because CPython typically reuses a single shared object for each such one-character string, although this is an implementation detail that is not guaranteed by the language.

# Apply the id() function to each character in a string
for idx in range(0, len(my_first_string)):
    print(f'{my_first_string[idx]} = {id(my_first_string[idx])}')

Finally, consider the following Python program. For beginner Python developers, it may seem that we have indeed changed the value of a string. However in this example, the only thing that we have done is pointed the variable my_string from one string to another string - the strings themselves have not changed (as they are immutable). What has changed however is what the variable my_string is pointed at and references.

# Strings are immutable, but variables can be changed to point to different things
my_string = 'I am a Data Scientist'
print(my_string)
my_string = 'I am a Software Engineer'
print(my_string)

1.4. Multi-Line Strings

String literals that are triple quoted (i.e. start and end with either three single quotes or three double quotes) may span multiple lines, where any whitespace characters are considered part of the string literal.

my_multiline_string = '''Line 1\tEOL 
Line 2\t\tEOL
Line 3    EOL'''
print(my_multiline_string)

1.5. String Comparisons

We now know that strings are in fact immutable sequences of Unicode code points. However this introduces a problem when comparing strings that contain characters that may be represented by two different sequences of code points. For example take the German character ö. This can either be represented as U+00F6 (string of length 1), or the code point for 'o' which is U+006F along with the code point for combining diaeresis which is U+0308 (resulting in a total string length of 2).

# Represent the German umlaut using two different code point sequences
umlaut_sequence_1 = '\u00F6'
umlaut_sequence_2 = '\u006F\u0308'
print(f'{umlaut_sequence_1} has length {len(umlaut_sequence_1)}')
print(f'{umlaut_sequence_2} has length {len(umlaut_sequence_2)}')

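To address this, the Python standard library provides a unicodedata module containing a normalize() function that will convert such strings to one of several normal forms. In the composed forms ('NFC' and 'NFKC'), a base character followed by a combining character, such as the combining diaeresis, is converted into its single character equivalent of length 1 where one exists. In the decomposed forms ('NFD' and 'NFKD'), the reverse applies and a precomposed character is expanded into its base character followed by the combining character. Either way, once both strings have been normalized to the same form they will compare as equal.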

# Compare strings that contain different code point sequences
import unicodedata
print(umlaut_sequence_1 == umlaut_sequence_2)
print(unicodedata.normalize('NFD', umlaut_sequence_1) == unicodedata.normalize('NFD', umlaut_sequence_2))

The normalize() function takes as its first parameter the desired normalization form, which can be either 'NFC', 'NFKC', 'NFD' or 'NFKD'. Please refer to the official documentation on the unicodedata module for further information.
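
To make the difference between the composed and decomposed normal forms concrete, the following sketch normalizes both representations of ö using 'NFC' and 'NFD' and compares the resulting lengths. The variable names are illustrative only.

```python
import unicodedata

composed = '\u00F6'            # single code point
decomposed = '\u006F\u0308'    # 'o' followed by the combining diaeresis

# NFC composes: both inputs become a single code point
nfc_1 = unicodedata.normalize('NFC', composed)
nfc_2 = unicodedata.normalize('NFC', decomposed)
print(len(nfc_1), len(nfc_2), nfc_1 == nfc_2)    # 1 1 True

# NFD decomposes: both inputs become two code points
nfd_1 = unicodedata.normalize('NFD', composed)
nfd_2 = unicodedata.normalize('NFD', decomposed)
print(len(nfd_1), len(nfd_2), nfd_1 == nfd_2)    # 2 2 True
```

Whichever form is chosen, it should be applied consistently to both sides of a comparison.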

1.6. Advanced String Slicing

In the previous module, we studied how slicing notation can be applied to strings in order to extract subsequences of characters. An optional third parameter may be provided to slicing notation that defines the stride size, that is, the number of elements to move forwards after each element is extracted. By default the stride size is 1, however we may set it to any non-zero positive or negative integer value. A negative stride size instructs Python to process the object, in this case a string, in reverse order.

# Create a string
my_string = 'abracadabra'

# Extract characters of a string that have even indexes using the stride size argument
print(my_string[::2])

# Use a negative stride size argument to extract in reverse order
print(my_string[::-2])

# Extract a defined subset of characters with a stride size of 2 within that subset
print(my_string[4:10:2])

# Extract characters with even indexes starting from the 5th character
print(my_string[4::2])

1.7. Copying and Cloning Strings

Assigning a variable to a string object (or any object) in Python simply creates a binding between the variable and the object. In the following example, we assign the variables my_first_string and my_second_string to both point to the string with literal value abracadabra. At no point is a copy of the string created.

# Create two variables to point to the same string
my_first_string = 'abracadabra'
my_second_string = 'abracadabra'
print(f'id(my_first_string) = {id(my_first_string)}')
print(f'id(my_second_string) = {id(my_second_string)}')

In the event that we bind a variable to the result of slicing notation applied to a string, again the original string is not changed (as strings are immutable). Rather, a new string object containing the required subset of characters is created.

# Create a new variable bound to a subset of a string using slice notation
my_third_string = my_first_string[4:]
print(f'my_third_string = \'{my_third_string}\'')
print(f'id(my_third_string) = {id(my_third_string)}')

1.8. Common String Methods

As introduced in the previous module, Python provides many string methods. The following is a non-exhaustive list of the most common.

str.lower() - returns the string all in lowercase
str.upper() - returns the string all in uppercase
str.strip() - returns the string with leading and trailing whitespace characters removed
str.replace('find', 'replacewith') - returns the string with all instances of the first substring replaced with the second substring
str.find('substring') - returns the starting index of the first instance of the given substring, else returns -1 if no match is found
str.startswith('substring') - tests whether the string starts with the given substring
str.endswith('substring') - tests whether the string ends with the given substring
str.split('delimiter') - returns a list collection of substrings by splitting the original string on a given delimiter. If no delimiter is provided, then the string is split on all whitespace characters
delimiter.join([list]) - returns a string that joins all of the elements in a given list collection together using the given string delimiter

# Create a new string literal
my_string = "    Hello! My name is Jillur Quddus and I am a Chief Data Scientist and Principal Polyglot Software Engineer.    "
print(my_string)

# str.lower()
print(my_string.lower())

# str.upper()
print(my_string.upper())

# str.strip()
stripped_string = my_string.strip()
print(stripped_string)

# str.replace('find', 'replacewith')
initialized_string = stripped_string.replace('Jillur', 'J').replace('Quddus', 'Q')
print(initialized_string)

# str.find('substring')
print(initialized_string.find('Data Scientist'))

# str.startswith('substring')
print(initialized_string.startswith('Hello!'))

# str.endswith('substring')
print(initialized_string.endswith("Good Bye!"))

# str.split('delimiter')
print(initialized_string.split('!'))

# delimiter.join([list])
print('|'.join(['Chief Data Scientist', 'Principal Polyglot Software Engineer', 'Technical Architect']))

Two further string methods of note are the isXXX() category of methods, and capitalize(). The isXXX() category of methods are a group of string methods commonly used to validate user input and which return a boolean value after a check is performed to test the nature of the string, a non-exhaustive list of which is provided as follows:

str.isupper() - returns True if the string contains at least one cased character (i.e. a letter with an uppercase and lowercase form) and all of its cased characters are uppercase, else returns False. Characters without case, such as spaces, digits and punctuation, are ignored
str.islower() - returns True if the string contains at least one cased character and all of its cased characters are lowercase, else returns False. Characters without case are ignored
str.isalpha() - returns True if a string of non-zero length is comprised of letters only, else returns False
str.isalnum() - returns True if a string of non-zero length is comprised of letters and numbers only, else returns False
str.isdecimal() - returns True if a string of non-zero length is comprised of decimal digit characters only, else returns False
str.isspace() - returns True if a string of non-zero length is comprised of whitespace characters only (such as space, tab and newline characters), else returns False
str.istitle() - returns True if a string of non-zero length is comprised of words that all begin with an uppercase letter followed by lowercase letters, else returns False

# isupper()
print('MY NAME IS JILLUR'.isupper())

# islower()
print('my name is jillur Q'.islower())

# isalpha()
print('learning python'.isalpha())

# isalnum()
print('learning python 3'.isalnum())

# isdecimal()
print('01092020'.isdecimal())

# isspace()
print(' \t\n\t\v  '.isspace())

# istitle()
print('Jillur Quddus'.istitle())

Finally, the Python capitalize() string method will return a string where the first character is transformed into upper case and all remaining characters are transformed into lower case.

# capitalize()
print('capitalize me'.capitalize())

1.9. Common String Functions

Python provides a number of functions that may be applied to strings, a non-exhaustive list of which is provided as follows:

len(x) - returns the number of items in a given object. When that object is a string, the len() function will return the number of characters in that string
chr(x) - returns the character represented by the given integer representation of a valid Unicode code point
ord(x) - returns the integer Unicode code point of a given character

# len()
print(len('abracadabra'))

# chr()
print(chr(9786))

# ord()
print(ord('愛'))

2. Advanced Lists

As introduced in Control and Evaluations Part 2, a list in Python is an ordered and mutable (changeable) collection where duplicate elements (members/items) are allowed. Each element in a list is an object in its own right, whether a basic literal or a more complicated object.

2.1. Index and Negative Indexing

Each individual element in a list can be accessed by referencing its index number. The first element in a list has index number 0, the second element has index number 1 and so on. We may also access an element by its negative index number. The last element in a list has a negative indexing of -1, the penultimate element has a negative indexing of -2 and so on.

# Create a list containing the first ten square numbers
squares = [1, 4, 9, 16, 25, 36, 49, 64, 81, 100]
print(squares)

# Access elements in a list by their index numbers
print(squares[0])
print(squares[3])
print(squares[-1])
print(squares[-4])

2.2. Advanced List Slicing

In the previous module, we studied how slicing notation can be applied to lists to enable us to access and manipulate a specific subset of elements from a list in an efficient manner. An optional third parameter may be provided to slicing notation that defines the stride size, that is, the number of elements to move forwards after each element is extracted. By default the stride size is 1, however we may set it to any non-zero positive or negative integer value. A negative stride size instructs Python to process the list elements in reverse order.

# Extract elements from a list that have even indexes using the stride size argument
print(squares[::2])

# Use a negative stride size argument to extract elements in reverse order
print(squares[::-2])

# Extract a defined subset of elements with a stride size of 2 within that subset
print(squares[4:8:2])

# Extract elements with even indexes starting from the 5th element
print(squares[4::2])

2.3. Advanced List Slice Assignment

Unlike strings, lists are mutable. This means that Python permits us to change, replace, resize and delete elements of the list in place (i.e. operations performed on a list will change the original list and not create a copy). We can use slice notation combined with the assignment operator to manipulate lists as follows:

To help understand the following examples, remember that Python data structures are zero-indexed meaning that the first element has an index number of 0. Also recall that when using slice notation, if an end index number is not provided then Python will return all elements including the last element. But if an end index number is provided then Python will return all elements up until but not including the element at that index number.

# Define a list of numbers
my_numbers = [1, 3, 6, 10, 15, 21, 28, 36, 45, 55]
print(f'My list of numbers (first 10 triangle numbers): {my_numbers}')

# Substitute a subset of the list with new elements
my_numbers[1:3] = [4, 9]
print(f'My list of numbers (substitutions): {my_numbers}')

# Replace every Nth element with a new value, where N = 3
my_numbers[::3] = [1, 16, 49, 100]
print(f'My list of numbers (n-th replacements): {my_numbers}')

# Replace and resize a subset of the list
my_numbers[4:] = [25, 36, 49, 64]
print(f'My list of numbers (first 8 square numbers): {my_numbers}')

# Delete every Nth element starting from the 2nd element and where N = 2
del my_numbers[1::2]
print(f'My list of numbers (every other square number): {my_numbers}')
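
One detail of slice assignment is worth illustrating separately: a contiguous slice may be replaced by a list of a different length (thereby resizing the list), whereas a slice with a stride other than 1 must be assigned exactly as many elements as it selects. The following is a minimal sketch using an illustrative list; the try and except statements are used here only to display the resulting error.

```python
numbers = [0, 1, 2, 3, 4, 5]

# A contiguous slice may be replaced by a list of a different length
numbers[1:3] = [10, 20, 30]
print(numbers)    # [0, 10, 20, 30, 3, 4, 5]

# A slice with a stride must be assigned exactly as many elements as it selects.
# Here numbers[::2] selects four elements but only one is provided.
try:
    numbers[::2] = [99]
except ValueError as error:
    print(f'ValueError: {error}')

# The failed assignment leaves the list unchanged
print(numbers)    # [0, 10, 20, 30, 3, 4, 5]
```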

2.4. Shallow and Deep List Copies

If we assign a variable to a copy of a list created using slice notation or the Python list() function, then we are mapping that variable to a shallow copy of that list (the same principle applies to Python's other in-built mutable collections, such as dictionaries and sets, which provide their own copy() methods). This means that a new list collection object is created, but the elements in the new object are actually references to the same elements that exist in the original list. In other words, no copies of the elements themselves are made in the new list, and the original list is left unmodified. This is evidenced in the following examples, where we create shallow copies of a list and then modify the elements of a nested list in the original to see whether this change is reflected in the copies.

# Define a list of phrases
my_phrases = [['a', 'b', 'c'], ['d', 'e', 'f'], ['g', 'h', 'i']]
print(f'Original list of phrases: {my_phrases}')

# Create a shallow copy of the original list using slice notation, plus append a new element
my_shallow_phrases = my_phrases[:2] + [['j', 'k', 'l']]
print(f'Shallow copy of my phrases: {my_shallow_phrases}')

# Alternatively you may use the list() function to create a shallow copy
my_other_shallow_phrases = list(my_phrases) + [['j', 'k', 'l']]
print(f'Another shallow copy of my phrases: {my_other_shallow_phrases}')

# Make a change to the original list of phrases only
my_phrases[0][0] = 'ALPHA'
my_phrases[0][1] = 'BETA'
my_phrases[0][2] = 'GAMMA'
print(f'\nUpdated original list of phrases: {my_phrases}')

# Examine the shallow copies of the original list
print(f'Shallow copy of my phrases: {my_shallow_phrases}')
print(f'Another shallow copy of my phrases: {my_other_shallow_phrases}')

In order to create a real copy, or clone, of the original collection, we can make a deep copy. A deep copy not only creates a new collection object, but also recursively creates copies of the elements from the original. In other words a full clone of the original object and all of its elements is created. If we were to change the original collection, these changes would not be reflected in its clones. Python provides the ability to make a deep copy via its copy standard library module and associated deepcopy() function, as follows:

# Create a deep copy of the original list
import copy

my_phrases = [['a', 'b', 'c'], ['d', 'e', 'f'], ['g', 'h', 'i']]
my_deep_phrases = copy.deepcopy(my_phrases)
print(f'Original list of phrases: {my_phrases}')
print(f'Deep copy of my phrases: {my_deep_phrases}')

# Make a change to the original list of phrases only
my_phrases[0][0] = 'ALPHA'
my_phrases[0][1] = 'BETA'
my_phrases[0][2] = 'GAMMA'
print(f'\nUpdated original list of phrases: {my_phrases}')

# Examine the deep copy of the original list
print(f'Deep copy of my phrases: {my_deep_phrases}')
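
The difference between the two kinds of copy can also be checked directly by testing object identity. The following is a minimal sketch using a smaller illustrative list; it assumes familiarity with the is identity operator, which returns True only when both operands refer to the same object.

```python
import copy

original = [['a', 'b'], ['c', 'd']]
shallow = list(original)
deep = copy.deepcopy(original)

# Both copies are new outer list objects
print(shallow is original)          # False
print(deep is original)             # False

# The shallow copy shares its inner lists with the original
print(shallow[0] is original[0])    # True

# The deep copy has its own inner lists
print(deep[0] is original[0])       # False

# A change to a shared inner list is visible through the shallow copy only
original[0][0] = 'ALPHA'
print(shallow[0][0])                # ALPHA
print(deep[0][0])                   # a
```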

2.5. Iterating Lists

In the module Control and Evaluations Part 2, we introduced special control flow statements called loops that enable iteration over data structures or whilst a boolean condition is met. Both for and while loops can be applied to iterate over lists as follows:

The enumerate() function in the last example below enables us to access both the current index and value from iterating over an iterable object such as a list. Using this function, we can access both properties via a simple one-line for loop statement combined with the in operator.

# Create a list
my_diatomic_list = [0, 1, 1, 2, 1, 3, 2, 3, 1, 4, 3, 5, 2, 5, 3, 4]

# Iterate the list using a while loop
i = 0
while i < len(my_diatomic_list):
    print(f'Diatomic Sequence - Element #{i + 1}: {my_diatomic_list[i]}')
    i += 1

# Iterate the list using a for loop with the range function
for x in range(len(my_diatomic_list)):
    print(f'Diatomic Sequence - Element #{x + 1}: {my_diatomic_list[x]}')

# Iterate the list using a for loop with the in operator
for elem in my_diatomic_list:
    print(elem)

# Iterate the list using a for loop with the in operator and enumerate function
for idx, elem in enumerate(my_diatomic_list):
    print(f'Diatomic Sequence - Element #{idx + 1}: {elem}')

2.6. List Membership

The membership operators in and not in, as discussed in Control and Evaluations Part 1, allow us to test whether a given element can be found within a given list (or another type of collection).

# Test whether an element exists in a list
squares = [1, 4, 9, 16, 25]
x = 16
if x in squares:
    print(f"{x} exists in our list of square numbers")
else:
    print(f"{x} does not exist in our list of square numbers")

# Test whether an element is NOT in a list
y = 169
if y not in squares:
    print(f"{y} does not exist in our list of square numbers")
else:
    print(f"{y} exists in our list of square numbers")

2.7. List Comprehension

Python provides an elegant and intuitive means to create lists through list comprehension. List comprehension allows us to define a list in one statement, and is composed of three primary components:

Expression - any valid Python expression that returns a value, including invocation of methods and functions
Iterable object - any valid object that is iterable, such as a string, a collection, or the evaluation of the range() function
Member object - an object that will be bound to the current value in the iterable as it is being iterated

For example, the following Python statement creates a list of the first 8 square numbers. The expression is num * num, the iterable object is the evaluation of range(1, 9) (which evaluates to and returns the immutable sequence of numbers 1 - 8), and the member object is num which will iteratively take each value in the iterable object i.e. 1 - 8.

# Create a list of square numbers using list comprehension
squares = [num * num for num in range(1, 9)]
print(squares)
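
To see how the three components fit together, it may help to compare the list comprehension above with an equivalent for loop. The following sketch builds the same list both ways; the variable names are illustrative only.

```python
# Equivalent of [num * num for num in range(1, 9)] written as a for loop
squares_from_loop = []
for num in range(1, 9):                    # iterable object and member object
    squares_from_loop.append(num * num)    # expression
print(squares_from_loop)    # [1, 4, 9, 16, 25, 36, 49, 64]

# The list comprehension produces an identical list in a single statement
squares_from_comprehension = [num * num for num in range(1, 9)]
print(squares_from_loop == squares_from_comprehension)    # True
```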

Another example is provided below. In this example we create a list whose elements are the characters extracted from a string. The expression is character, the iterable object is the immutable string 'abibliophobia', and the member object is character which will iteratively take each character in the string as it is being iterated.

# Create a list formed of characters from a string
letters_in_abibliophobia = [character for character in 'abibliophobia']
print(letters_in_abibliophobia)

We can also use conditional statements at the end or near the beginning of the expression in order to introduce conditional logic. Placing the conditional statement at the end of the expression has the effect of performing filtering. And placing the conditional statement near the beginning of the expression provides the ability to change the member object value based on the condition being met.

In the example below, we create a list of all the prime numbers between 1 and 100. In this example, the expression is i, the iterable object is the evaluation of range(1, 101), the member object is i, and the conditional statement that has the effect of filtering is if isPrime(i) which calls the isPrime() function.

# Create a list of all the prime numbers between 1 and 100
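def isPrime(num):
    # Prime numbers are whole numbers greater than 1
    if num <= 1 or num % 1 > 0:
        return False
    # Test every candidate divisor from 2 up to and including num // 2
    for i in range(2, num // 2 + 1):
        if num % i == 0:
            return False
    return True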

prime_numbers = [i for i in range(1, 101) if isPrime(i)]
print(prime_numbers)

The final example below creates a list of all even numbers between 1 and 100, and replaces all the odd numbers with zero. In this example, the expression is i, the iterable object is the evaluation of range(1, 101), the member object is i, and the conditional statement that has the effect of determining the final value of the element in the list is if i % 2 == 0 else 0 which tests to see if the given number is even (in which case return it) or not (in which case return 0).

# Create a list of all the even numbers between 1 and 100, and replace the odd numbers with 0
even_numbers_zeroed_odd = [i if i % 2 == 0 else 0 for i in range(1, 101)]
print(even_numbers_zeroed_odd)

List comprehension is a simple, elegant and intuitive way to create lists using just one easy-to-understand Python statement, which also provides the ability to perform mapping and filtering in the same statement. As such, list comprehensions are a powerful tool in your development toolkit. In the next module, we will introduce similar comprehensions but for sets and dictionary data structures.

2.8. Common List Methods

The list collection in Python provides the following methods on the list object which can be used to access, insert, modify, copy and delete list elements.

list.append(element) - adds an element to the end of the list, in place
list.extend(newlist) - appends all the elements in the given new list to the end of the original list, in place
list.insert(index, element) - inserts the given element at the given index position, in place, shifting the subsequent elements to the right
list.remove(element) - removes, in place, the first element in the list whose value is equal to the given element. If no matching element is found, then Python will raise a ValueError exception. This can be avoided by first using the membership operator in an if statement to test for existence before removal, or by handling the exception.
list.pop([index]) - removes the element at the given index and returns it. The index number is optional, and if no index is given then list.pop() will remove the last element in the list and return it
list.clear() - removes all the elements from the list. This is equivalent to del list[:]
list.index(element[, start[, end]]) - searches for the first element in the list whose value is equal to the given element and returns its index number. If no matching element is found, then Python will raise a ValueError exception. This can be avoided by first using the membership operator in an if statement to test for existence before searching, or by handling the exception. Optionally we may provide start and end index numbers to limit the search to a subset of elements.
list.count(element) - returns the number of elements whose values are equal to the given element
list.sort() - sorts the elements in the list, in place
list.reverse() - reverses the order of the elements in the list, in place
list.copy() - returns a shallow copy of the list, the equivalent of list[:]. Note that simply assigning the list to another variable will NOT make a copy, but will instead only create another reference to the original list object. This means that any changes made to the original list will automatically be visible through the new variable, and vice versa.

# list.append(element)
squares.append(25)
print(squares)

# list.extend(newlist)
del more_squares[0]
squares.extend(more_squares)
print(squares)

# list.insert(index, element)
squares.insert(0, 0)
print(squares)

# list.remove(element)
squares.remove(0)
print(squares)

# list.pop([index])
squares.pop()
print(squares)

# list.clear()
squares.clear()
print(squares)

# Populate the list of square numbers again
squares.extend([1, 4, 9, 16, 25, 36, 49, 64, 81, 100])

# list.index(element[, start[, end]])
print(squares.index(64))

# list.count(element)
print(squares.count(169))

# list.sort()
squares.sort()
print(squares)

# list.reverse()
squares.reverse()
print(squares)

# list.copy()
squares_copy = squares.copy()
print(squares_copy)
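
The notes above on avoiding a ValueError, and on the difference between assigning a list and copying it, can be sketched as follows. This is an illustrative example only: the list name primes, the variables alias and snapshot, and the use of -1 as a "not found" marker are assumptions made for this sketch rather than conventions from the module.

```python
# Illustrative sketch using a hypothetical list named primes
primes = [2, 3, 5, 7, 11]

# Option 1: test for membership before calling remove()
if 4 in primes:
    primes.remove(4)

# Option 2: attempt the removal and handle the ValueError exception
try:
    primes.remove(4)
except ValueError:
    print('4 is not in the list, so nothing was removed')

# index() can be guarded in the same way (-1 is a hypothetical "not found" marker)
position = primes.index(7) if 7 in primes else -1
print(position)

# Assignment creates a second reference to the same list object...
alias = primes
# ...whereas copy() creates a new list object
snapshot = primes.copy()

primes.append(13)
print(alias)     # the alias sees the appended element
print(snapshot)  # the copy does not
print(alias is primes, snapshot is primes)
```

Testing for membership first reads naturally for a single removal, while handling the exception avoids searching the list twice.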

2.9. Common List Functions

Python provides the following (non-exhaustive) built-in functions that may be applied to lists.

len(list) - returns the number of elements in a given list.
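sorted(list, key=None, reverse=False) - returns a new sorted list built from the elements of the given list, leaving the original list unchanged. Optionally we can provide a key parameter, which is a function that is applied to each element to produce the value used for ordering. We can also provide a reverse parameter, which is a boolean, where False (the default) will sort the list in ascending order, and True will sort the list in descending order. Both optional parameters must be passed as keyword arguments.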
max(list) - returns the largest element in the list. If more than one element shares the maximum value, then only the first element is returned.
min(list) - returns the smallest element in the list. If more than one element shares the minimum value, then only the first element is returned.

# Create a list of numbers
my_numbers = [34, 13, 101, 4, 10, 24, 7]

# Create a list of letters
my_letters = ['j', 'z', 'e', 'f', 'x']

# len()
print(len(my_numbers))

# sorted()
print(sorted(my_numbers))
print(sorted(my_letters))

# max()
print(max(my_numbers))
print(max(my_letters))

# min()
print(min(my_numbers))
print(min(my_letters))
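
The optional key and reverse parameters of sorted() described above can be sketched as follows. The list named words and the result variable names are hypothetical and used for illustration only.

```python
# Illustrative sketch: the words list and the result names are hypothetical
my_numbers = [34, 13, 101, 4, 10, 24, 7]
words = ['banana', 'Cherry', 'apple', 'fig']

# reverse=True sorts in descending order
descending = sorted(my_numbers, reverse=True)
print(descending)

# By default strings are ordered by Unicode code point, so uppercase letters come first
default_order = sorted(words)
print(default_order)

# key=len orders the words by their length
by_length = sorted(words, key=len)
print(by_length)

# key=str.lower orders the words alphabetically, ignoring case
case_insensitive = sorted(words, key=str.lower)
print(case_insensitive)

# sorted() returns a new list, so the original list is unchanged
print(my_numbers)
```

Because the sort is stable, elements that share the same key value (such as 'banana' and 'Cherry', which both have 6 characters) keep their original relative order.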

3. Lists in Lists

In mathematics, a matrix is an ordered rectangular array of numbers. Matrices can be added, subtracted, multiplied and transposed (i.e. exchanging rows for columns), and are used to represent systems of linear equations and linear transformations. Matrices are one of the foundational data structures that underpin Linear Algebra - the fundamental mathematical language behind much of modern computing, including data science and artificial intelligence, as well as applied physics, engineering and economics amongst other disciplines.

We explore matrices in further detail in our Linear Algebra course for data scientists. For the purposes of this module however, we define the dimensions of a matrix to be the number of rows and number of columns in that matrix, and in that order. And we define an element of a matrix to be a specific matrix entry. In the following image of a generalised matrix, its dimensions are m x n. And a₂₂ is an example of a specific element in that matrix.
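
As a sketch of the conventional notation for a generalised matrix with m rows and n columns, where the element in row i and column j is written aᵢⱼ, such a matrix may be written as shown below. The symbol A and this LaTeX rendering are assumptions made for illustration and are not taken from the original image.

```latex
A =
\begin{pmatrix}
a_{11} & a_{12} & \cdots & a_{1n} \\
a_{21} & a_{22} & \cdots & a_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
a_{m1} & a_{m2} & \cdots & a_{mn}
\end{pmatrix}
```

Here a₂₂ is the element found in the second row and the second column.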

In Python, there exists an industry-standard library called NumPy that provides out-of-the-box data structures and functions that are optimised to model, store, operate on and transform matrices (amongst other data structures) that can be applied to scientific computing applications including data science and artificial intelligence. We explore NumPy in further detail in our Python for Data Analysis and Linear Algebra courses for data scientists respectively.

3.1. Multi-Dimensional Arrays

For the purposes of this module however, a matrix can be represented easily as lists nested within lists, sometimes referred to as n-dimensional or multi-dimensional arrays. We can create multi-dimensional arrays or matrices using any of the list instantiation techniques introduced in this module.

3.1.1. Two-Dimensional Arrays

A simple method to create a two-dimensional array is to explicitly define a list of lists as follows:

# Create a two-dimensional array
my_2d_array = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
for row in my_2d_array:
    for elem in row:
        print(elem, end=' ')
    print()

Alternatively we can create a two-dimensional array using list comprehension. In the following example, we create a two-dimensional array of zeroes with dimensions 3 x 3:

# Create a two-dimensional array with list comprehension
number_cols = 3
number_rows = 3
my_2d_array = [[0 for i in range(number_cols)] for j in range(number_rows)]
for row in my_2d_array:
    for elem in row:
        print(elem, end=' ')
    print()
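
One reason to build each row with list comprehension, rather than repeating a single row with the * operator, is that repetition copies references to one and the same inner list. The sketch below illustrates the difference; the variable names shared_rows and independent_rows are hypothetical.

```python
# Illustrative sketch: shared_rows and independent_rows are hypothetical names
number_cols = 3
number_rows = 3

# Repeating the outer list with * creates three references to ONE inner list
shared_rows = [[0] * number_cols] * number_rows
shared_rows[0][0] = 1
print(shared_rows)       # the change appears in every row

# The list comprehension creates a new inner list for each row
independent_rows = [[0 for i in range(number_cols)] for j in range(number_rows)]
independent_rows[0][0] = 1
print(independent_rows)  # the change appears in the first row only
```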

In order to access a specific element in a multi-dimensional array, we can use square brackets containing the indices of the element that we are interested in. For example we can access the elements in column 1 and row 1, column 2 and row 2, and column 3 and row 3 as follows (remembering that collections in Python are zero-indexed):

# Access elements using square brackets and indices
my_2d_array = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(my_2d_array[0][0])
print(my_2d_array[1][1])
print(my_2d_array[2][2])
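
The matrix terms introduced earlier (dimensions, elements and transposition) can be related to nested lists with the following minimal sketch. It assumes a rectangular array in which every row has the same length, and the names my_matrix, second_column and transposed are hypothetical.

```python
# Illustrative sketch: assumes every row in the array has the same length
my_matrix = [[1, 2, 3], [4, 5, 6]]

# The dimensions are the number of rows followed by the number of columns
number_rows = len(my_matrix)
number_cols = len(my_matrix[0])
print(number_rows, 'x', number_cols)

# The element in row 2 and column 2 is found at index [1][1], as lists are zero-indexed
element_2_2 = my_matrix[1][1]
print(element_2_2)

# Extract a single column by taking the same index from every row
second_column = [row[1] for row in my_matrix]
print(second_column)

# Transpose the array (exchange rows for columns) using nested list comprehension
transposed = [[row[c] for row in my_matrix] for c in range(number_cols)]
print(transposed)
```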

Finally, since we can create multi-dimensional arrays with lists, any of the methods and functions discussed in this module that can be applied to lists can also be applied to the lists within our multi-dimensional arrays as follows:

# Create a two-dimensional array
my_2d_array = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

# Update a specific element in our two-dimensional array
my_2d_array[1][1] = 0

# Insert a new row in our two-dimensional array at a given row number using the insert method
my_2d_array.insert(2, [9, 9, 9])

# Append a new row to our two-dimensional array using the append method
my_2d_array.append([7, 4, 7])

# Append elements to an existing row in our two-dimensional array at a given row number using the extend method
my_2d_array[0].extend([0, 0])
my_2d_array[1].extend([0, 0])
my_2d_array[2].extend([0, 0])
my_2d_array[3].extend([0, 0])
my_2d_array[4].extend([0, 0])

# Reverse the order of the elements in a given row in our two-dimensional array using the reverse method
my_2d_array[2].reverse()

# Delete a row at a given row number from our two-dimensional array
del my_2d_array[3]

# Display our final two-dimensional array
for row in my_2d_array:
    for elem in row:
        print(elem, end=' ')
    print()

3.1.2. Three-Dimensional Arrays

When creating multi-dimensional arrays in any programming language, once the number of dimensions N is greater than 2 it becomes significantly more difficult to conceptualise and visualise the array in your mind. NumPy provides functions that are optimised for handling multi-dimensional arrays, and is strongly recommended for any non-trivial applications that require the handling and processing of multi-dimensional data.

However for the purposes of this sub-section, we will continue to use the Python standard library to create multi-dimensional arrays where N is 3, referred to as three-dimensional arrays or cubes. Three-dimensional arrays can be created in Python just as you would create two-dimensional arrays, either explicitly as lists within lists within a list, or using list comprehension as follows:

In the example below we import a module from the Python standard library called pprint. This module enables us to 'pretty-print' arbitrary data structures in Python with a focus on readability.

import pprint

# Create a three-dimensional array of size 3 x 3 x 3
my_3d_array = [[[x + y + z for x in range(3)] for y in range(3)] for z in range(3)]
pprint.pprint(my_3d_array)

# Access a list element in our three-dimensional array
print(my_3d_array[1][2])

# Access a specific element in our three-dimensional array 
print(my_3d_array[2][0][1])
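
The alternative approach mentioned above, in which a three-dimensional array is defined explicitly as lists within lists within a list, can be sketched as follows. The name my_cube, its 2 x 2 x 3 shape and its values are hypothetical and chosen for illustration.

```python
import pprint

# Illustrative sketch: explicitly define a three-dimensional array of size 2 x 2 x 3
my_cube = [
    [[1, 2, 3], [4, 5, 6]],
    [[7, 8, 9], [10, 11, 12]],
]
pprint.pprint(my_cube)

# The size of each dimension, assuming every nested list has a consistent length
shape = (len(my_cube), len(my_cube[0]), len(my_cube[0][0]))
print(shape)

# Access a specific element using one index per dimension
print(my_cube[1][0][2])

# Flatten the three-dimensional array into a single list using list comprehension
flattened = [elem for layer in my_cube for row in layer for elem in row]
print(flattened)
```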

4. Comparing Sequences

Strings and lists are both examples of sequence object types, where strings are immutable sequences of Unicode code points and lists are ordered and mutable collections of elements. It is possible to compare a sequence object with another sequence object of the same type using the comparison operators available in Python. Python achieves this by applying a deterministic lexicographical ordering algorithm, where the elements in both sequences are compared pair by pair, starting from the first element - the first pair of elements that differ determines the outcome of the comparison. If the elements being compared are themselves sequences of the same type, then they are compared recursively in the same way. If one sequence runs out of elements before any difference is found, then the shorter sequence is considered the smaller one. If, when the ends of both sequences are reached, all the items are equal then the sequence objects are considered equal. In the case of string objects, the Unicode code point number is compared for each individual character, as follows:

# Compare two given strings using comparison operators
print("Hello" > "Hello")
print("Hello" < "Hello")
print("abracadabra" > "Hello")
print("abracadabra" < "Hello")
print('a' < 'b' < 'c')

# Compare two given lists using comparison operators
print([1, 2, 3] > [1, 2, 3])
print([1, 2, 3] < [1, 2, 3])
print([1, 2, 3] > [1, 2, 4])
print([1, 2, 3] < [1, 2, 4])
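
The lexicographical ordering described above can be examined more closely with the built-in ord() function, which returns the Unicode code point number of a given character. The following is a minimal sketch, and the variable names used to hold the results are hypothetical.

```python
# Illustrative sketch: the result variable names are hypothetical

# Lowercase 'a' has a larger code point number than uppercase 'H'
print(ord('H'), ord('a'))

# The first pair of characters differs ('a' versus 'H'), which decides the outcome
lowercase_is_greater = 'abracadabra' > 'Hello'
print(lowercase_is_greater)

# When one sequence is a prefix of the other, the shorter sequence is the smaller one
string_prefix_is_smaller = 'Hell' < 'Hello'
list_prefix_is_smaller = [1, 2] < [1, 2, 3]
print(string_prefix_is_smaller, list_prefix_is_smaller)

# The first differing pair decides the outcome, regardless of the later elements
first_difference_decides = [1, 2, 99] < [1, 3]
print(first_difference_decides)

# Nested sequences of the same type are compared recursively
nested_comparison = [1, [2, 3]] < [1, [2, 4]]
print(nested_comparison)
```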

Summary

In this module we have covered advanced operations using strings and lists in Python. We now have the ability to encode strings to a target encoding, and to compare strings that contain the same character represented by two different sequences of code points. We have also gained an advanced understanding of escape characters, advanced string slicing techniques, and knowledge of common string methods and functions. We are now able to apply advanced slicing techniques to lists in Python, as well as the ability to make shallow and deep copies of lists, and to create lists using list comprehension. Finally we have gained an advanced knowledge of common list methods and functions, and how lists may be used to create and manipulate multi-dimensional arrays and matrices.

Homework

Please write Python programs for the following exercises. There may be many ways to solve the challenges below - first focus on writing a working Python program, and thereafter look for ways to make it more efficient using the techniques discussed in this module.

List Comprehension - No Vowels Allowed
Write a Python program that uses list comprehension to return all the characters in a given string that are NOT vowels. For example your Python program should return the following list given the string 'Jillur': ['J', 'l', 'l', 'r']
List Comprehension - No Large Words Allowed
Write a Python program that uses list comprehension to return all the words in a given string that are less than 5 characters in length. For example your Python program should return the following list given the string 'Python Java C++ Go Scala Kotlin': ['Java', 'C++', 'Go']
List Comprehension - Factorials
Write a Python program that uses list comprehension to create a list of the first 10 factorial numbers. As such, your Python program should print: [1, 2, 6, 24, 120, 720, 5040, 40320, 362880, 3628800].
Two-Dimensional Array - Numbers
Write a Python program that uses list comprehension to create the following two-dimensional array of dimensions 3 x 3:

Two-Dimensional Array - Multiplication Table
Write a Python program that uses list comprehension to create the following two-dimensional array of dimensions 5 x 5, where each row/column contains the first 5 entries of the times tables for that row/column number (starting from 1):

A 5 x 5 matrix of multiplication table numbers

What's Next?

In the next module, we will cover other common collection data structures available in Python including tuples and dictionaries, and associated common methods and functions.

Jillur Quddus

Founder & Chief Data Scientist

Curriculum

Course Module • Jillur Quddus

4. Data Aggregates Part 1

Introduction to Python

