3.5. Arrays#

Amanda R. Kube Jotte

An array is a data structure that stores a collection of elements, usually numbers. Arrays can look similar to lists, but they are designed for numerical work. All elements in an array must be the same type, which makes arrays faster and more efficient for calculations than lists.

In Python, arrays are provided by the numpy library. Arrays can be one-dimensional (like a simple list) or multi-dimensional (like a table or grid). In this section, we will explore how to create arrays and use some common functions with them.

Creating Arrays#

The first thing we need to do when working with arrays is import numpy. If the import causes an error, the numpy library has not been installed in the version of Python that we are running, and we need to install numpy first.

This makes all of the objects, functions, and methods in the numpy library available to us in our current Python session.

import numpy

Recall from when we worked with the math module in an earlier section that we need to use the name of the library followed by a . to access its functionality. This . is called the dot operator. We also use the dot operator when calling methods on objects.

With numpy, we use the array() function to create new arrays. The array() function takes in a sequence of data, such as a list, and turns it into an array.

numpy.array([1, 2, 3])
array([1, 2, 3])

Note

Data scientists don’t usually type out the full name of the library numpy.
Instead, it is convention to abbreviate it as np. This practice is called aliasing.

We can create this alias by typing:

import numpy as np

Now Python knows that np refers to numpy, and we can rewrite our earlier code as:

np.array([1, 2, 3])
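Because every element of an array must share one type, numpy converts mixed input to a single common type when the array is created. The short sketch below illustrates this using the dtype attribute, which reports the element type. The variable names whole and mixed are made up for this illustration.

```python
import numpy as np

# All integers: numpy picks an integer type.
whole = np.array([1, 2, 3])

# Mixing integers and a float: every element is stored as a float.
mixed = np.array([1, 2, 3.5])

print(whole.dtype)  # an integer type (the exact width depends on the platform)
print(mixed.dtype)  # float64
print(mixed)
```

Note that the integers 1 and 2 in the second array are stored as floats so that all elements match.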

You can also use the functions np.zeros() and np.ones() to create an array of zeros or ones, respectively. Each function takes the number of values you want in your array as an argument.

np.zeros(4)
array([0., 0., 0., 0.])
np.ones(6)
array([1., 1., 1., 1., 1., 1.])

Sometimes we want to create arrays containing a long series of values. It can be time-consuming to type out every number by hand.

We can use the np.arange() function to generate an array of numbers with uniform spacing.

  • With one argument, arange(n) generates numbers starting at 0 and ending at n-1.

  • With two arguments, arange(start, stop) generates numbers starting at start and ending at stop - 1.

  • With three arguments, arange(start, stop, step) generates numbers starting at start, ending before stop, and spaced by step.

np.arange(5)
array([0, 1, 2, 3, 4])
np.arange(2, 7)
array([2, 3, 4, 5, 6])
np.arange(1, 10, 2)
array([1, 3, 5, 7, 9])
np.arange(1, 2, 0.1)
array([1. , 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9])

Note

Python has a built-in function range() that works similarly to np.arange(), but it returns a range object (which can be converted to a list with list()) instead of a numpy array. np.arange() can also use decimal (float) step sizes, which is not possible with plain range().
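The difference between the two can be checked directly. The sketch below is only an illustration (the names r and halves are made up): it shows what range() returns and what happens when it is given a float step.

```python
import numpy as np

r = range(5)
print(type(r))       # <class 'range'>
print(list(r))       # [0, 1, 2, 3, 4]
print(np.arange(5))  # [0 1 2 3 4]

# range() rejects a float step, while np.arange() accepts one.
try:
    range(1, 2, 0.5)
except TypeError as err:
    print("range error:", err)

halves = np.arange(1, 2, 0.5)
print(halves)
```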

An Aside

You may notice that NumPy prints 1. instead of 1.0.

This is just a formatting choice. numpy shows numbers in a compact way, and it does not print the trailing zero after the decimal point if it’s not needed.

The value is still the floating-point number 1.0 in Python — it’s only being displayed as 1..

One of the biggest advantages of arrays is that they allow for convenient elementwise calculations. This means that when you perform a mathematical operation on arrays, the operation is applied to each element in turn.

For example, multiplying two arrays of the same length multiplies each pair of elements:

a = np.array([1, 2, 3])
b = np.array([10, 20, 30])

a * b
array([10, 40, 90])

Note

This behavior is different from lists. If you multiply two lists, Python will give you an error. If you multiply a list by a number, it repeats the list instead of doing elementwise multiplication:

[1, 2, 3] * 3   # [1, 2, 3, 1, 2, 3, 1, 2, 3]
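To make the contrast concrete, the sketch below performs the same pairwise multiplication with plain lists and with arrays. The names prices and quantities are illustrative. With lists, the elementwise product has to be built by hand, here with zip() and a list comprehension.

```python
import numpy as np

prices = [1, 2, 3]
quantities = [10, 20, 30]

# Multiplying two lists is not defined, so Python raises a TypeError.
try:
    prices * quantities
except TypeError as err:
    print("lists:", err)

# With lists, we pair up the elements ourselves.
list_result = [p * q for p, q in zip(prices, quantities)]

# With arrays, the * operator already works elementwise.
array_result = np.array(prices) * np.array(quantities)

print(list_result)   # [10, 40, 90]
print(array_result)  # [10 40 90]
```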

You can add, subtract, or divide arrays similarly:

b - a + b
array([19, 38, 57])
a / b
array([0.1, 0.1, 0.1])

Multidimensional Arrays and Indexing#

When performing calculations with multiple arrays, the shapes (dimensions) of the arrays must be compatible.

You can check the shape of an array using the .shape attribute. An attribute is information that belongs to an object.

Just like we use the dot operator . to call a method (a function that belongs to an object), we also use it to access attributes (data that belongs to an object). But, unlike when we use methods, we do not include parentheses () after an attribute.

For example, we can use the .shape attribute to tell us the dimensions of an array:

my_array = np.array([5, 7, 9, 11])
my_array.shape
(4,)

Warning

Many students, understandably, confuse functions, methods, and attributes.

Remember:

  • Functions are reusable blocks of code that perform a task. Sometimes functions are contained in libraries so we need to use the dot operator to access them: np.arange().

  • Methods are functions that belong to a type of object and act on those objects. We also need the dot operator when using methods, but it comes after an object not a library: list.append()

  • Attributes are information about an object. They, like methods, belong to the object so we need the dot operator to access them. However, they do not act on the object, so we don’t need to use parentheses: array.shape.
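The three cases can be seen side by side in a few lines. This is a small sketch with made-up variable names. It uses the array method reshape() and the attributes ndim and size, which numpy arrays provide alongside shape.

```python
import numpy as np

# Function from a library: library name, dot, function name, parentheses.
values = np.arange(6)

# Method: object, dot, method name, parentheses.
grid = values.reshape((2, 3))

# Attributes: object, dot, attribute name, and no parentheses.
print(grid.shape)  # (2, 3)
print(grid.ndim)   # 2
print(grid.size)   # 6
```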

The shape attribute of an array is a tuple containing the dimensions of the array. The code above produced the tuple (4,) which tells us that there are 4 elements in the array. As there is only one element in the tuple, we also know that this is a one-dimensional (1d) array. You can think of it as a single row or a simple line of numbers.

We can also make arrays with more than one dimension. Below is a two-dimensional array made by providing np.array() with a list of lists.

dd = np.array([[1, 2, 3, 4],
              [5, 6, 7, 8],
              [9, 10, 11, 12]])
print(dd)
dd.shape
[[ 1  2  3  4]
 [ 5  6  7  8]
 [ 9 10 11 12]]
(3, 4)

Now the shape tuple contains two numbers, (3, 4) for 3 rows and 4 columns. You can think of a 2d array like a table or spreadsheet.

Note

In the code above, we used print() to display an array for the first time. When you evaluate an array directly in the console (i.e., when you display it by placing it on the last line of a code cell), Python shows it with the array() wrapper to indicate its type.

As we saw in the previous chapter, the print() function formats its input in a more natural, human-readable way. For arrays, this means the data is displayed cleanly without the array() wrapper.

Higher dimensions are possible, too. A (2, 3, 4) array would be three-dimensional (3D) — you can think of it as a stack of 2 separate 3×4 tables.

ddd = np.array([[[1, 2, 3, 4],
                 [5, 6, 7, 8],
                 [9, 10, 11, 12]],
                [[13, 14, 15, 16],
                 [17, 18, 19, 20],
                 [21, 22, 23, 24]]])
print(ddd)
ddd.shape
[[[ 1  2  3  4]
  [ 5  6  7  8]
  [ 9 10 11 12]]

 [[13 14 15 16]
  [17 18 19 20]
  [21 22 23 24]]]
(2, 3, 4)

Recall that np.zeros() and np.ones() were used earlier to create 1d arrays of zeros and ones, respectively. These can also be used to create multidimensional arrays by providing the desired shape as a tuple.

np.zeros((6,7))
array([[0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0.]])
np.ones((2,2,2))
array([[[1., 1.],
        [1., 1.]],

       [[1., 1.],
        [1., 1.]]])

We can retrieve individual elements of arrays by indexing. For 1d arrays, indexing works in the same way it works for lists.

print(my_array)
my_array[2]
[ 5  7  9 11]
np.int64(9)
my_array[1:3]
array([7, 9])

When working with a multidimensional array, the same indexing rules apply as with one-dimensional arrays. The difference is that we now have to provide an index for each dimension of the array.

You can think of a multidimensional array as an array of arrays. Each inner array has its own indices, and we use a comma-separated list of indices to locate a specific element.

For example, the following is a two-dimensional array.

j = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])
j
array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

Providing a single index chooses one of the inner arrays.

j[0]
array([1, 2, 3])

Providing a second index chooses an element within that array.

j[0,1]
np.int64(2)

We can also use slices, like we did with lists.

j[1, 1:3]
array([5, 6])

If we have a 3-dimensional array, we then need 3 indices. See if you can follow this example.

k = np.array([[[1, 2, 3, 4], [5, 6, 7, 8]],
              [[9, 10 , 11, 12], [13, 14, 15, 16]]])

print(k.shape)

k[1,0,1:2:]
(2, 2, 4)
array([10])

The first index, 1, selects the second element in the outer array: [[9, 10 , 11, 12], [13, 14, 15, 16]]. The second index, 0, selects the first element of that array: [9, 10 , 11, 12]. Then the slice 1:2: takes elements of the resulting array starting at index 1 and stopping before index 2. The trailing colon leaves the step at its default of 1. This results in only 10, as we stop before reaching 11.
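To see how the start, stop, and step parts of a slice change the result, the sketch below applies three different slices to the same inner array. The names row, stop_at_2, every_other, and stop_at_3 are made up for this illustration.

```python
import numpy as np

k = np.array([[[1, 2, 3, 4], [5, 6, 7, 8]],
              [[9, 10, 11, 12], [13, 14, 15, 16]]])

row = k[1, 0]            # the inner array [9, 10, 11, 12]

stop_at_2 = row[1:2]     # start at 1, stop before 2, default step of 1
every_other = row[1::2]  # start at 1, run to the end, step of 2
stop_at_3 = row[1:3:2]   # start at 1, stop before 3, step of 2

print(stop_at_2)    # [10]
print(every_other)  # [10 12]
print(stop_at_3)    # [10]
```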

Broadcasting#

When you do mathematical operations on arrays, NumPy normally expects the arrays to have the same shape. But sometimes, the shapes are different. For example, we might add a scalar to a whole array.

Instead of giving an error, numpy uses a set of rules called broadcasting. Broadcasting is the process of automatically stretching/repeating one array so that its shape matches the other, allowing elementwise operations to work.

To decide if two arrays can be broadcast, numpy compares their shapes from right to left. For each dimension, one of these must be true:

  1. The sizes are the same, or

  2. One of the sizes is 1.

If all dimensions meet these conditions, the arrays are compatible.
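The two rules can be written out as a small check. The function can_broadcast below is a hypothetical helper for illustration, not part of numpy. It walks both shapes from right to left and treats a missing dimension as size 1, which is how numpy handles arrays with different numbers of dimensions.

```python
from itertools import zip_longest

def can_broadcast(shape_a, shape_b):
    """Hypothetical helper: apply the two rules to each pair of sizes."""
    # Walk both shapes from the right; a missing dimension counts as size 1.
    pairs = zip_longest(reversed(shape_a), reversed(shape_b), fillvalue=1)
    return all(a == b or a == 1 or b == 1 for a, b in pairs)

print(can_broadcast((3,), (1,)))    # True
print(can_broadcast((3, 3), (3,)))  # True
print(can_broadcast((3,), (2,)))    # False
```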

For example, the arrays below have different shapes…

larger = np.array([1, 2, 3])
smaller = np.array([5])

print(larger.shape)
print(smaller.shape)

larger + smaller
(3,)
(1,)
array([6, 7, 8])

…but numpy broadcasts the smaller array to make it the same size as the larger array, essentially performing the following:

larger + np.array([5, 5, 5])
array([6, 7, 8])

This would also work if we used the scalar 5 instead of a single-element array.

larger + 5
array([6, 7, 8])

This allows us to do calculations like:

larger / 10 + 5
array([5.1, 5.2, 5.3])

The following would also work:

m = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

n = np.array([10, 20, 30])

print(m.shape, n.shape)

m + n
(3, 3) (3,)
array([[11, 22, 33],
       [14, 25, 36],
       [17, 28, 39]])

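From right to left, we first compare 3 and 3, which are exactly the same. The array n has no further dimensions, so numpy treats its missing dimension as having size 1 and compares 3 and 1. That pair contains a 1, so the shapes are compatible.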

Not all shapes can be broadcast together.

x = np.array([1, 2, 3])
y = np.array([1, 2])

print(x.shape, y.shape)

x + y
(3,) (2,)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[30], line 6
      2 y = np.array([1, 2])
      4 print(x.shape, y.shape)
----> 6 x + y

ValueError: operands could not be broadcast together with shapes (3,) (2,) 

Above, both arrays are one-dimensional, so there is only one pair of sizes to compare: 3 and 2. The sizes do not match and neither is 1, so numpy raises a ValueError.
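One way to combine arrays like these is to give one of them a dimension of size 1, so that the broadcasting rules are satisfied. This is offered as an illustration rather than as the only approach. The sketch below reshapes y into a column of shape (2, 1) and adds each of its values to every element of x. The names y_col and table are made up.

```python
import numpy as np

x = np.array([1, 2, 3])
y = np.array([1, 2])

# Shapes (3,) and (2, 1): right to left we compare 3 with 1, then a
# missing dimension (treated as 1) with 2, so broadcasting succeeds.
y_col = y.reshape((2, 1))
table = x + y_col

print(table.shape)  # (2, 3)
print(table)
```

The result has one row for each value in y and one column for each value in x.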

Reshaping and Combining Arrays#

So far, we have worked with arrays in one shape at a time — a row of numbers, or a simple table. But in practice, data does not always arrive in the shape we want. Sometimes we need to reorganize it into rows and columns, or prepare it for broadcasting with another array.

NumPy has a reshape() function that lets us change the shape of an array without changing the data itself. For example, the 1d array below…

flat = np.arange(1,101)
flat
array([  1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12,  13,
        14,  15,  16,  17,  18,  19,  20,  21,  22,  23,  24,  25,  26,
        27,  28,  29,  30,  31,  32,  33,  34,  35,  36,  37,  38,  39,
        40,  41,  42,  43,  44,  45,  46,  47,  48,  49,  50,  51,  52,
        53,  54,  55,  56,  57,  58,  59,  60,  61,  62,  63,  64,  65,
        66,  67,  68,  69,  70,  71,  72,  73,  74,  75,  76,  77,  78,
        79,  80,  81,  82,  83,  84,  85,  86,  87,  88,  89,  90,  91,
        92,  93,  94,  95,  96,  97,  98,  99, 100])

…can be reshaped to have 20 rows and 5 columns…

np.reshape(flat, (20,5))
array([[  1,   2,   3,   4,   5],
       [  6,   7,   8,   9,  10],
       [ 11,  12,  13,  14,  15],
       [ 16,  17,  18,  19,  20],
       [ 21,  22,  23,  24,  25],
       [ 26,  27,  28,  29,  30],
       [ 31,  32,  33,  34,  35],
       [ 36,  37,  38,  39,  40],
       [ 41,  42,  43,  44,  45],
       [ 46,  47,  48,  49,  50],
       [ 51,  52,  53,  54,  55],
       [ 56,  57,  58,  59,  60],
       [ 61,  62,  63,  64,  65],
       [ 66,  67,  68,  69,  70],
       [ 71,  72,  73,  74,  75],
       [ 76,  77,  78,  79,  80],
       [ 81,  82,  83,  84,  85],
       [ 86,  87,  88,  89,  90],
       [ 91,  92,  93,  94,  95],
       [ 96,  97,  98,  99, 100]])

…or to be a 10 by 10 grid.

np.reshape(flat, (10,10))
array([[  1,   2,   3,   4,   5,   6,   7,   8,   9,  10],
       [ 11,  12,  13,  14,  15,  16,  17,  18,  19,  20],
       [ 21,  22,  23,  24,  25,  26,  27,  28,  29,  30],
       [ 31,  32,  33,  34,  35,  36,  37,  38,  39,  40],
       [ 41,  42,  43,  44,  45,  46,  47,  48,  49,  50],
       [ 51,  52,  53,  54,  55,  56,  57,  58,  59,  60],
       [ 61,  62,  63,  64,  65,  66,  67,  68,  69,  70],
       [ 71,  72,  73,  74,  75,  76,  77,  78,  79,  80],
       [ 81,  82,  83,  84,  85,  86,  87,  88,  89,  90],
       [ 91,  92,  93,  94,  95,  96,  97,  98,  99, 100]])

Be careful that the dimensions you provide to reshape() match the size of the data you are reshaping. For example, we cannot fit 100 values into 10 rows and 9 columns…

np.reshape(flat, (10,9))
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[36], line 1
----> 1 np.reshape(flat, (10,9))

File ~/Library/Python/3.9/lib/python/site-packages/numpy/core/fromnumeric.py:285, in reshape(a, newshape, order)
    200 @array_function_dispatch(_reshape_dispatcher)
    201 def reshape(a, newshape, order='C'):
    202     """
    203     Gives a new shape to an array without changing its data.
    204 
   (...)
    283            [5, 6]])
    284     """
--> 285     return _wrapfunc(a, 'reshape', newshape, order=order)

File ~/Library/Python/3.9/lib/python/site-packages/numpy/core/fromnumeric.py:59, in _wrapfunc(obj, method, *args, **kwds)
     56     return _wrapit(obj, method, *args, **kwds)
     58 try:
---> 59     return bound(*args, **kwds)
     60 except TypeError:
     61     # A TypeError occurs if the object does have such a method in its
     62     # class, but its signature is not identical to that of NumPy's. This
   (...)
     66     # Call _wrapit from within the except clause to ensure a potential
     67     # exception has a traceback chain.
     68     return _wrapit(obj, method, *args, **kwds)

ValueError: cannot reshape array of size 100 into shape (10,9)

…or into 12 rows and 12 columns.

np.reshape(flat, (12,12))
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[37], line 1
----> 1 np.reshape(flat, (12,12))

File ~/Library/Python/3.9/lib/python/site-packages/numpy/core/fromnumeric.py:285, in reshape(a, newshape, order)
    200 @array_function_dispatch(_reshape_dispatcher)
    201 def reshape(a, newshape, order='C'):
    202     """
    203     Gives a new shape to an array without changing its data.
    204 
   (...)
    283            [5, 6]])
    284     """
--> 285     return _wrapfunc(a, 'reshape', newshape, order=order)

File ~/Library/Python/3.9/lib/python/site-packages/numpy/core/fromnumeric.py:59, in _wrapfunc(obj, method, *args, **kwds)
     56     return _wrapit(obj, method, *args, **kwds)
     58 try:
---> 59     return bound(*args, **kwds)
     60 except TypeError:
     61     # A TypeError occurs if the object does have such a method in its
     62     # class, but its signature is not identical to that of NumPy's. This
   (...)
     66     # Call _wrapit from within the except clause to ensure a potential
     67     # exception has a traceback chain.
     68     return _wrapit(obj, method, *args, **kwds)

ValueError: cannot reshape array of size 100 into shape (12,12)
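A quick way to avoid these errors is to check that the product of the new dimensions equals the number of elements, which is available through the size attribute. The sketch below shows that check. It also shows that reshape accepts -1 for one dimension, in which case numpy works out that dimension from the others. The names rows, cols, and grid are illustrative.

```python
import numpy as np

flat = np.arange(1, 101)

# 10 rows x 9 columns would need 90 values, but flat has 100.
rows, cols = 10, 9
print(flat.size, rows * cols)

# Passing -1 asks numpy to infer that dimension: 100 / 25 = 4 columns.
grid = np.reshape(flat, (25, -1))
print(grid.shape)  # (25, 4)
```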

We can also turn a multidimensional array into a 1d array using the array.flatten() method.

m = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])
m.flatten()
array([1, 2, 3, 4, 5, 6, 7, 8, 9])

Not only can we reshape an array, but we can also concatenate arrays, meaning we can stack them together, using the functions np.row_stack() and np.column_stack(). These functions take 1d or 2d arrays as a tuple and combine them to create 2d arrays. As with mathematical operations, arrays must have compatible shapes in order to stack them.

To use np.row_stack(), arrays must have the same number of columns, because the arrays are stacked on top of one another as rows.

one_array = np.array([7, 8, 9])
another_array = np.array([[9, 8, 7],
                          [9, 8, 7],
                          [9, 8, 7]])

print(one_array.shape, another_array.shape)

combined_row = np.row_stack((one_array, another_array))
combined_row
(3,) (3, 3)
array([[7, 8, 9],
       [9, 8, 7],
       [9, 8, 7],
       [9, 8, 7]])

To use np.column_stack(), arrays must have the same number of rows, because the arrays are placed side by side as columns.

one_array_col = np.reshape(one_array, (3,1))

print(one_array_col)

combined_col = np.column_stack((one_array_col, another_array))
combined_col
[[7]
 [8]
 [9]]
array([[7, 9, 8, 7],
       [8, 9, 8, 7],
       [9, 9, 8, 7]])
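The following self-contained sketch repeats both kinds of stacking in a single cell, so that every name is defined before it is used. It uses np.vstack() for stacking rows, on the assumption that np.row_stack() is an alias for it and that recent NumPy releases favor the vstack name. It also passes the 1d array directly to np.column_stack(), which treats a 1d array as a single column.

```python
import numpy as np

one_array = np.array([7, 8, 9])
another_array = np.array([[9, 8, 7],
                          [9, 8, 7],
                          [9, 8, 7]])

# Stack as rows: both inputs have 3 columns, so the result is 4 x 3.
combined_row = np.vstack((one_array, another_array))

# Stack as columns: both inputs have 3 rows, so the result is 3 x 4.
combined_col = np.column_stack((one_array, another_array))

print(combined_row.shape)  # (4, 3)
print(combined_col.shape)  # (3, 4)
print(combined_col)
```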

Not only can we change the shape of an array, but we can also change the data type of its elements using the astype() method. This method works much like the type-casting functions you’ve already seen: str(), bool(), int(), and float(). It acts on an array object and takes a data type as its argument. The result is a new array where every element has been converted to that type.

Just as with other type conversions, this will only succeed if Python knows how to make the conversion. For example, converting from numbers to strings will work, but trying to convert words like “hello” into integers will raise an error.

one_array.astype(str)
array(['7', '8', '9'], dtype='<U21')
not_castable = np.array(["blue","red","yellow"])
not_castable.astype(int)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[15], line 2
      1 not_castable = np.array(["blue","red","yellow"])
----> 2 not_castable.astype(int)

ValueError: invalid literal for int() with base 10: 'blue'
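Two more conversions are sketched below with made-up example arrays. Converting floats to integers drops the fractional part rather than rounding. Strings that contain only digits can be converted to integers. In both cases astype() returns a new array and leaves the original unchanged.

```python
import numpy as np

decimals = np.array([1.9, -2.7, 3.0])
as_ints = decimals.astype(int)   # fractional parts are dropped, not rounded
print(as_ints)                   # [ 1 -2  3]

digit_strings = np.array(["7", "8", "9"])
as_numbers = digit_strings.astype(int)
print(as_numbers + 1)            # [ 8  9 10]

# The original array still holds floats.
print(decimals.dtype)            # float64
```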

An Aside

When you see something like dtype='<U21' in an array, it’s numpy telling you the type of data stored in the array.

  • U means the array contains Unicode strings.

  • The number (21 in this case) means each string in the array can use up to 21 characters.

  • The < indicates the byte order in which each value is stored in memory (you can usually ignore it here).

So dtype='<U21' simply means: “this is an array of Unicode strings, each with a maximum length of 21 characters.”

There are many other functions in numpy that we can use to manipulate arrays. Another useful one is np.sort(). Similar to the list.sort() method, this function sorts array elements, but it returns a sorted copy rather than changing the original array. It takes an array as its first argument and allows you to specify the axis along which to sort. If no axis argument is given, the array is sorted along its last axis (each row, for a 2d array). If axis=None is given, the array is flattened (converted to a 1d array) before sorting.

np.sort(not_castable)
array(['blue', 'red', 'yellow'], dtype='<U6')
scrambled = np.array([[5,9,0],[2,8,5],[7,6,3]])
np.sort(scrambled, axis = 1)
array([[0, 5, 9],
       [2, 5, 8],
       [3, 6, 7]])
np.sort(scrambled, axis = 0)
array([[2, 6, 0],
       [5, 8, 3],
       [7, 9, 5]])
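The effect of the axis argument, including what happens when it is left out, can be checked with the same data. In this sketch the scrambled array is redefined so the cell runs on its own. The names default_sorted and flat_sorted are illustrative.

```python
import numpy as np

scrambled = np.array([[5, 9, 0], [2, 8, 5], [7, 6, 3]])

# With no axis argument, numpy sorts along the last axis (within each row).
default_sorted = np.sort(scrambled)
print(default_sorted)

# With axis=None, the array is flattened first and then sorted.
flat_sorted = np.sort(scrambled, axis=None)
print(flat_sorted)  # [0 2 3 5 5 6 7 8 9]

# np.sort returns a copy, so the original array is unchanged.
print(scrambled[0])  # [5 9 0]
```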

Note

Unlike list.sort(), the np.sort() function only sorts in increasing order. If you want to sort in decreasing order, you need to reverse the order after sorting. One way of doing that is to “slice” the array from start to end but moving backwards. You can do this with the following code:

np.sort(not_castable)[::-1] #array(['yellow', 'red', 'blue'], dtype='<U6')

Try it on your own computer!

Mathematical Functions#

So far, we’ve seen that arrays can be reshaped and combined, and that NumPy lets us perform elementwise arithmetic. But the real power of arrays comes from the many mathematical functions available in numpy. Some especially useful aggregate functions include np.min(), np.max(), np.sum(), and np.average().

These can be applied to the entire array, or restricted to a particular axis (rows or columns) in multidimensional arrays.

Applying these functions to the whole array means that all values are included in the calculation, and the result is a single summary value.

print(combined_col)
np.min(combined_col)
[[7 9 8 7]
 [8 9 8 7]
 [9 9 8 7]]
7

To apply these functions to each column of the array (combining the values down the rows), use an axis=0 argument. To apply them to each row (combining the values across the columns), use an axis=1 argument. The returned array has one value per column or per row, respectively.

np.sum(combined_col, axis=0)
array([24, 27, 24, 21])
np.average(combined_col, axis=1)
array([7.75, 8.  , 8.25])
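As a further check on how the axis argument behaves, the sketch below rebuilds the same 3-by-4 table under the illustrative name table and compares the whole-array result with the per-column and per-row results.

```python
import numpy as np

table = np.array([[7, 9, 8, 7],
                  [8, 9, 8, 7],
                  [9, 9, 8, 7]])

total = np.sum(table)              # every value combined into one number
col_sums = np.sum(table, axis=0)   # one value per column (4 values)
row_maxes = np.max(table, axis=1)  # one value per row (3 values)

print(total)      # 96
print(col_sums)   # [24 27 24 21]
print(row_maxes)  # [9 9 9]
```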

Not all numpy mathematical functions aggregate data. Some transform each element within an array instead. Examples include np.sqrt(), np.power(), and np.log().

For example, this code takes each element in the array to the third power.

np.power(np.array([1, 2, 3]), 3)
array([ 1,  8, 27])
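The other two functions mentioned above work the same way. In this brief sketch, with made-up example values, np.sqrt() takes the square root of each element and np.log() takes the natural logarithm of each element. The output array always has the same shape as the input.

```python
import numpy as np

values = np.array([1.0, 4.0, 9.0])

roots = np.sqrt(values)               # square root of each element
logs = np.log(np.array([1.0, np.e]))  # natural log of each element

print(roots)        # [1. 2. 3.]
print(logs)         # [0. 1.]
print(roots.shape)  # (3,)
```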

You can find more information on these and many other mathematical functions in the numpy documentation.

This chapter covered 5 different ways of organizing data: lists, tuples, dictionaries, sets, and arrays. The following table summarizes them to help you compare, contrast, and decide which object to use when.

| Type | Ordered | Mutable | Allows Duplicates | Indexed Access | Typical Uses |
|---|---|---|---|---|---|
| List | Yes | Yes | Yes | Yes | General-purpose collections where order matters |
| Tuple | Yes | No | Yes | Yes | Fixed collections, function returns |
| Dictionary | No | Yes | Keys: No; Values: Yes | By key | Lookups by name/label, storing mappings |
| Set | No | Yes | No | No | Membership tests, removing duplicates, set operations |
| Array (NumPy) | Yes | Yes | Yes | Yes | Efficient numerical operations |