4.1 Data Wrangling and Advanced Indexing

Goal: Build data wrangling skills to clean and navigate datasets.

Outline:

  • NumPy array and DataFrame methods and functions

  • Table indexing and slicing

  • Boolean (logical) indexing

  • Sorting

  • Data cleaning and inspection

Additional Required Reading: Functions

Data 8 textbook “Computational and Inferential Thinking: The Foundations of Data Science” by Ani Adhikari and John DeNero, Chapter 8: Functions and Tables. This should overlap with your assigned reading for Data 8.

NumPy Arrays and DataFrames

NumPy and pandas offer several types of data structures; the two main structures we use are the ndarray and the DataFrame. An ndarray is a fast and flexible container for large datasets that allows you to perform operations on whole blocks of data at once. ndarrays are best suited for homogeneous (just one type) numerical data. DataFrames are designed for tabular datasets and can handle heterogeneous data (multiple types: int, float, string, etc.).

import numpy as np
import pandas as pd

ndarray

Here is an example of an ndarray with random float data:

# Generate a random ndarray called arr_data
arr_data = np.random.randn(5,3)
arr_data
array([[-0.57998804,  0.97899567, -0.34593897],
       [ 0.72785995, -0.04733108, -0.78786345],
       [-1.28304881, -0.39334651, -0.70548535],
       [ 0.07545761, -1.38557708, -0.58136571],
       [-0.78433025,  1.77762604,  0.18444014]])

The attribute arrayName.shape is useful for finding the number of rows and columns in an array.

# use .shape to determine the shape of arr_data
arr_data.shape
(5, 3)

The attribute arrayName.dtype will display the data type of the data stored in the array.

# use .dtype to determine the type of arr_data
arr_data.dtype
dtype('float64')

The functions np.zeros and np.ones are similar: they create arrays full of zeros and ones, respectively. The input for these functions is the shape of the array you want. This is an effective way to set up an array of placeholders that you can then fill in with a loop, or to make an array of a single value.

# Generate an ndarray of zeros with np.zeros
arr0 = np.zeros((4,4))
arr0
array([[0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.]])
# Generate an ndarray of ones with np.ones
arr1 = np.ones((4,4))
arr1
array([[1., 1., 1., 1.],
       [1., 1., 1., 1.],
       [1., 1., 1., 1.],
       [1., 1., 1., 1.]])
# np.ones is handy for making an ndarray of any single value
arr5 = arr1 * 5
arr5
array([[5., 5., 5., 5.],
       [5., 5., 5., 5.],
       [5., 5., 5., 5.],
       [5., 5., 5., 5.]])
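The placeholder pattern mentioned above (set up an array of zeros, then fill it in with a loop) can be sketched like this; the values here are made up for illustration:

```python
import numpy as np

# Set up an array of placeholder zeros
squares = np.zeros(5)

# Fill in each slot with a loop
for i in range(5):
    squares[i] = i ** 2

squares
# squares is now [0., 1., 4., 9., 16.]
```

Note that because np.zeros defaults to float64, the filled-in values are stored as floats.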

The functions np.arange and np.linspace can be used to make monotonic number lines. They are really useful! np.arange makes an array of integers or floats between the starting and ending values (note that the ending value is exclusive) in steps that you set as inputs. np.linspace makes evenly spaced float values between the starting and ending values (note that the ending value is inclusive) that you set.

# Generate an array of integers between 0 and 10 in steps of 1, including 0 (start) but not 11 (end)
arr2 = np.arange(0,11,1) 
arr2
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10])
# Generate an array of floats between 0 and 10 in steps of 2, including 0 (start) but not 11 (end)
arr2 = np.arange(0.0,11.0,2.0) 
arr2
array([ 0.,  2.,  4.,  6.,  8., 10.])
# Generate an array of 14 evenly spaced numbers between 0 and 10, including 0 (start) and 10 (end).
arr3=np.linspace(0,10,14) 
arr3
array([ 0.        ,  0.76923077,  1.53846154,  2.30769231,  3.07692308,
        3.84615385,  4.61538462,  5.38461538,  6.15384615,  6.92307692,
        7.69230769,  8.46153846,  9.23076923, 10.        ])

DataFrames

Series and DataFrames are like ndarrays, but they have the added feature of index labels assigned to each row and column (the bold labels in the DataFrame below). These labels can be used to select and group data.

# generate a new DataFrame
# note that index values (like the column labels) don't have to be integers and don't have to be in order
frame = pd.DataFrame(np.random.rand(3, 3), index=['Nevada','Montana','Arizona'], columns=['sedimentary','igneous','metamorphic'])
frame
         sedimentary   igneous  metamorphic
Nevada      0.080129  0.383692     0.369028
Montana     0.386677  0.934536     0.226119
Arizona     0.377186  0.122905     0.989476
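The row labels can be used with .loc to pull out data by name rather than by position. A minimal sketch using the same frame as above (the random values will differ on each run):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.random.rand(3, 3),
                     index=['Nevada', 'Montana', 'Arizona'],
                     columns=['sedimentary', 'igneous', 'metamorphic'])

# Select a whole row by its index label
nevada_row = frame.loc['Nevada']

# Select a single value by row and column label
nevada_igneous = frame.loc['Nevada', 'igneous']
```

.loc works with the labels themselves, which is handy when the index is not a simple 0, 1, 2, … sequence.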

We’ve seen DataFrame structures before in our tabular data files. The earthquake catalog we were dealing with last week was a .csv (Comma Separated Values) file of all the earthquakes. We imported it as a DataFrame from the USGS API by setting up a query URL and using pd.read_csv. This time we’ll look at the earthquakes of magnitude 2.5 and greater from the past week.

start_day = '2020-09-07'
end_day = '2020-09-14'
standard_url = 'https://earthquake.usgs.gov/fdsnws/event/1/query?format=csv&orderby=magnitude'

query_url = standard_url + '&starttime=' + start_day + '&endtime=' + end_day + '&minmagnitude=2.5'
EQ_data = pd.read_csv(query_url)
EQ_data.head()
time latitude longitude depth mag magType nst gap dmin rms ... updated place type horizontalError depthError magError magNst status locationSource magSource
0 2020-09-07T06:12:39.688Z -17.1102 168.5034 10.0 6.2 mww NaN 41.0 2.072 1.03 ... 2020-11-29T09:47:05.444Z 72 km NNE of Port-Vila, Vanuatu earthquake 6.9 1.8 0.058 29.0 reviewed us us
1 2020-09-11T07:35:57.187Z -21.3968 -69.9096 51.0 6.2 mww NaN 72.0 0.077 0.88 ... 2020-11-21T20:10:36.040Z 82 km NNE of Tocopilla, Chile earthquake 6.3 1.9 0.071 19.0 reviewed us us
2 2020-09-12T02:44:11.224Z 38.7482 142.2446 34.0 6.1 mww NaN 47.0 2.232 0.97 ... 2020-11-21T20:10:37.040Z 58 km SE of Ōfunato, Japan earthquake 5.7 1.7 0.057 30.0 reviewed us us
3 2020-09-08T00:45:20.853Z -4.8713 129.7548 172.0 5.9 mww NaN 13.0 3.154 0.78 ... 2020-11-17T19:44:53.040Z 193 km SSE of Amahai, Indonesia earthquake 5.3 1.8 0.098 10.0 reviewed us us
4 2020-09-12T08:34:27.321Z -17.2562 167.6792 10.0 5.9 mww NaN 64.0 1.856 0.69 ... 2020-11-21T20:10:38.040Z 85 km NW of Port-Vila, Vanuatu earthquake 6.8 1.9 0.043 51.0 reviewed us us

5 rows × 22 columns

We have seen how to reference individual columns (which are called Series) with: DataFrame['Column_Name'].

EQ_data['depth']
0       10.00
1       51.00
2       34.00
3      172.00
4       10.00
        ...  
446     36.40
447      1.87
448     25.60
449      6.90
450     24.04
Name: depth, Length: 451, dtype: float64

The .values attribute can be used to return the values of the Series as an ndarray, i.e. without the labeled index values of the Series.

print(type(EQ_data['depth']))
<class 'pandas.core.series.Series'>
EQ_data['depth'].values
array([1.0000e+01, 5.1000e+01, 3.4000e+01, 1.7200e+02, 1.0000e+01,
       1.0000e+01, 1.7000e+01, 1.0000e+01, 5.5966e+02, 2.5000e+01,
       1.0000e+01, 3.0920e+01, 3.1061e+02, 1.3893e+02, 5.8030e+01,
       1.3380e+02, 1.0000e+01, 1.0000e+01, 5.7700e+01, 1.0000e+01,
       1.3161e+02, 1.0000e+01, 1.0000e+01, 3.8760e+01, 3.5000e+01,
       1.0000e+01, 1.2154e+02, 1.4900e+01, 1.0000e+01, 1.0125e+02,
       7.3140e+01, 1.0000e+01, 3.5000e+01, 1.0000e+01, 1.0000e+01,
       1.0000e+01, 1.0000e+01, 1.0000e+01, 6.0190e+01, 1.0956e+02,
       1.3670e+01, 4.1180e+01, 2.5640e+01, 1.0000e+01, 9.1090e+01,
       1.0000e+01, 1.0000e+01, 4.1160e+01, 4.6839e+02, 1.0000e+01,
       1.0000e+01, 1.1292e+02, 1.0000e+01, 1.0000e+01, 9.9500e+00,
       1.0000e+01, 1.0000e+01, 6.9820e+01, 1.0000e+01, 1.0000e+01,
       1.0000e+01, 1.5297e+02, 1.0000e+01, 1.0000e+01, 1.0000e+01,
       1.0000e+01, 1.0470e+01, 5.3357e+02, 1.0000e+01, 1.0000e+01,
       1.0000e+01, 1.0000e+01, 6.3290e+01, 1.0000e+01, 1.0000e+01,
       1.0000e+01, 1.2395e+02, 7.6540e+01, 3.5000e+01, 1.4690e+01,
       1.4010e+01, 1.0000e+01, 1.0000e+01, 3.5000e+01, 1.0000e+01,
       1.0000e+01, 1.0000e+01, 1.0000e+01, 1.0000e+01, 2.1316e+02,
       7.4360e+01, 6.7880e+01, 1.0000e+01, 1.6725e+02, 5.3140e+01,
       3.5000e+01, 1.0000e+01, 2.5090e+01, 1.0000e+01, 1.8132e+02,
       3.5000e+01, 1.3016e+02, 1.0000e+01, 4.8605e+02, 4.2380e+01,
       1.0391e+02, 2.3725e+02, 4.5090e+01, 5.1830e+01, 1.0000e+01,
       1.0000e+01, 3.5000e+01, 4.3770e+01, 3.5000e+01, 2.0400e+01,
       1.4044e+02, 1.0000e+01, 1.0000e+01, 1.0000e+01, 1.0000e+01,
       1.0000e+01, 4.5140e+01, 1.6290e+01, 3.5000e+01, 2.9330e+01,
       1.0000e+01, 1.0000e+01, 1.0000e+01, 1.0000e+01, 1.0000e+01,
       2.5010e+01, 1.8181e+02, 1.0000e+01, 3.5000e+01, 1.0000e+01,
       1.6640e+01, 1.0000e+01, 8.6100e+01, 1.8800e+02, 9.0490e+01,
       1.0000e+01, 1.0000e+01, 1.9186e+02, 1.0000e+01, 1.0000e+01,
       3.5000e+01, 1.0000e+01, 1.1536e+02, 1.0000e+01, 1.0000e+01,
       3.7721e+02, 6.3670e+01, 7.8550e+01, 5.1430e+01, 2.4226e+02,
       4.5660e+01, 1.0000e+01, 1.5310e+02, 1.3880e+01, 8.8000e+00,
       1.0000e+01, 6.5570e+01, 4.0000e+01, 3.5000e+01, 5.3160e+01,
       9.4700e+00, 2.7230e+01, 1.0000e+01, 6.0520e+02, 1.5339e+02,
       1.0000e+01, 1.0000e+01, 4.0940e+01, 1.0000e+01, 1.0000e+01,
       1.0000e+01, 6.7050e+01, 1.2725e+02, 1.3770e+01, 1.2202e+02,
       1.0000e+01, 1.0028e+02, 1.0000e+01, 1.0000e+01, 1.2170e+02,
       1.0000e+01, 4.9036e+02, 7.1140e+01, 4.0090e+01, 3.7430e+01,
       1.0000e+01, 5.5077e+02, 6.4040e+01, 9.3300e+00, 5.2798e+02,
       3.9300e+00, 1.0000e+01, 4.0810e+01, 2.7372e+02, 1.0000e+01,
       1.0000e+01, 5.1980e+01, 1.0000e+01, 1.0000e+01, 3.5000e+01,
       1.0000e+01, 1.0000e+01, 2.1000e+00, 1.2016e+02, 5.3610e+01,
       6.8400e+01, 7.3150e+01, 2.1260e+02, 1.6893e+02, 7.8770e+01,
       1.7110e+02, 4.8255e+02, 3.5000e+01, 1.0000e+01, 3.1160e+01,
       4.7420e+01, 3.6170e+01, 6.2320e+01, 1.0073e+02, 1.1993e+02,
       2.9170e+01, 1.0000e+01, 1.1844e+02, 1.7430e+01, 1.0000e+01,
       6.1590e+01, 1.8810e+01, 1.5361e+02, 9.2860e+01, 4.8296e+02,
       7.6560e+01, 1.0000e+01, 1.0000e+01, 1.0000e+01, 5.3973e+02,
       5.0372e+02, 5.9451e+02, 5.3977e+02, 8.8560e+01, 1.0000e+01,
       4.4250e+01, 3.0500e+01, 3.5000e+01, 1.0000e+01, 1.8704e+02,
       1.0000e+01, 3.1700e+01, 2.1725e+02, 1.9963e+02, 2.3488e+02,
       1.4845e+02, 4.6990e+01, 1.0000e+01, 8.0770e+01, 1.0000e+01,
       3.8000e+01, 3.7000e+01, 5.0000e+00, 6.0000e+01, 3.3000e+01,
       3.8000e+01, 3.7000e+01, 7.0890e+01, 1.0000e+01, 1.0400e+01,
       3.7000e+00, 3.2600e+01, 2.0000e+01, 1.2900e+02, 4.5000e+01,
       1.0100e+01, 3.5000e+01, 2.6500e+01, 1.8000e+00, 1.6790e+01,
       1.0000e+01, 3.4700e+01, 4.0000e+00, 5.0000e+00, 6.0000e+00,
       1.3000e+01, 1.1000e+01, 1.1000e+01, 2.3400e+00, 2.5110e+01,
       3.6400e+01, 1.0000e+01, 1.2000e+01, 5.3000e+01, 1.3000e+01,
       8.3100e+01, 1.3110e+01, 3.5000e+01, 0.0000e+00, 1.1780e+01,
       1.5880e+01, 1.3000e+01, 6.0000e+00, 1.1000e+01, 1.4000e+01,
       1.5530e+01, 5.6100e+00, 5.0000e+00, 8.4880e+01, 2.4840e+01,
       3.2940e+01, 1.1000e+01, 6.0000e+00, 1.1000e+01, 1.3000e+01,
       5.0000e+00, 3.5000e+01, 1.0000e+01, 1.4648e+02, 1.0000e+01,
       1.0000e+01, 1.0000e+01, 1.4638e+02, 1.0000e+01, 1.4429e+02,
       1.0000e+01, 9.0000e+00, 1.2000e+01, 6.8300e+00, 1.3000e+01,
       8.0000e+00, 1.0000e+01, 1.3790e+01, 1.1000e+01, 1.2000e+01,
       2.3490e+01, 8.0000e+00, 2.0370e+01, 9.0000e+00, 1.0000e+01,
       1.0000e+01, 1.0000e+01, 1.3727e+02, 1.4000e+01, 3.6500e+01,
       1.4700e+01, 7.6770e+01, 0.0000e+00, 2.5080e+01, 9.8420e+01,
       1.0000e+01, 1.0000e+01, 1.0000e+01, 1.1000e+01, 5.5900e+00,
       1.1000e+01, 1.4000e+01, 1.8800e+00, 3.1000e+00, 6.9600e+01,
       1.4100e+01, 1.0000e+01, 1.0000e+01, 1.7800e+01, 3.5200e+01,
       5.4000e-01, 4.5300e+01, 1.3600e+00, 1.0000e+01, 1.5000e+01,
       1.8000e+01, 6.3500e+00, 6.9700e+00, 1.0000e+01, 9.0000e+00,
       1.8000e+01, 6.7800e+00, 1.5597e+02, 5.0000e+00, 9.1100e+01,
       4.7000e+00, 2.4100e+00, 3.7000e+00, 1.0000e+01, 3.5000e+01,
       1.2240e+01, 1.2050e+02, 2.5680e+01, 2.5570e+01, 3.2240e+01,
       3.7300e+00, 1.8900e+00, 4.4100e+00, 1.4000e+01, 1.1000e+01,
       9.0000e+00, 1.3000e+01, 1.0000e+01, 2.1000e+01, 1.0000e+01,
       6.9800e+00, 1.0000e+01, 1.1900e+00, 9.0000e+00, 1.2000e+01,
       1.4000e+01, 7.5000e+00, 2.6670e+01, 5.3600e+01, 4.5690e+01,
       5.6910e+01, 3.0810e+01, 1.6000e+01, 1.2670e+01, 3.5000e+01,
       5.3000e+00, 1.0000e+01, 3.9500e+01, 1.4120e+01, 9.1000e+01,
       6.0000e+00, 1.6246e+02, 1.0000e+01, 9.9300e+00, 1.1000e+01,
       2.2000e+01, 6.7500e+00, 1.1000e+01, 6.0000e+00, 1.2000e+00,
       1.1000e+01, 1.3000e+01, 1.2000e+01, 3.2310e+01, 6.0000e+00,
       1.6000e+01, 3.2800e+01, 1.3000e+01, 1.2400e+01, 1.0000e+01,
       1.1087e+02, 4.5600e+01, 1.0040e+01, 1.0000e+01, 1.0080e+02,
       1.0210e+02, 3.6400e+01, 1.8700e+00, 2.5600e+01, 6.9000e+00,
       2.4040e+01])
type(EQ_data['depth'].values)
numpy.ndarray

Indexing and Slicing

Arrays and DataFrames have two axes of indices: rows and columns. Remember that Python indexing starts at zero, and the end bounds are generally exclusive.

[Figure: NumPy array indexing diagram. Source: Python for Data Analysis (2nd Edition), McKinney, W.]

Using square brackets we can select subsections of tables to work with:

[Figure: 2D array slicing diagram. Source: Python for Data Analysis (2nd Edition), McKinney, W.]

# generate a random array
arr_data = np.random.randn(10,5)
arr_data
array([[ 4.37604485e-01,  3.26020556e-01, -2.36903924e-03,
         4.70153904e-01,  9.61869263e-01],
       [-1.12741270e+00, -4.63232607e-01, -1.34920519e+00,
        -1.42607446e+00, -1.45081887e+00],
       [ 7.84472565e-01,  1.03299394e-01,  5.22566224e-01,
         1.08501171e+00,  1.00985635e+00],
       [ 3.95527376e-01,  1.83653584e+00, -2.10280867e-01,
         1.35102313e+00,  1.98818051e-01],
       [-5.92416692e-01, -1.00918554e+00,  4.15165548e-01,
        -1.41612380e+00,  2.86959690e-01],
       [ 2.48924308e-01, -5.06825885e-01, -9.51352788e-01,
         1.13902885e+00, -6.13957882e-02],
       [ 1.13111383e+00, -2.26036173e-01,  1.56250412e+00,
        -4.93195951e-01, -4.42609029e-01],
       [ 2.65890025e-02,  4.01799801e-01, -5.28542140e-01,
         5.10127706e-01, -3.85654941e-01],
       [-1.63634557e+00,  1.38423520e+00,  3.22007313e-01,
        -6.66468990e-01,  2.62433300e+00],
       [ 3.11680127e+00, -5.62329638e-01, -4.57055042e-01,
        -1.01879027e+00,  8.71575933e-01]])

Slice out the first 3 rows of arr_data:

a = arr_data[:3]
a
array([[ 0.43760449,  0.32602056, -0.00236904,  0.4701539 ,  0.96186926],
       [-1.1274127 , -0.46323261, -1.34920519, -1.42607446, -1.45081887],
       [ 0.78447257,  0.10329939,  0.52256622,  1.08501171,  1.00985635]])

Slice out the last 2 columns of arr_data:

b = arr_data[:,-2:]  
b
array([[ 0.4701539 ,  0.96186926],
       [-1.42607446, -1.45081887],
       [ 1.08501171,  1.00985635],
       [ 1.35102313,  0.19881805],
       [-1.4161238 ,  0.28695969],
       [ 1.13902885, -0.06139579],
       [-0.49319595, -0.44260903],
       [ 0.51012771, -0.38565494],
       [-0.66646899,  2.624333  ],
       [-1.01879027,  0.87157593]])
# Or this works too, referencing the same columns by positive index
b = arr_data[:,3:] 
b
array([[ 0.4701539 ,  0.96186926],
       [-1.42607446, -1.45081887],
       [ 1.08501171,  1.00985635],
       [ 1.35102313,  0.19881805],
       [-1.4161238 ,  0.28695969],
       [ 1.13902885, -0.06139579],
       [-0.49319595, -0.44260903],
       [ 0.51012771, -0.38565494],
       [-0.66646899,  2.624333  ],
       [-1.01879027,  0.87157593]])

Slicing a DataFrame is a bit different: .iloc slices rows by integer position, while .loc lets you reference the index labels.

Slice out the first 10 rows of EQ_data:

EQ_data.iloc[:10]
time latitude longitude depth mag magType nst gap dmin rms ... updated place type horizontalError depthError magError magNst status locationSource magSource
0 2020-09-07T06:12:39.688Z -17.1102 168.5034 10.00 6.2 mww NaN 41.0 2.072 1.03 ... 2020-11-29T09:47:05.444Z 72 km NNE of Port-Vila, Vanuatu earthquake 6.9 1.8 0.058 29.0 reviewed us us
1 2020-09-11T07:35:57.187Z -21.3968 -69.9096 51.00 6.2 mww NaN 72.0 0.077 0.88 ... 2020-11-21T20:10:36.040Z 82 km NNE of Tocopilla, Chile earthquake 6.3 1.9 0.071 19.0 reviewed us us
2 2020-09-12T02:44:11.224Z 38.7482 142.2446 34.00 6.1 mww NaN 47.0 2.232 0.97 ... 2020-11-21T20:10:37.040Z 58 km SE of Ōfunato, Japan earthquake 5.7 1.7 0.057 30.0 reviewed us us
3 2020-09-08T00:45:20.853Z -4.8713 129.7548 172.00 5.9 mww NaN 13.0 3.154 0.78 ... 2020-11-17T19:44:53.040Z 193 km SSE of Amahai, Indonesia earthquake 5.3 1.8 0.098 10.0 reviewed us us
4 2020-09-12T08:34:27.321Z -17.2562 167.6792 10.00 5.9 mww NaN 64.0 1.856 0.69 ... 2020-11-21T20:10:38.040Z 85 km NW of Port-Vila, Vanuatu earthquake 6.8 1.9 0.043 51.0 reviewed us us
5 2020-09-07T06:29:14.938Z -17.1622 168.5076 10.00 5.7 mww NaN 41.0 2.116 0.83 ... 2020-11-17T19:44:50.040Z 66 km NNE of Port-Vila, Vanuatu earthquake 7.0 1.8 0.050 38.0 reviewed us us
6 2020-09-09T07:18:40.291Z 4.1773 126.6447 17.00 5.7 mww NaN 32.0 3.062 1.17 ... 2020-11-21T20:10:33.040Z 188 km SE of Sarangani, Philippines earthquake 6.8 1.7 0.089 12.0 reviewed us us
7 2020-09-07T17:40:44.173Z -24.5115 -111.9893 10.00 5.6 mww NaN 34.0 3.501 1.10 ... 2020-11-17T19:44:52.040Z Easter Island region earthquake 10.1 1.8 0.066 22.0 reviewed us us
8 2020-09-12T02:37:29.903Z -17.8804 -178.0054 559.66 5.6 mww NaN 36.0 3.554 1.06 ... 2020-11-21T20:10:37.040Z 284 km E of Levuka, Fiji earthquake 9.1 4.9 0.093 11.0 reviewed us us
9 2020-09-08T08:28:53.492Z -15.1737 -172.9796 25.00 5.4 mww NaN 54.0 1.710 0.93 ... 2020-11-17T19:44:55.040Z 123 km NE of Hihifo, Tonga earthquake 7.2 1.8 0.093 11.0 reviewed us us

10 rows × 22 columns

Slice out a chunk of depths starting at index 5 and up to (but excluding) index 10:

EQ_data.iloc[5:10]['depth']
5     10.00
6     17.00
7     10.00
8    559.66
9     25.00
Name: depth, dtype: float64

Notice that this is still a Series with its corresponding index values. If you just want the values from that chunk and not the index labels, use .values.

EQ_data.iloc[5:10]['depth'].values
array([ 10.  ,  17.  ,  10.  , 559.66,  25.  ])

Boolean Indexing

We can use Boolean (i.e., logical) indexing to select values from our DataFrame where the condition we specify is True. You’ll use the logical symbols (<, >, ==, &, |, ~).

Use Boolean indexing to filter the data so that we are only looking at rows with magnitudes greater than or equal to 6.0:

EQ_data[EQ_data['mag']>=6.0]
time latitude longitude depth mag magType nst gap dmin rms ... updated place type horizontalError depthError magError magNst status locationSource magSource
0 2020-09-07T06:12:39.688Z -17.1102 168.5034 10.0 6.2 mww NaN 41.0 2.072 1.03 ... 2020-11-29T09:47:05.444Z 72 km NNE of Port-Vila, Vanuatu earthquake 6.9 1.8 0.058 29.0 reviewed us us
1 2020-09-11T07:35:57.187Z -21.3968 -69.9096 51.0 6.2 mww NaN 72.0 0.077 0.88 ... 2020-11-21T20:10:36.040Z 82 km NNE of Tocopilla, Chile earthquake 6.3 1.9 0.071 19.0 reviewed us us
2 2020-09-12T02:44:11.224Z 38.7482 142.2446 34.0 6.1 mww NaN 47.0 2.232 0.97 ... 2020-11-21T20:10:37.040Z 58 km SE of Ōfunato, Japan earthquake 5.7 1.7 0.057 30.0 reviewed us us

3 rows × 22 columns
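The &, |, and ~ symbols listed above let you combine several conditions; each condition needs its own set of parentheses. A sketch on a small made-up table (rather than the live catalog, whose contents change over time):

```python
import pandas as pd

# Hypothetical mini-catalog for illustration
quakes = pd.DataFrame({'mag':   [2.5, 4.0, 6.1, 5.5],
                       'depth': [10.0, 35.0, 550.0, 20.0]})

# Rows with mag >= 4.0 AND depth shallower than 100 km
shallow_large = quakes[(quakes['mag'] >= 4.0) & (quakes['depth'] < 100.0)]

# Rows that are NOT deeper than 100 km (~ negates the condition)
not_deep = quakes[~(quakes['depth'] > 100.0)]
```

Note that pandas uses & and | rather than Python's `and`/`or` for element-wise logic on Series.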

Sorting

DataFrames can be sorted by the values in a given column using .sort_values.

EQ_data.sort_values(by=['depth']).head()
time latitude longitude depth mag magType nst gap dmin rms ... updated place type horizontalError depthError magError magNst status locationSource magSource
298 2020-09-11T19:15:33.180Z 43.528800 -105.332200 0.00 3.20 ml NaN 68.0 1.10900 0.75 ... 2020-11-21T20:10:49.040Z 27 km SSE of Wright, Wyoming mining explosion 2.50 1.80 0.053 46.0 reviewed us us
347 2020-09-12T16:00:45.986Z 43.994900 -105.389800 0.00 2.90 ml NaN 100.0 0.98400 0.34 ... 2020-11-21T20:10:49.040Z 26 km SSE of Antelope Valley-Crestview, Wyoming mining explosion 6.70 1.80 0.097 14.0 reviewed us us
365 2020-09-10T12:09:31.800Z 44.322333 -110.520833 0.54 2.80 md 10.0 118.0 0.05854 0.27 ... 2020-11-21T20:10:35.040Z 59 km SE of West Yellowstone, Montana earthquake 0.82 0.40 0.200 4.0 reviewed uu uu
402 2020-09-10T18:33:51.740Z 44.333333 -110.505500 1.19 2.62 ml 17.0 125.0 0.07115 0.18 ... 2020-11-21T20:10:35.040Z 60 km SE of West Yellowstone, Montana earthquake 0.57 0.50 0.264 8.0 reviewed uu uu
429 2020-09-10T11:40:22.540Z 44.320333 -110.497000 1.20 2.56 ml 19.0 134.0 0.06052 0.21 ... 2020-11-21T20:10:34.040Z 61 km SE of West Yellowstone, Montana earthquake 0.83 0.73 0.321 8.0 reviewed uu uu

5 rows × 22 columns

You can reverse the order of sorting with ascending=False.

EQ_data.sort_values(by=['depth'],ascending=False).head()
time latitude longitude depth mag magType nst gap dmin rms ... updated place type horizontalError depthError magError magNst status locationSource magSource
168 2020-09-07T22:22:20.306Z -20.8541 -178.6937 605.20 4.3 mb NaN 115.0 4.359 0.82 ... 2020-11-17T19:44:53.040Z Fiji region earthquake 14.9 8.6 0.116 21.0 reviewed us us
241 2020-09-13T03:50:49.167Z -21.8264 -179.4210 594.51 4.1 mb NaN 112.0 4.705 0.73 ... 2020-11-21T20:10:46.040Z Fiji region earthquake 12.9 8.5 0.115 21.0 reviewed us us
8 2020-09-12T02:37:29.903Z -17.8804 -178.0054 559.66 5.6 mww NaN 36.0 3.554 1.06 ... 2020-11-21T20:10:37.040Z 284 km E of Levuka, Fiji earthquake 9.1 4.9 0.093 11.0 reviewed us us
191 2020-09-13T03:28:35.972Z -24.1523 -179.9576 550.77 4.3 mb NaN 143.0 17.743 0.28 ... 2020-11-21T20:10:46.040Z south of the Fiji Islands earthquake 19.1 13.7 0.147 14.0 reviewed us us
242 2020-09-13T04:05:14.608Z 6.0551 123.8095 539.77 4.1 mb NaN 76.0 2.026 0.63 ... 2020-11-21T20:10:46.040Z 45 km WSW of Palimbang, Philippines earthquake 15.9 14.2 0.198 7.0 reviewed us us

5 rows × 22 columns
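Because by= takes a list, you can also sort by several columns at once: rows are ordered by the first column, with ties broken by the next. A minimal sketch with made-up values:

```python
import pandas as pd

df = pd.DataFrame({'mag':   [4.0, 4.0, 5.0],
                   'depth': [30.0, 10.0, 20.0]})

# Sort by mag first; within equal magnitudes, sort by depth
df_sorted = df.sort_values(by=['mag', 'depth'])
```

Here the two mag-4.0 rows end up ordered by depth (10.0 before 30.0), followed by the mag-5.0 row.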

Data cleaning and inspection

.drop() can be used to drop whole columns from a DataFrame.

EQ_data_concise = EQ_data.drop(['magType','nst','gap','dmin','rms','net','id','updated','place','type','horizontalError','depthError','magError','magNst','status','locationSource','magSource',], axis='columns')
EQ_data_concise.head()
time latitude longitude depth mag
0 2020-09-07T06:12:39.688Z -17.1102 168.5034 10.0 6.2
1 2020-09-11T07:35:57.187Z -21.3968 -69.9096 51.0 6.2
2 2020-09-12T02:44:11.224Z 38.7482 142.2446 34.0 6.1
3 2020-09-08T00:45:20.853Z -4.8713 129.7548 172.0 5.9
4 2020-09-12T08:34:27.321Z -17.2562 167.6792 10.0 5.9

.unique() returns the unique values from the specified object.

unique_mags = EQ_data_concise['mag'].unique()
unique_mags.sort()
unique_mags
array([2.5 , 2.51, 2.53, 2.54, 2.55, 2.56, 2.57, 2.58, 2.6 , 2.61, 2.62,
       2.63, 2.64, 2.65, 2.66, 2.68, 2.69, 2.7 , 2.74, 2.75, 2.76, 2.77,
       2.78, 2.79, 2.8 , 2.81, 2.82, 2.84, 2.86, 2.9 , 2.91, 2.92, 2.95,
       2.96, 2.97, 2.98, 2.99, 3.  , 3.01, 3.02, 3.03, 3.06, 3.1 , 3.17,
       3.18, 3.2 , 3.21, 3.23, 3.26, 3.3 , 3.32, 3.34, 3.42, 3.43, 3.44,
       3.46, 3.5 , 3.53, 3.59, 3.6 , 3.7 , 3.71, 3.72, 3.74, 3.75, 3.8 ,
       3.83, 3.9 , 4.  , 4.1 , 4.2 , 4.22, 4.3 , 4.33, 4.36, 4.4 , 4.5 ,
       4.6 , 4.7 , 4.8 , 4.9 , 5.  , 5.1 , 5.2 , 5.4 , 5.6 , 5.7 , 5.9 ,
       6.1 , 6.2 ])

.value_counts() returns the count of each unique value from the specified object. This functionality can be used to find duplicate values.

EQ_data_concise['mag'].value_counts()
4.40    38
4.50    35
4.20    32
4.30    31
4.60    31
        ..
2.69     1
3.53     1
3.72     1
3.21     1
3.44     1
Name: mag, Length: 90, dtype: int64
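One way to use those counts to flag duplicates is to keep only the values that appear more than once. A sketch on a tiny hypothetical Series (pandas also provides .duplicated() for exactly this job):

```python
import pandas as pd

mags = pd.Series([2.5, 2.5, 3.0, 4.1])

# Count occurrences of each unique value
counts = mags.value_counts()

# Values with a count above 1 are duplicates
dupes = counts[counts > 1].index.tolist()
# dupes -> [2.5]
```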

Finding missing data (NaNs)

NaN stands for Not a Number and is used as a placeholder in data tables where no value exists. np.isnan returns a Boolean object with True where NaNs appear in the DataFrame.

np.isnan(EQ_data['nst'])
0       True
1       True
2       True
3       True
4       True
       ...  
446     True
447    False
448     True
449    False
450     True
Name: nst, Length: 451, dtype: bool
~np.isnan(EQ_data['nst'])
0      False
1      False
2      False
3      False
4      False
       ...  
446    False
447     True
448    False
449     True
450    False
Name: nst, Length: 451, dtype: bool

You can use this Boolean object to filter out rows that contain NaNs.

EQ_data[~np.isnan(EQ_data['nst'])]
time latitude longitude depth mag magType nst gap dmin rms ... updated place type horizontalError depthError magError magNst status locationSource magSource
162 2020-09-12T23:22:04.540Z 19.304600 -64.389600 40.00 4.36 md 19.0 233.0 1.86560 0.4800 ... 2020-11-21T20:10:38.040Z 115 km NNE of Cruz Bay, U.S. Virgin Islands earthquake 3.57 29.63 0.060000 8.0 reviewed pr pr
163 2020-09-13T07:12:58.980Z 19.396000 -64.284100 35.00 4.33 md 26.0 234.0 2.00520 0.2900 ... 2020-11-21T20:10:39.040Z 129 km NNE of Cruz Bay, U.S. Virgin Islands earthquake 1.81 21.72 0.130000 16.0 reviewed pr pr
195 2020-09-11T07:55:45.450Z 36.440667 -117.994500 3.93 4.22 ml 45.0 66.0 0.06908 0.1800 ... 2020-11-21T20:10:36.040Z 18km SSE of Lone Pine, CA earthquake 0.18 0.57 0.140000 320.0 reviewed ci ci
260 2020-09-08T01:19:54.510Z 19.121300 -64.398500 38.00 3.83 md 17.0 338.0 1.74990 0.4000 ... 2020-11-17T19:44:53.040Z 96 km NNE of Cruz Bay, U.S. Virgin Islands earthquake 3.23 24.52 0.110000 14.0 reviewed pr pr
261 2020-09-08T06:35:40.270Z 19.316600 -64.570800 37.00 3.83 md 20.0 335.0 1.73570 0.4100 ... 2020-11-17T19:44:54.040Z 111 km NNE of Cruz Bay, U.S. Virgin Islands earthquake 3.32 26.87 0.120000 10.0 reviewed pr pr
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
435 2020-09-08T11:03:53.600Z 18.027300 -66.828000 16.00 2.51 md 12.0 117.0 0.07310 0.1500 ... 2020-09-08T11:14:28.314Z 2 km ESE of Yauco, Puerto Rico earthquake 0.58 0.73 0.150000 5.0 reviewed pr pr
436 2020-09-13T17:55:42.130Z 19.182333 -155.475833 32.80 2.51 md 59.0 82.0 NaN 0.1200 ... 2020-11-21T20:10:40.040Z 2 km S of Pāhala, Hawaii earthquake 0.44 0.60 0.152709 27.0 reviewed hv hv
437 2020-09-07T08:24:04.300Z 17.932000 -66.947100 13.00 2.50 md 16.0 221.0 0.08040 0.1300 ... 2020-09-07T09:46:21.114Z 6 km SW of Guánica, Puerto Rico earthquake 0.79 0.30 0.120000 11.0 reviewed pr pr
447 2020-09-11T12:23:38.700Z 44.315167 -110.494500 1.87 2.50 ml 15.0 144.0 0.05648 0.1800 ... 2020-11-21T20:10:36.040Z 61 km SE of West Yellowstone, Montana earthquake 0.50 2.06 0.377000 8.0 reviewed uu uu
449 2020-09-12T05:56:08.720Z 38.169800 -117.964800 6.90 2.50 ml 28.0 47.8 0.03100 0.1599 ... 2020-11-21T20:10:37.040Z 27 km SSE of Mina, Nevada earthquake NaN 0.60 0.290000 15.0 reviewed nn nn

103 rows × 22 columns
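pandas also has a built-in shortcut for the same task: .dropna() removes rows containing NaNs in one step. A sketch using a small made-up frame rather than the live catalog:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'nst': [19.0, np.nan, 45.0],
                   'mag': [4.4, 5.0, 4.2]})

# Keep only rows where 'nst' is not NaN
clean = df.dropna(subset=['nst'])
```

Without subset=, dropna drops any row containing a NaN in any column, so restricting it to the columns you care about is usually safer.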

Further Reading (Optional)

This user guide has lots of useful examples and documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html