Python Tutorial

Welcome to the Python tutorial version of Data Science Rosetta Stone. Before beginning this tutorial, check whether you have Python 3.5.2 installed; this is not required, but it is the release used to generate the following examples. Several Python packages are used throughout this tutorial. You may not need all of them for your specific needs; they are described in more detail in Appendix Section 2.

To install Python packages, you need to run the following in the command line/terminal of your computer:

pip install package_name
# or #
conda install package_name

Note: In Python, comments are indicated in code with a “#” character, and arrays and matrices are zero-indexed.
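As a quick illustration of the note above, here is a minimal sketch of zero-based indexing. The matrix and its values are made up for this example only.

```python
import numpy as np

# A made-up 2 x 3 matrix, used only to illustrate zero-based indexing
matrix = np.array([[10, 20, 30],
                   [40, 50, 60]])

first_element = matrix[0, 0]   # row 0, column 0 is the top-left element
second_row = matrix[1]         # the second row has index 1

print(first_element)
print(second_row)
```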

Now let’s get started! First, you need to import several very important Python packages for data manipulation and scientific computing. The pandas package is useful for data manipulation and the NumPy package is useful for scientific computing.

import pandas as pd
import numpy as np

1 Reading in Data and Basic Statistical Functions

1.1 Read in the data.

The following examples demonstrate importing data into Python from 3 different file formats. The pandas package can read all 3 formats, as well as many others, using its IO tools.

a) Read the data in as a .csv file.

student = pd.read_csv('/Users/class.csv')

b) Read the data in as a .xls file.

# Notice you must specify the file location, as well as the name of the sheet 
# of the .xls file you want to import. Recent pandas releases name this
# argument sheet_name (older releases used sheetname).
student_xls = pd.read_excel('/Users/class.xls', sheet_name='class')

c) Read the data in as a .json file.

student_json = pd.read_json('/Users/class.json')
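If you do not have these files on disk, the readers can be tried on in-memory text instead. The following is a minimal sketch, assuming only the five rows that appear in the head() output in Section 1.4; the names csv_text, sample_csv, and sample_json are illustrative and not part of the tutorial's data.

```python
import io
import pandas as pd

# Five rows copied from the head() output in Section 1.4; the in-memory
# buffers below stand in for files on disk.
csv_text = """Name,Sex,Age,Height,Weight
Alfred,M,14,69.0,112.5
Alice,F,13,56.5,84.0
Barbara,F,13,65.3,98.0
Carol,F,14,62.8,102.5
Henry,M,14,63.5,102.5
"""
sample_csv = pd.read_csv(io.StringIO(csv_text))

# Write the same rows out as JSON text, then read that text back in
json_text = sample_csv.to_json(orient="records")
sample_json = pd.read_json(io.StringIO(json_text), orient="records")

print(sample_csv.shape)
print(sample_json.shape)
```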

1.2 Find the dimensions of the data set.

The dimensions of a DataFrame in Python are stored as an attribute of the object, not returned by a function call. Therefore, you can state the data name followed by .shape (with no parentheses) to return the dimensions as a tuple, with the first integer indicating the number of rows and the second indicating the number of columns.

print(student.shape)
## (19, 5)
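Because .shape is an ordinary tuple, it can be unpacked into separate variables. The sketch below assumes a five-row subset of the data (the rows shown in Section 1.4), stored under the illustrative name sample.

```python
import pandas as pd

# Five-row subset of the class data, taken from the head() output in
# Section 1.4; "sample" is an illustrative name
sample = pd.DataFrame({
    "Name": ["Alfred", "Alice", "Barbara", "Carol", "Henry"],
    "Sex": ["M", "F", "F", "F", "M"],
    "Age": [14, 13, 13, 14, 14],
    "Height": [69.0, 56.5, 65.3, 62.8, 63.5],
    "Weight": [112.5, 84.0, 98.0, 102.5, 102.5],
})

# Unpack the (rows, columns) tuple
n_rows, n_cols = sample.shape
print(n_rows, n_cols)

# len() on a DataFrame gives the number of rows only
print(len(sample))
```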

1.3 Find basic information about the data set.

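Information about a DataFrame is available by calling the info() method on the data. Note that info() prints its report directly and returns None, which is why the last line of the output below is None.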

print(student.info())
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 19 entries, 0 to 18
## Data columns (total 5 columns):
## Name      19 non-null object
## Sex       19 non-null object
## Age       19 non-null int64
## Height    19 non-null float64
## Weight    19 non-null float64
## dtypes: float64(2), int64(1), object(2)
## memory usage: 840.0+ bytes
## None
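Two pieces of the info() report can also be retrieved as objects for further use: the column types via the dtypes attribute, and the number of missing values via isna(). This is a minimal sketch on a five-row subset of the data (rows from Section 1.4); the names sample and missing_per_column are illustrative.

```python
import pandas as pd

# Five-row subset of the class data, taken from the head() output in
# Section 1.4; "sample" is an illustrative name
sample = pd.DataFrame({
    "Name": ["Alfred", "Alice", "Barbara", "Carol", "Henry"],
    "Sex": ["M", "F", "F", "F", "M"],
    "Age": [14, 13, 13, 14, 14],
    "Height": [69.0, 56.5, 65.3, 62.8, 63.5],
    "Weight": [112.5, 84.0, 98.0, 102.5, 102.5],
})

# Data type of each column, as a Series indexed by column name
print(sample.dtypes)

# Count of missing values in each column
missing_per_column = sample.isna().sum()
print(missing_per_column)
```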

1.4 Look at the first 5 (last 5) observations.

The first 5 observations of a DataFrame are available by calling the head() function on the data. By default, head() returns 5 observations. To return the first n observations, pass the integer n into the function. The tail() function is analogous and returns the last observations.

print(student.head())
##       Name Sex  Age  Height  Weight
## 0   Alfred   M   14    69.0   112.5
## 1    Alice   F   13    56.5    84.0
## 2  Barbara   F   13    65.3    98.0
## 3    Carol   F   14    62.8   102.5
## 4    Henry   M   14    63.5   102.5
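To illustrate passing n, here is a minimal sketch on a five-row subset of the data (the rows shown above); the names sample, first_three, and last_two are illustrative.

```python
import pandas as pd

# Five-row subset of the class data, taken from the head() output above;
# "sample" is an illustrative name
sample = pd.DataFrame({
    "Name": ["Alfred", "Alice", "Barbara", "Carol", "Henry"],
    "Sex": ["M", "F", "F", "F", "M"],
    "Age": [14, 13, 13, 14, 14],
    "Height": [69.0, 56.5, 65.3, 62.8, 63.5],
    "Weight": [112.5, 84.0, 98.0, 102.5, 102.5],
})

first_three = sample.head(3)   # first 3 observations
last_two = sample.tail(2)      # last 2 observations

print(first_three)
print(last_two)
```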

1.5 Calculate means of numeric variables.

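The means of numeric variables of a DataFrame are available by calling the mean() method on the data. Passing numeric_only=True skips the non-numeric Name and Sex columns; recent pandas releases raise an error if asked to average them.

print(student.mean(numeric_only=True))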
## Age        13.315789
## Height     62.336842
## Weight    100.026316
## dtype: float64
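An alternative way to restrict the calculation to numeric variables is to select those columns explicitly first. The sketch below uses select_dtypes() on a five-row subset of the data (rows from Section 1.4), so its means differ from the full-data output above; the names sample and numeric_means are illustrative.

```python
import pandas as pd

# Five-row subset of the class data, taken from the head() output in
# Section 1.4; "sample" is an illustrative name
sample = pd.DataFrame({
    "Name": ["Alfred", "Alice", "Barbara", "Carol", "Henry"],
    "Sex": ["M", "F", "F", "F", "M"],
    "Age": [14, 13, 13, 14, 14],
    "Height": [69.0, 56.5, 65.3, 62.8, 63.5],
    "Weight": [112.5, 84.0, 98.0, 102.5, 102.5],
})

# Keep only the numeric columns, then average each one
numeric_means = sample.select_dtypes(include="number").mean()
print(numeric_means)
```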

1.6 Compute summary statistics of the data set.

Summary statistics of a DataFrame are available by calling the describe() function on the data.

print(student.describe())
##              Age     Height      Weight
## count  19.000000  19.000000   19.000000
## mean   13.315789  62.336842  100.026316
## std     1.492672   5.127075   22.773933
## min    11.000000  51.300000   50.500000
## 25%    12.000000  58.250000   84.250000
## 50%    13.000000  62.800000   99.500000
## 75%    14.500000  65.900000  112.250000
## max    16.000000  72.000000  150.000000
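The quartiles reported by describe() are only a default. As a minimal sketch, the percentiles argument can request other cut points; the example below uses a five-row subset of the data (rows from Section 1.4), and the choice of the 10th and 90th percentiles is arbitrary.

```python
import pandas as pd

# Five-row subset of the class data, taken from the head() output in
# Section 1.4; "sample" is an illustrative name
sample = pd.DataFrame({
    "Name": ["Alfred", "Alice", "Barbara", "Carol", "Henry"],
    "Sex": ["M", "F", "F", "F", "M"],
    "Age": [14, 13, 13, 14, 14],
    "Height": [69.0, 56.5, 65.3, 62.8, 63.5],
    "Weight": [112.5, 84.0, 98.0, 102.5, 102.5],
})

# Request the 10th and 90th percentiles; the median (50%) is always included
summary = sample.describe(percentiles=[0.1, 0.9])
print(summary)
```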

1.7 Descriptive statistics functions applied to variables of the data set.

# Notice the subsetting of student with [] and the name of the variable in 
# quotes ("")
print(student["Weight"].std())
## 22.773933493879046
print(student["Weight"].sum())
## 1900.5
print(student["Weight"].count())
## 19
print(student["Weight"].max())
## 150.0
print(student["Weight"].min())
## 50.5
print(student["Weight"].median())
## 99.5
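Several of these statistics can also be requested in a single call with agg(), which accepts a list of function names. This is a minimal sketch on a five-row subset of the data (rows from Section 1.4), so the values differ from the full-data output above; the names sample and weight_stats are illustrative.

```python
import pandas as pd

# Five-row subset of the class data, taken from the head() output in
# Section 1.4; "sample" is an illustrative name
sample = pd.DataFrame({
    "Name": ["Alfred", "Alice", "Barbara", "Carol", "Henry"],
    "Sex": ["M", "F", "F", "F", "M"],
    "Age": [14, 13, 13, 14, 14],
    "Height": [69.0, 56.5, 65.3, 62.8, 63.5],
    "Weight": [112.5, 84.0, 98.0, 102.5, 102.5],
})

# One call returns a Series with one entry per requested statistic
weight_stats = sample["Weight"].agg(["sum", "count", "max", "min", "median"])
print(weight_stats)
```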

1.8 Produce a one-way table to describe the frequency of a variable.

a) Produce a one-way table of a discrete variable.

# columns="count" labels the single output column "count"; the table holds
# the number of observations at each level of the index variable
print(pd.crosstab(index=student["Age"], columns="count"))
## col_0  count
## Age
## 11         2
## 12         5
## 13         3
## 14         4
## 15         4
## 16         1

b) Produce a one-way table of a categorical variable.

print(pd.crosstab(index=student["Sex"], columns="count"))
## col_0  count
## Sex
## F          9
## M         10
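As an alternative to pd.crosstab() for one-way tables, a column's value_counts() method returns the same frequencies as a Series. This is a minimal sketch on a five-row subset of the data (rows from Section 1.4), so the counts differ from the full-data tables above; the names sample and sex_counts are illustrative.

```python
import pandas as pd

# Five-row subset of the class data, taken from the head() output in
# Section 1.4; "sample" is an illustrative name
sample = pd.DataFrame({
    "Name": ["Alfred", "Alice", "Barbara", "Carol", "Henry"],
    "Sex": ["M", "F", "F", "F", "M"],
    "Age": [14, 13, 13, 14, 14],
    "Height": [69.0, 56.5, 65.3, 62.8, 63.5],
    "Weight": [112.5, 84.0, 98.0, 102.5, 102.5],
})

# Frequency of each level of Sex, sorted by level name
sex_counts = sample["Sex"].value_counts().sort_index()
print(sex_counts)
```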

1.9 Produce a two-way table to describe the frequency of two categorical or discrete variables.

# Notice the specification of a variable for the columns argument, instead 
# of "count"
print(pd.crosstab(index=student["Age"], columns=student["Sex"]))
## Sex  F  M
## Age
## 11   1  1
## 12   2  3
## 13   2  1
## 14   2  2
## 15   2  2
## 16   0  1
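Row and column totals can be added to a two-way table with the margins argument of pd.crosstab(). This is a minimal sketch on a five-row subset of the data (rows from Section 1.4), so the counts differ from the full-data table above; the names sample and table are illustrative.

```python
import pandas as pd

# Five-row subset of the class data, taken from the head() output in
# Section 1.4; "sample" is an illustrative name
sample = pd.DataFrame({
    "Name": ["Alfred", "Alice", "Barbara", "Carol", "Henry"],
    "Sex": ["M", "F", "F", "F", "M"],
    "Age": [14, 13, 13, 14, 14],
    "Height": [69.0, 56.5, 65.3, 62.8, 63.5],
    "Weight": [112.5, 84.0, 98.0, 102.5, 102.5],
})

# margins=True appends an "All" row and an "All" column holding the totals
table = pd.crosstab(index=sample["Age"], columns=sample["Sex"], margins=True)
print(table)
```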