Prediction

Much of the “Big Data” collection we hear about is focused on learning from the past. Companies want to know who is most likely to buy a product so they can create targeted ads. Doctors want to know how quickly someone will recover if given a certain treatment so they can choose a treatment plan. Data scientists create models that find patterns in data from the past and use those models to predict what will happen in new data. But how are these models created?

In Section 10.1, we discussed relationships, or associations, between variables in a dataset. Recall the example about ice cream and swimming: warm weather is associated both with more people swimming at the beach and with higher ice cream sales. When the weather is warmer, more people buy ice cream.

Imagine you are an executive at a popular ice cream company. You want to know the best time to premiere a new product, so you ask your data scientist to predict which month will have the highest demand for ice cream in the next year. Your data scientist will look for patterns in historical data, find that the months with the warmest weather have the highest ice cream sales, and as a result predict that July or August would be the best month to release the new product.
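As a rough sketch of what that first pass over historical data might look like, the snippet below averages sales by calendar month and picks the month with the highest average. The sales figures and the column names (`year`, `month`, `units_sold`) are invented purely for illustration and do not come from any real dataset.

```python
import pandas as pd

# Hypothetical historical sales: made-up numbers for illustration only.
sales_df = pd.DataFrame({
    "year": [2021] * 4 + [2022] * 4,
    "month": ["May", "Jun", "Jul", "Aug"] * 2,
    "units_sold": [310, 420, 560, 540, 330, 450, 590, 575],
})

# Average the sales for each calendar month across the years.
monthly_avg = sales_df.groupby("month")["units_sold"].mean()

# Use the month with the highest average past sales as the prediction
# for the month with the highest demand next year.
best_month = monthly_avg.idxmax()
print(best_month)
```

With these made-up numbers, the prediction is simply the month that sold best in the past, which is the idea behind using historical patterns to predict new data.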

Often, situations are more complicated. Imagine you are trying to predict the selling price of a house. There are likely multiple variables that contribute to that price: number of bedrooms, number of bathrooms, square footage, etc. How can we find these patterns and choose variables to use for prediction? Often, the first step is to plot your data.

Below is a dataset containing information on housing prices in Athens, Ohio. Let’s look at plots of the relationship between characteristics of a house and its selling price.

import pandas as pd
import matplotlib.pyplot as plt
housing_df = pd.read_csv("../../data/Housing.csv")
housing_df.head()
   floor_size  bed_room_count  built_year  sold_date  sold_price  room_count  garage_size  parking_lot
0        2068               3        2003    Aug2015      195500           6          768            3
1        3372               3        1999    Dec2015      385000           6          480            2
2        3130               3        1999    Jan2017      188000           7          400            2
3        3991               3        1999    Nov2014      375000           8          400            2
4        1450               2        1999    Jan2015      136000           7          200            1
plt.scatter(housing_df['bed_room_count'], housing_df['sold_price'])
plt.xlabel("Number of Bedrooms")
plt.ylabel("Selling Price")
plt.show()
[Figure: scatter plot of selling price versus number of bedrooms]

Because the number of bedrooms takes only whole-number values, the points in this scatter plot line up in vertical bands, which makes the plot look a bit strange. Even so, we can see a trend where houses with more bedrooms tend to have higher selling prices.
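When one variable is discrete like this, one simple way to check the trend numerically is to compare the average selling price for each bedroom count. The sketch below does this for only the five rows displayed by `housing_df.head()` above, so the numbers illustrate the idea rather than summarize the full dataset. The name `sample_df` is introduced here for illustration.

```python
import pandas as pd

# Only the five rows shown by housing_df.head(), not the full dataset.
sample_df = pd.DataFrame({
    "bed_room_count": [3, 3, 3, 3, 2],
    "sold_price": [195500, 385000, 188000, 375000, 136000],
})

# Average selling price for each distinct number of bedrooms.
mean_price = sample_df.groupby("bed_room_count")["sold_price"].mean()
print(mean_price)
```

Running the same `groupby` on `housing_df` itself would give one average per bedroom count for the whole dataset. Five rows are far too few to draw a conclusion from.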

plt.scatter(housing_df['floor_size'], housing_df['sold_price'])
plt.xlabel("Floor Size")
plt.ylabel("Selling Price")
plt.show()
[Figure: scatter plot of selling price versus floor size]

In this scatter plot, we can see a trend where larger houses tend to have higher selling prices.

plt.scatter(housing_df['built_year'], housing_df['sold_price'])
plt.xlabel("Year Built")
plt.ylabel("Selling Price")
plt.show()
[Figure: scatter plot of selling price versus year built]

In this scatter plot, the points appear to be randomly scattered, which suggests no clear trend between the year a house was built and its selling price.

In order to make a prediction about a new house’s selling price, we need a way of quantifying these trends and combining them into a single predictor. We will discuss multiple ways of doing this throughout the rest of this book, but in the next section, we will focus on the simplest way of quantifying a linear relationship: correlation.