Fire up GraphLab Create¶

(See Getting Started with SFrames for setup instructions)

import graphlab

# Limit number of worker processes. This preserves system memory, which prevents hosted notebooks from crashing.
graphlab.set_runtime_config('GRAPHLAB_DEFAULT_NUM_PYLAMBDA_WORKERS', 4)

[INFO] graphlab.cython.cy_server: GraphLab Create v2.1 started. Logging: C:\Users\lenovo\AppData\Local\Temp\graphlab_server_1481797256.log.0

Load some house sales data¶

The dataset covers house sales in King County, the region where the city of Seattle, WA is located.

SFrame is the data structure that GraphLab Create uses to represent tabular data.

sales = graphlab.SFrame('home_data.gl/')  # A public dataset of houses sold in the Seattle area

sales

Exploring the data for housing sales¶

The house price is correlated with the number of square feet of living space.

view="Scatter Plot" requests a scatter plot.

graphlab.canvas.set_target('ipynb') renders the canvas output inside the notebook.

graphlab.canvas.set_target('ipynb')
sales.show(view="Scatter Plot", x="sqft_living", y="price")  # x axis: living area; y axis: price

Create a simple regression model of sqft_living to price¶

Split data into training and testing.
We use seed=0 so that everyone running this notebook gets the same results. In practice, you may set a random seed (or let GraphLab Create pick a random seed for you).

Calling random_split(fraction, seed=...) splits the data into a training set and a test set. The first argument is the fraction of rows assigned to the training set.

train_data, test_data = sales.random_split(.8, seed=0)  # 80% training set, 20% test set
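
To illustrate why a fixed seed makes the split repeatable, here is a minimal sketch in plain Python. It is not GraphLab Create's implementation: the function name, the per-row sampling rule, and the toy rows are all assumptions chosen for illustration.

```python
import random

def seeded_split(rows, fraction, seed):
    """Send each row to the first list with probability `fraction` (hypothetical helper)."""
    rng = random.Random(seed)  # a fixed seed yields the same random sequence every time
    first, second = [], []
    for row in rows:
        if rng.random() < fraction:
            first.append(row)
        else:
            second.append(row)
    return first, second

rows = list(range(1000))  # stand-in for the rows of a table
train, test = seeded_split(rows, 0.8, seed=0)
train_again, test_again = seeded_split(rows, 0.8, seed=0)

print("train rows: %d, test rows: %d" % (len(train), len(test)))
print("same split on the second run: %s" % (train == train_again))
```

With this sampling rule the training set holds roughly, not exactly, 80% of the rows, and the two runs with seed=0 produce identical lists.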

Build the regression model using only sqft_living as a feature¶

graphlab.linear_regression.create() creates a linear regression model.

By default, GraphLab Create selects the solver automatically.

sqft_model = graphlab.linear_regression.create(train_data, target='price', features=['sqft_living'],validation_set=None)

Linear regression:
--------------------------------------------------------
Number of examples          : 17384
Number of features          : 1
Number of unpacked features : 1
Number of coefficients      : 2
Starting Newton Method
--------------------------------------------------------
+-----------+----------+--------------+--------------------+---------------+
| Iteration | Passes   | Elapsed Time | Training-max_error | Training-rmse |
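
A model with one feature has two coefficients, an intercept and a slope. The sketch below fits them with the closed-form least-squares formulas in plain Python. This is a substitute technique for illustration only: the log above shows that GraphLab Create uses Newton's method, and the function name and toy data here are assumptions.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x (hypothetical helper)."""
    n = float(len(xs))
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of x and y divided by variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Toy data lying exactly on the line y = 3 + 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [5.0, 7.0, 9.0, 11.0]
intercept, slope = fit_line(xs, ys)
print("intercept=%.2f slope=%.2f" % (intercept, slope))
```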

Evaluate the simple model¶

mean() computes the average.

model.evaluate(dataset) evaluates the fitted model and reports max_error (the maximum error) and rmse (the root mean squared error).

print test_data['price'].mean()

543054.042563

print sqft_model.evaluate(test_data)

{'max_error': 4143550.8825285938, 'rmse': 255191.02870527358}

RMSE of about \$255,191!
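
The two metrics can be computed by hand. The sketch below uses plain Python and three made-up price pairs (hypothetical values, not taken from the dataset) to show what max_error and rmse measure; the function names are illustrative and are not GraphLab Create calls.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error: square root of the average squared difference."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / float(len(squared)))

def max_error(actual, predicted):
    """Largest absolute difference between an actual and a predicted value."""
    return max(abs(a - p) for a, p in zip(actual, predicted))

# Hypothetical prices, for illustration only.
actual = [200000.0, 350000.0, 500000.0]
predicted = [210000.0, 330000.0, 540000.0]

print("rmse: %.2f" % rmse(actual, predicted))
print("max_error: %.2f" % max_error(actual, predicted))
```

Because the errors are squared before averaging, a few large misses raise the RMSE far more than many small ones.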

Let's show what our predictions look like¶

Matplotlib is a Python plotting library. You can install it with:

'pip install matplotlib'

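%matplotlib inline renders plots inside the notebook.

matplotlib.pyplot.plot(x1, y1, format1, x2, y2, format2, ...) draws several series in one call. Each series is given as its x values, its y values, and a format string for its marker or line style.

model.predict(dataset) computes the model's predictions for the dataset.

model.get('coefficients') returns the model's fitted coefficients.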

import matplotlib.pyplot as plt
%matplotlib inline

# Plot two series: (1) living area vs. actual price on the test set; (2) living area vs. predicted price on the test set
plt.plot(test_data['sqft_living'],test_data['price'],'.',
        test_data['sqft_living'],sqft_model.predict(test_data),'-')

[<matplotlib.lines.Line2D at 0x1d45d3c8>,
 <matplotlib.lines.Line2D at 0xf31beb8>]

Above: blue dots are original data, green line is the prediction from the simple regression.

Below: we can view the learned regression coefficients.

sqft_model.get('coefficients')
# In the coefficients table, the first row is the intercept and the second row is the slope
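
As a worked example, the learned line can be evaluated by hand. The intercept and slope below are copied from the coefficients table in this notebook's output. The 2,400 square foot input and the function name are assumptions chosen for illustration.

```python
# Coefficients copied from the notebook's coefficients table.
intercept = -47114.0206702   # (intercept) row
slope = 281.957850166        # sqft_living row: dollars per additional square foot

def predict_price(sqft_living):
    """Price predicted by the single-feature model: intercept + slope * sqft_living."""
    return intercept + slope * sqft_living

predicted = predict_price(2400)  # hypothetical house with 2,400 sq ft of living space
print("%.2f" % predicted)  # about 629584.82
```

Under this model, each additional square foot adds about \$282 to the predicted price.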

Explore other features in the data¶

To build a more elaborate model, we will explore using more features.

my_features = ['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'zipcode']

sales[my_features].show()

view='BoxWhisker Plot' plots the relationship between two variables (red line: mean; horizontal lines: maximum and minimum).

sales.show(view='BoxWhisker Plot', x='zipcode', y='price')

Pull the bar at the bottom to view more of the data.

98039 is the most expensive zip code.

Build a regression model with more features¶

my_features_model = graphlab.linear_regression.create(train_data,target='price',features=my_features,validation_set=None)

Linear regression:
--------------------------------------------------------
Number of examples          : 17384
Number of features          : 6
Number of unpacked features : 6
Number of coefficients      : 115
Starting Newton Method
--------------------------------------------------------
+-----------+----------+--------------+--------------------+---------------+
| Iteration | Passes   | Elapsed Time | Training-max_error | Training-rmse |

print my_features

['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'zipcode']

Comparing the results of the simple model with adding more features¶

print sqft_model.evaluate(test_data)
print my_features_model.evaluate(test_data)
# Both the max error and the RMSE are lower than for the single-feature model

{'max_error': 4143550.8825285938, 'rmse': 255191.02870527358}
{'max_error': 3486584.509381705, 'rmse': 179542.4333126903}

The RMSE goes down from \$255,191 to \$179,542 with more features.
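
To put the improvement in relative terms, the short calculation below uses the two rmse values printed above. The variable names are illustrative.

```python
# rmse values copied from the two evaluate() outputs above.
simple_rmse = 255191.02870527358   # sqft_living only
multi_rmse = 179542.4333126903     # six features

reduction = simple_rmse - multi_rmse
relative_reduction = reduction / simple_rmse

print("absolute reduction: %.0f" % reduction)                     # 75649
print("relative reduction: %.1f%%" % (100 * relative_reduction))  # 29.6%
```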

Apply learned models to predict prices of 3 houses¶

The first house we will use is considered an "average" house in Seattle.

house1 = sales[sales['id']=='5309101200']

house1

To add an image to the notebook: <img src="URL or local path">

print house1['price']

[620000L, ... ]

print sqft_model.predict(house1)

[629584.8197281545]

print my_features_model.predict(house1)

[721918.9333272863]

In this case, the model with more features provides a worse prediction than the simpler model with only 1 feature. However, on average, the model with more features is better.
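
The size of each miss can be checked directly from the numbers printed above: the actual price and the two predictions for this house. The variable names are illustrative.

```python
# Values copied from the outputs above for the first house.
actual_price = 620000
simple_prediction = 629584.8197281545   # sqft_model
multi_prediction = 721918.9333272863    # my_features_model

simple_error = abs(simple_prediction - actual_price)
multi_error = abs(multi_prediction - actual_price)

print("single-feature model misses by %.2f" % simple_error)  # 9584.82
print("six-feature model misses by %.2f" % multi_error)      # 101918.93
```

A single house says little about overall quality, which is why the comparison on the full test set relies on RMSE.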

Prediction for a second, fancier house¶

We will now examine the predictions for a fancier house.

house2 = sales[sales['id']=='1925069082']

house2

print sqft_model.predict(house2)

[1261170.404099968]

print my_features_model.predict(house2)

[1446472.4690774973]

In this case, the model with more features provides a better prediction. This behavior is expected here, because this house is more differentiated by features that go beyond its square feet of living space, especially the fact that it's a waterfront house.

Last house, super fancy¶

Our last house is a very large one owned by a famous Seattleite.

Predicting the price of Bill Gates's house...

bill_gates = {'bedrooms':[8], 
              'bathrooms':[25], 
              'sqft_living':[50000], 
              'sqft_lot':[225000],
              'floors':[4], 
              'zipcode':['98039'], 
              'condition':[10], 
              'grade':[10],
              'waterfront':[1],
              'view':[4],
              'sqft_above':[37500],
              'sqft_basement':[12500],
              'yr_built':[1994],
              'yr_renovated':[2010],
              'lat':[47.627606],
              'long':[-122.242054],
              'sqft_living15':[5000],
              'sqft_lot15':[40000]}

print my_features_model.predict(graphlab.SFrame(bill_gates))

[13749825.525719076]

The model predicts a price of over \$13M for this house! But we expect the house to cost much more. (There are very few samples in the dataset of houses that are this fancy, so we don't expect the model to capture a perfect prediction here.)

Output of sales (first 10 rows; the columns are split across three blocks):

id	date	price	bedrooms	bathrooms	sqft_living	sqft_lot	floors
7129300520	2014-10-13 00:00:00+00:00	221900	3	1	1180	5650	1
6414100192	2014-12-09 00:00:00+00:00	538000	3	2.25	2570	7242	2
5631500400	2015-02-25 00:00:00+00:00	180000	2	1	770	10000	1
2487200875	2014-12-09 00:00:00+00:00	604000	4	3	1960	5000	1
1954400510	2015-02-18 00:00:00+00:00	510000	3	2	1680	8080	1
7237550310	2014-05-12 00:00:00+00:00	1225000	4	4.5	5420	101930	1
1321400060	2014-06-27 00:00:00+00:00	257500	3	2.25	1715	6819	2
2008000270	2015-01-15 00:00:00+00:00	291850	3	1.5	1060	9711	1
2414600126	2015-04-15 00:00:00+00:00	229500	3	1	1780	7470	1
3793500160	2015-03-12 00:00:00+00:00	323000	3	2.5	1890	6560	2

condition	grade	sqft_above	sqft_basement	yr_built	yr_renovated	zipcode	lat
3	7	1180	0	1955	0	98178	47.51123398
3	7	2170	400	1951	1991	98125	47.72102274
3	6	770	0	1933	0	98028	47.73792661
5	7	1050	910	1965	0	98136	47.52082
3	8	1680	0	1987	0	98074	47.61681228
3	11	3890	1530	2001	0	98053	47.65611835
3	7	1715	0	1995	0	98003	47.30972002
3	7	1060	0	1963	0	98198	47.40949984
3	7	1050	730	1960	0	98146	47.51229381
3	7	1890	0	2003	0	98038	47.36840673

long	sqft_living15	sqft_lot15
-122.25677536	1340.0	5650.0
-122.3188624	1690.0	7639.0
-122.23319601	2720.0	8062.0
-122.39318505	1360.0	5000.0
-122.04490059	1800.0	7503.0
-122.00528655	4760.0	101930.0
-122.32704857	2238.0	6819.0
-122.31457273	1650.0	9711.0
-122.33659507	1780.0	8113.0
-122.0308176	2390.0	7570.0

Output of sqft_model.get('coefficients'):

name	index	value	stderr
(intercept)	None	-47114.0206702	4923.34437753
sqft_living	None	281.957850166	2.16405465323