Machine Learning with Spark (Reprint Edition, English)

Publication date: 2016-1-1
ISBN: 9787564160918
Author: Nick Pentreath (UK)
Pages: 319

Table of Contents

Preface

Chapter 1: Getting Up and Running with Spark

Installing and setting up Spark locally

Spark clusters

The Spark programming model

SparkContext and SparkConf

The Spark shell

Resilient Distributed Datasets

Creating RDDs

Spark operations

Caching RDDs

Broadcast variables and accumulators

The first step to a Spark program in Scala

The first step to a Spark program in Java

The first step to a Spark program in Python

Getting Spark running on Amazon EC2

Launching an EC2 Spark cluster

Summary

Chapter 2: Designing a Machine Learning System

Introducing MovieStream

Business use cases for a machine learning system

Personalization

Targeted marketing and customer segmentation

Predictive modeling and analytics

Types of machine learning models

The components of a data-driven machine learning system

Data ingestion and storage

Data cleansing and transformation

Model training and testing loop

Model deployment and integration

Model monitoring and feedback

Batch versus real time

An architecture for a machine learning system

Practical exercise

Summary

Chapter 3: Obtaining, Processing, and Preparing Data with Spark

Accessing publicly available datasets

The MovieLens 100k dataset

Exploring and visualizing your data

Exploring the user dataset

Exploring the movie dataset

Exploring the rating dataset

Processing and transforming your data

Filling in bad or missing data

Extracting useful features from your data

Numerical features

Categorical features

Derived features

Transforming timestamps into categorical features

Text features

Simple text feature extraction

Normalizing features

Using MLlib for feature normalization

Using packages for feature extraction

Summary

Chapter 4: Building a Recommendation Engine with Spark

Types of recommendation models

Content-based filtering

Collaborative filtering

Matrix factorization

Extracting the right features from your data

Extracting features from the MovieLens 100k dataset

Training the recommendation model

Training a model on the MovieLens 100k dataset

Training a model using implicit feedback data

Using the recommendation model

User recommendations

Generating movie recommendations from the MovieLens 100k dataset

Item recommendations

Generating similar movies for the MovieLens 100k dataset

Evaluating the performance of recommendation models

Mean Squared Error

Mean average precision at K

Using MLlib's built-in evaluation functions

RMSE and MSE

MAP

Summary

Chapter 5: Building a Classification Model with Spark

Types of classification models

Linear models

Logistic regression

Linear support vector machines

The naïve Bayes model

Decision trees

Extracting the right features from your data

Extracting features from the Kaggle/StumbleUpon evergreen classification dataset

Training classification models

Training a classification model on the Kaggle/StumbleUpon evergreen classification dataset

Using classification models

Generating predictions for the Kaggle/StumbleUpon evergreen classification dataset

Evaluating the performance of classification models

Accuracy and prediction error

Precision and recall

ROC curve and AUC

Improving model performance and tuning parameters

Feature standardization

Additional features

Using the correct form of data

Tuning model parameters

Linear models

Decision trees

The naïve Bayes model

Cross-validation

Summary

Chapter 6: Building a Regression Model with Spark

Types of regression models

Least squares regression

Decision trees for regression

Extracting the right features from your data

Extracting features from the bike sharing dataset

Creating feature vectors for the linear model

Creating feature vectors for the decision tree

Training and using regression models

Training a regression model on the bike sharing dataset

Evaluating the performance of regression models

Mean Squared Error and Root Mean Squared Error

Mean Absolute Error

Root Mean Squared Log Error

The R-squared coefficient

Computing performance metrics on the bike sharing dataset

Linear model

Decision tree

Improving model performance and tuning parameters

Transforming the target variable

Impact of training on log-transformed targets

Tuning model parameters

Creating training and testing sets to evaluate parameters

The impact of parameter settings for linear models

The impact of parameter settings for the decision tree

Summary

Chapter 7: Building a Clustering Model with Spark

Types of clustering models

K-means clustering

Initialization methods

Variants

Mixture models

Hierarchical clustering

Extracting the right features from your data

Extracting features from the MovieLens dataset

Extracting movie genre labels

Training the recommendation model

Normalization

Training a clustering model

Training a clustering model on the MovieLens dataset

Making predictions using a clustering model

Interpreting cluster predictions on the MovieLens dataset

Interpreting the movie clusters

Evaluating the performance of clustering models

Internal evaluation metrics

External evaluation metrics

Computing performance metrics on the MovieLens dataset

Tuning parameters for clustering models

Selecting K through cross-validation

Summary

Chapter 8: Dimensionality Reduction with Spark

Types of dimensionality reduction

Principal Components Analysis

Singular Value Decomposition

Relationship with matrix factorization

Clustering as dimensionality reduction

Extracting the right features from your data

Extracting features from the LFW dataset

Exploring the face data

Visualizing the face data

Extracting facial images as vectors

Normalization

Training a dimensionality reduction model

Running PCA on the LFW dataset

Visualizing the Eigenfaces

Interpreting the Eigenfaces

Using a dimensionality reduction model

Projecting data using PCA on the LFW dataset

The relationship between PCA and SVD

Evaluating dimensionality reduction models

Evaluating k for SVD on the LFW dataset

Summary

Chapter 9: Advanced Text Processing with Spark

What's so special about text data?

Extracting the right features from your data

Term weighting schemes

Feature hashing

Extracting the TF-IDF features from the 20 Newsgroups dataset

Exploring the 20 Newsgroups data

Applying basic tokenization

Improving our tokenization

Removing stop words

Excluding terms based on frequency

A note about stemming

Training a TF-IDF model

Analyzing the TF-IDF weightings

Using a TF-IDF model

Document similarity with the 20 Newsgroups dataset and TF-IDF features

Training a text classifier on the 20 Newsgroups dataset using TF-IDF

Evaluating the impact of text processing

Comparing raw features with processed TF-IDF features on the 20 Newsgroups dataset

Word2Vec models

Word2Vec on the 20 Newsgroups dataset

Summary

Chapter 10: Real-time Machine Learning with Spark Streaming

Online learning

Stream processing

An introduction to Spark Streaming

Input sources

Transformations

Actions

Window operators

Caching and fault tolerance with Spark Streaming

Creating a Spark Streaming application

The producer application

Creating a basic streaming application

Streaming analytics

Stateful streaming

Online learning with Spark Streaming

Streaming regression

A simple streaming regression program

Creating a streaming data producer

Creating a streaming regression model

Streaming K-means

Online model evaluation

Comparing model performance with Spark Streaming

Summary

Index

About the Book

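Apache Spark is a newly developed distributed framework optimized in particular for low-latency tasks and in-memory data storage. It combines speed, scalability, in-memory processing, and fault tolerance, making it one of the few frameworks well suited to parallel computing. It is also very easy to program, with a flexible, expressive, and powerful API design.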

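Machine Learning with Spark (Reprint Edition, English) guides you through the basics of the Spark API for loading and processing data, and shows how to prepare suitable input data for a variety of machine learning models. Detailed examples and real-world use cases help you learn common machine learning models, including recommender systems, classification, regression, clustering, and dimensionality reduction. You will also see advanced topics such as large-scale text processing, approaches to online machine learning, and model evaluation using Spark Streaming.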
