Mining of Massive Datasets

Jure Leskovec
Stanford Univ.

Anand Rajaraman
Milliway Labs

Jeffrey D. Ullman
Stanford Univ.

Copyright © 2010, 2011, 2012, 2013, 2014 Anand Rajaraman, Jure Leskovec, and Jeffrey D. Ullman

Preface

This book evolved from material developed over several years by Anand Rajaraman and Jeff Ullman for a one-quarter course at Stanford. The course CS345A, titled "Web Mining," was designed as an advanced graduate course, although it has become accessible and interesting to advanced undergraduates. When Jure Leskovec joined the Stanford faculty, we reorganized the material considerably. He introduced a new course CS224W on network analysis and added material to CS345A, which was renumbered CS246. The three authors also introduced a large-scale data-mining project course, CS341. The book now contains material taught in all three courses.

What the Book Is About

At the highest level of description, this book is about data mining. However, it focuses on data mining of very large amounts of data, that is, data so large it does not fit in main memory. Because of the emphasis on size, many of our examples are about the Web or data derived from the Web. Further, the book takes an algorithmic point of view: data mining is about applying algorithms to data, rather than using data to "train" a machine-learning engine of some sort. The principal topics covered are:

1. Distributed file systems and map-reduce as a tool for creating parallel algorithms that succeed on very large amounts of data.

2. Similarity search, including the key techniques of minhashing and locality-sensitive hashing.

3. Data-stream processing and specialized algorithms for dealing with data that arrives so fast it must be processed immediately or lost.

4. The technology of search engines, including Google's PageRank, link-spam detection, and the hubs-and-authorities approach.

5. Frequent-itemset mining, including association rules, market-baskets, the A-Priori Algorithm and its improvements.

6. Algorithms for clustering very large, high-dimensional datasets.