This hands-on guide demonstrates how the flexibility of the command line can help you become a more efficient and productive data scientist. You’ll learn how to combine small, yet powerful, command-line tools to quickly obtain, scrub, explore, and model your data.
To get you started—whether you’re on Windows, OS X, or Linux—author Jeroen Janssens introduces the Data Science Toolbox, an easy-to-install virtual environment packed with over 80 command-line tools.
Discover why the command line is an agile, scalable, and extensible technology. Even if you’re already comfortable processing data with, say, Python or R, you’ll greatly improve your data science workflow by also leveraging the power of the command line.
● Obtain data from websites, APIs, databases, and spreadsheets
● Perform scrub operations on plain text, CSV, HTML/XML, and JSON
● Explore data, compute descriptive statistics, and create visualizations
● Manage your data science workflow using Drake
● Create reusable tools from one-liners and existing Python or R code
● Parallelize and distribute data-intensive pipelines using GNU Parallel
● Model data with dimensionality reduction, clustering, regression, and classification algorithms
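As a rough illustration of what combining small command-line tools can look like, here is a minimal sketch that scrubs a line of text and counts its most frequent words. It is not taken from the book: the `top_words` function name and the sample sentence are hypothetical, and only standard POSIX utilities (`tr`, `grep`, `sort`, `uniq`, `head`, `awk`) are assumed.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the book): print the n most frequent
# words read from standard input, one "word count" pair per line.
top_words() {
  local n="${1:-3}"                 # number of words to show, default 3
  tr '[:upper:]' '[:lower:]' |      # scrub: normalize case
    tr -cs '[:alpha:]' '\n' |       # scrub: one word per line
    grep -v '^$' |                  # drop empty lines
    sort | uniq -c |                # explore: count each distinct word
    sort -k1,1nr -k2,2 |            # highest count first, ties alphabetical
    head -n "$n" |
    awk '{print $2, $1}'            # reorder columns to "word count"
}

# Made-up sample input, for illustration only.
printf 'the cat and the dog and the bird\n' | top_words 2
# prints:
# the 3
# and 2
```

Each stage does one small job and passes plain text to the next, so stages can be swapped or extended without touching the rest of the pipeline.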
Chapter 1 Introduction
Overview
Data Science Is OSEMN
Intermezzo Chapters
What Is the Command Line?
Why Data Science at the Command Line?
A Real-World Use Case
Further Reading
Chapter 2 Getting Started
Overview
Setting Up Your Data Science Toolbox
Essential Concepts and Tools
Further Reading
Chapter 3 Obtaining Data
Overview
Copying Local Files to the Data Science Toolbox
Decompressing Files
Converting Microsoft Excel Spreadsheets
Querying Relational Databases
Downloading from the Internet
Calling Web APIs
Further Reading
Chapter 4 Creating Reusable Command-Line Tools
Overview
Converting One-Liners into Shell Scripts
Creating Command-Line Tools with Python and R
Further Reading
Chapter 5 Scrubbing Data
Overview
Common Scrub Operations for Plain Text
Working with CSV
Working with HTML/XML and JSON
Common Scrub Operations for CSV
Further Reading
Chapter 6 Managing Your Data Workflow
Overview
Introducing Drake
Installing Drake
Obtain Top Ebooks from Project Gutenberg
Every Workflow Starts with a Single Step
Well, That Depends
Rebuilding Specific Targets
Discussion
Further Reading
Chapter 7 Exploring Data
Overview
Inspecting Data and Its Properties
Computing Descriptive Statistics
Creating Visualizations
Further Reading
Chapter 8 Parallel Pipelines
Overview
Serial Processing
Parallel Processing
Distributed Processing
Discussion
Further Reading
Chapter 9 Modeling Data
Overview
More Wine, Please!
Dimensionality Reduction with Tapkee
Clustering with Weka
Regression with SciKit-Learn Laboratory
Classification with BigML
Further Reading
Chapter 10 Conclusion
Let’s Recap
Three Pieces of Advice
Where to Go from Here?
Getting in Touch