This hands-on guide demonstrates how the flexibility of the command line can help you become a more efficient and productive data scientist. You’ll learn how to combine small, yet powerful, command-line tools to quickly obtain, scrub, explore, and model your data.
To get you started—whether you’re on Windows, OS X, or Linux—author Jeroen Janssens introduces the Data Science Toolbox, an easy-to-install virtual environment packed with over 80 command-line tools.
Discover why the command line is an agile, scalable, and extensible technology. Even if you’re already comfortable processing data with, say, Python or R, you’ll greatly improve your data science workflow by also leveraging the power of the command line.
● Obtain data from websites, APIs, databases, and spreadsheets
● Perform scrub operations on plain text, CSV, HTML/XML, and JSON
● Explore data, compute descriptive statistics, and create visualizations
● Manage your data science workflow using Drake
● Create reusable tools from one-liners and existing Python or R code
● Parallelize and distribute data-intensive pipelines using GNU Parallel
● Model data with dimensionality reduction, clustering, regression, and classification algorithms
Chapter 1 Introduction
Overview
Data Science Is OSEMN
Intermezzo Chapters
What Is the Command Line?
Why Data Science at the Command Line?
A Real-World Use Case
Further Reading
Chapter 2 Getting Started
Overview
Setting Up Your Data Science Toolbox
Essential Concepts and Tools
Further Reading
Chapter 3 Obtaining Data
Overview
Copying Local Files to the Data Science Toolbox
Decompressing Files
Converting Microsoft Excel Spreadsheets
Querying Relational Databases
Downloading from the Internet
Calling Web APIs
Further Reading
Chapter 4 Creating Reusable Command-Line Tools
Overview
Converting One-Liners into Shell Scripts
Creating Command-Line Tools with Python and R
Further Reading
Chapter 5 Scrubbing Data
Overview
Common Scrub Operations for Plain Text
Working with CSV
Working with HTML/XML and JSON
Common Scrub Operations for CSV
Further Reading
Chapter 6 Managing Your Data Workflow
Overview
Introducing Drake
Installing Drake
Obtain Top Ebooks from Project Gutenberg
Every Workflow Starts with a Single Step
Well, That Depends
Rebuilding Specific Targets
Discussion
Further Reading
Chapter 7 Exploring Data
Overview
Inspecting Data and Its Properties
Computing Descriptive Statistics
Creating Visualizations
Further Reading
Chapter 8 Parallel Pipelines
Overview
Serial Processing
Parallel Processing
Distributed Processing
Discussion
Further Reading
Chapter 9 Modeling Data
Overview
More Wine, Please!
Dimensionality Reduction with Tapkee
Clustering with Weka
Regression with SciKit-Learn Laboratory
Classification with BigML
Further Reading
Chapter 10 Conclusion
Let’s Recap
Three Pieces of Advice
Where to Go from Here?
Getting in Touch