This hands-on guide demonstrates how the flexibility of the command line can help you become a more efficient and productive data scientist. You’ll learn how to combine small, yet powerful, command-line tools to quickly obtain, scrub, explore, and model your data.
To get you started—whether you’re on Windows, OS X, or Linux—author Jeroen Janssens introduces the Data Science Toolbox, an easy-to-install virtual environment packed with over 80 command-line tools.
Discover why the command line is an agile, scalable, and extensible technology. Even if you’re already comfortable processing data with, say, Python or R, you’ll greatly improve your data science workflow by also leveraging the power of the command line.
● Obtain data from websites, APIs, databases, and spreadsheets
● Perform scrub operations on plain text, CSV, HTML/XML, and JSON
● Explore data, compute descriptive statistics, and create visualizations
● Manage your data science workflow using Drake
● Create reusable tools from one-liners and existing Python or R code
● Parallelize and distribute data-intensive pipelines using GNU Parallel
● Model data with dimensionality reduction, clustering, regression, and classification algorithms
Chapter 1 Introduction
Overview
Data Science Is OSEMN
Intermezzo Chapters
What Is the Command Line?
Why Data Science at the Command Line?
A Real-World Use Case
Further Reading
Chapter 2 Getting Started
Overview
Setting Up Your Data Science Toolbox
Essential Concepts and Tools
Further Reading
Chapter 3 Obtaining Data
Overview
Copying Local Files to the Data Science Toolbox
Decompressing Files
Converting Microsoft Excel Spreadsheets
Querying Relational Databases
Downloading from the Internet
Calling Web APIs
Further Reading
Chapter 4 Creating Reusable Command-Line Tools
Overview
Converting One-Liners into Shell Scripts
Creating Command-Line Tools with Python and R
Further Reading
Chapter 5 Scrubbing Data
Overview
Common Scrub Operations for Plain Text
Working with CSV
Working with HTML/XML and JSON
Common Scrub Operations for CSV
Further Reading
Chapter 6 Managing Your Data Workflow
Overview
Introducing Drake
Installing Drake
Obtain Top Ebooks from Project Gutenberg
Every Workflow Starts with a Single Step
Well, That Depends
Rebuilding Specific Targets
Discussion
Further Reading
Chapter 7 Exploring Data
Overview
Inspecting Data and Its Properties
Computing Descriptive Statistics
Creating Visualizations
Further Reading
Chapter 8 Parallel Pipelines
Overview
Serial Processing
Parallel Processing
Distributed Processing
Discussion
Further Reading
Chapter 9 Modeling Data
Overview
More Wine, Please!
Dimensionality Reduction with Tapkee
Clustering with Weka
Regression with SciKit-Learn Laboratory
Classification with BigML
Further Reading
Chapter 10 Conclusion
Let’s Recap
Three Pieces of Advice
Where to Go from Here?
Getting in Touch