This hands-on guide demonstrates how the flexibility of the command line can help you become a more efficient and productive data scientist. You’ll learn how to combine small, yet powerful, command-line tools to quickly obtain, scrub, explore, and model your data.
To get you started—whether you’re on Windows, OS X, or Linux—author Jeroen Janssens introduces the Data Science Toolbox, an easy-to-install virtual environment packed with over 80 command-line tools.
Discover why the command line is an agile, scalable, and extensible technology. Even if you’re already comfortable processing data with, say, Python or R, you’ll greatly improve your data science workflow by also leveraging the power of the command line.
● Obtain data from websites, APIs, databases, and spreadsheets
● Perform scrub operations on plain text, CSV, HTML/XML, and JSON
● Explore data, compute descriptive statistics, and create visualizations
● Manage your data science workflow using Drake
● Create reusable tools from one-liners and existing Python or R code
● Parallelize and distribute data-intensive pipelines using GNU Parallel
● Model data with dimensionality reduction, clustering, regression, and classification algorithms
Chapter 1 Introduction
Overview
Data Science Is OSEMN
Intermezzo Chapters
What Is the Command Line?
Why Data Science at the Command Line?
A Real-World Use Case
Further Reading
Chapter 2 Getting Started
Overview
Setting Up Your Data Science Toolbox
Essential Concepts and Tools
Further Reading
Chapter 3 Obtaining Data
Overview
Copying Local Files to the Data Science Toolbox
Decompressing Files
Converting Microsoft Excel Spreadsheets
Querying Relational Databases
Downloading from the Internet
Calling Web APIs
Further Reading
Chapter 4 Creating Reusable Command-Line Tools
Overview
Converting One-Liners into Shell Scripts
Creating Command-Line Tools with Python and R
Further Reading
Chapter 5 Scrubbing Data
Overview
Common Scrub Operations for Plain Text
Working with CSV
Working with HTML/XML and JSON
Common Scrub Operations for CSV
Further Reading
Chapter 6 Managing Your Data Workflow
Overview
Introducing Drake
Installing Drake
Obtain Top Ebooks from Project Gutenberg
Every Workflow Starts with a Single Step
Well, That Depends
Rebuilding Specific Targets
Discussion
Further Reading
Chapter 7 Exploring Data
Overview
Inspecting Data and Its Properties
Computing Descriptive Statistics
Creating Visualizations
Further Reading
Chapter 8 Parallel Pipelines
Overview
Serial Processing
Parallel Processing
Distributed Processing
Discussion
Further Reading
Chapter 9 Modeling Data
Overview
More Wine, Please!
Dimensionality Reduction with Tapkee
Clustering with Weka
Regression with SciKit-Learn Laboratory
Classification with BigML
Further Reading
Chapter 10 Conclusion
Let’s Recap
Three Pieces of Advice
Where to Go from Here?
Getting in Touch