This hands-on guide demonstrates how the flexibility of the command line can help you become a more efficient and productive data scientist. You’ll learn how to combine small, yet powerful, command-line tools to quickly obtain, scrub, explore, and model your data.
To get you started—whether you’re on Windows, OS X, or Linux—author Jeroen Janssens introduces the Data Science Toolbox, an easy-to-install virtual environment packed with over 80 command-line tools.
Discover why the command line is an agile, scalable, and extensible technology. Even if you’re already comfortable processing data with, say, Python or R, you’ll greatly improve your data science workflow by also leveraging the power of the command line.
● Obtain data from websites, APIs, databases, and spreadsheets
● Perform scrub operations on plain text, CSV, HTML/XML, and JSON
● Explore data, compute descriptive statistics, and create visualizations
● Manage your data science workflow using Drake
● Create reusable tools from one-liners and existing Python or R code
● Parallelize and distribute data-intensive pipelines using GNU Parallel
● Model data with dimensionality reduction, clustering, regression, and classification algorithms
Chapter 1 Introduction
Overview
Data Science Is OSEMN
Intermezzo Chapters
What Is the Command Line?
Why Data Science at the Command Line?
A Real-World Use Case
Further Reading
Chapter 2 Getting Started
Overview
Setting Up Your Data Science Toolbox
Essential Concepts and Tools
Further Reading
Chapter 3 Obtaining Data
Overview
Copying Local Files to the Data Science Toolbox
Decompressing Files
Converting Microsoft Excel Spreadsheets
Querying Relational Databases
Downloading from the Internet
Calling Web APIs
Further Reading
Chapter 4 Creating Reusable Command-Line Tools
Overview
Converting One-Liners into Shell Scripts
Creating Command-Line Tools with Python and R
Further Reading
Chapter 5 Scrubbing Data
Overview
Common Scrub Operations for Plain Text
Working with CSV
Working with HTML/XML and JSON
Common Scrub Operations for CSV
Further Reading
Chapter 6 Managing Your Data Workflow
Overview
Introducing Drake
Installing Drake
Obtain Top Ebooks from Project Gutenberg
Every Workflow Starts with a Single Step
Well, That Depends
Rebuilding Specific Targets
Discussion
Further Reading
Chapter 7 Exploring Data
Overview
Inspecting Data and Its Properties
Computing Descriptive Statistics
Creating Visualizations
Further Reading
Chapter 8 Parallel Pipelines
Overview
Serial Processing
Parallel Processing
Distributed Processing
Discussion
Further Reading
Chapter 9 Modeling Data
Overview
More Wine, Please!
Dimensionality Reduction with Tapkee
Clustering with Weka
Regression with SciKit-Learn Laboratory
Classification with BigML
Further Reading
Chapter 10 Conclusion
Let’s Recap
Three Pieces of Advice
Where to Go from Here?
Getting in Touch