This hands-on guide demonstrates how the flexibility of the command line can help you become a more efficient and productive data scientist. You’ll learn how to combine small, yet powerful, command-line tools to quickly obtain, scrub, explore, and model your data.
To get you started—whether you’re on Windows, OS X, or Linux—author Jeroen Janssens introduces the Data Science Toolbox, an easy-to-install virtual environment packed with over 80 command-line tools.
Discover why the command line is an agile, scalable, and extensible technology. Even if you’re already comfortable processing data with, say, Python or R, you’ll greatly improve your data science workflow by also leveraging the power of the command line.
● Obtain data from websites, APIs, databases, and spreadsheets
● Perform scrub operations on plain text, CSV, HTML/XML, and JSON
● Explore data, compute descriptive statistics, and create visualizations
● Manage your data science workflow using Drake
● Create reusable tools from one-liners and existing Python or R code
● Parallelize and distribute data-intensive pipelines using GNU Parallel
● Model data with dimensionality reduction, clustering, regression, and classification algorithms
Chapter 1 Introduction
  Overview
  Data Science Is OSEMN
  Intermezzo Chapters
  What Is the Command Line?
  Why Data Science at the Command Line?
  A Real-World Use Case
  Further Reading
Chapter 2 Getting Started
  Overview
  Setting Up Your Data Science Toolbox
  Essential Concepts and Tools
  Further Reading
Chapter 3 Obtaining Data
  Overview
  Copying Local Files to the Data Science Toolbox
  Decompressing Files
  Converting Microsoft Excel Spreadsheets
  Querying Relational Databases
  Downloading from the Internet
  Calling Web APIs
  Further Reading
Chapter 4 Creating Reusable Command-Line Tools
  Overview
  Converting One-Liners into Shell Scripts
  Creating Command-Line Tools with Python and R
  Further Reading
Chapter 5 Scrubbing Data
  Overview
  Common Scrub Operations for Plain Text
  Working with CSV
  Working with HTML/XML and JSON
  Common Scrub Operations for CSV
  Further Reading
Chapter 6 Managing Your Data Workflow
  Overview
  Introducing Drake
  Installing Drake
  Obtain Top Ebooks from Project Gutenberg
  Every Workflow Starts with a Single Step
  Well, That Depends
  Rebuilding Specific Targets
  Discussion
  Further Reading
Chapter 7 Exploring Data
  Overview
  Inspecting Data and Its Properties
  Computing Descriptive Statistics
  Creating Visualizations
  Further Reading
Chapter 8 Parallel Pipelines
  Overview
  Serial Processing
  Parallel Processing
  Distributed Processing
  Discussion
  Further Reading
Chapter 9 Modeling Data
  Overview
  More Wine, Please!
  Dimensionality Reduction with Tapkee
  Clustering with Weka
  Regression with SciKit-Learn Laboratory
  Classification with BigML
  Further Reading
Chapter 10 Conclusion
  Let’s Recap
  Three Pieces of Advice
  Where to Go from Here?
  Getting in Touch