• View by Day

    Day One

    Sunday, April 8

    8:00AM

    Breakfast/Registration

    4th Floor Lobby

    9:00AM-12:30PM

    • Tutorial 1: Introduction to Machine Learning with scikit-learn

      David Mertz

      In this hands-on tutorial, we will explore machine learning examples (supervised and unsupervised) using the scikit-learn package. Some prior experience with Python tools for data science (e.g., NumPy, Pandas, Matplotlib) is recommended.
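
      For readers new to the stack, here is a minimal sketch (not the tutorial's actual notebook) of the two flavors of example the session covers, using scikit-learn's bundled iris data:

        from sklearn.cluster import KMeans
        from sklearn.datasets import load_iris
        from sklearn.linear_model import LogisticRegression
        from sklearn.model_selection import train_test_split

        X, y = load_iris(return_X_y=True)

        # Supervised: fit a classifier on labeled data, score on a held-out split.
        X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
        clf = LogisticRegression(max_iter=200).fit(X_train, y_train)
        print("test accuracy:", clf.score(X_test, y_test))

        # Unsupervised: cluster the same features without looking at the labels.
        km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
        print("cluster sizes:", [int((km.labels_ == k).sum()) for k in range(3)])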

    • Tutorial 2: Packaging in the conda Ecosystem

      Michael Sarahan

      It's hard to keep multiple moving parts of Python environments all playing nicely together. This tutorial will cover Python packaging, including how incompatibilities arise, and tools and techniques for keeping your software working.

    12:30PM

    Lunch (provided)

    1:30PM-5:00PM

    • Tutorial 3: Up & Running with Anaconda Enterprise

      Kris Overholt & Daniel Rodriguez

      Participants will install and configure the Anaconda Enterprise data science platform while considering cluster sizing, operational tasks, and use cases. Then we'll perform hands-on data science lifecycle work, including implementing machine learning models with scikit-learn and TensorFlow, and moving from development to production by training models in notebooks and deploying REST APIs.

    • Tutorial 4: Practical Data Science and ML with GPUs

      Stan Seibert

      Learn how to use Anaconda to accelerate deep learning with GPUs using Keras and TensorFlow. This tutorial will walk you through data preparation, basic model design, training and model evaluation, and tips for model deployment.

    5:00PM-6:00PM

    Break

    6:00PM

    8:00PM

    Opening Reception

    4th Floor Lobby

    Day Two

    Monday, April 9

    9:00AM-10:00AM

    Breakfast/Registration

    4th Floor Lobby

    10:00AM-10:50AM

    • Opening Keynote

      Peter Wang, Co-Founder & CTO of Anaconda

      Peter will kick things off with an exploration of the co-evolution of data science, data-driven business, and the open source Python community. He then will share how data scientists can help their companies better understand the power and limitations of their work, especially in today's noisy and frothy technology landscape.

    11:00AM-11:50AM

    • Quick and Easy TensorFlow on AE5

      Anaconda Michael Grant

      In this talk, we will show you how to make the most of TensorFlow under Anaconda Enterprise 5. Creating AE5 projects with TensorFlow is a snap, thanks to the Anaconda packaging ecosystem. We'll show you how to take an à la carte approach to its capabilities—for instance, by feeding data directly from your Pandas-based pipelines into a TensorFlow model. And of course, we'll give you an overview of the learning methods available in the package, and demonstrate several of them in operation on AE5.

    • Fast Feature Evaluation with MCPT and Numba

      Real World David Patschke

      One of the challenges when working with high-dimensional data is quickly being able to find the independent variables that most strongly influence the dependent variable. Unfortunately, the higher the dimensionality, the more likely that any influential independent variable selected may be influential by random luck. Enter Monte Carlo Permutation Testing (MCPT). By permuting the dependent variable many times, calculating information measures on these permutations, and comparing these measures to the actual information measure, a practitioner can be more confident that a selected variable truly will be informative. Until recently, the ability to execute something like this efficiently within Python was rather challenging. However, with the recently added ParallelAccelerator functionality within Numba, this can be executed in a single memory space at a blisteringly fast pace all in native Python. The goal of this talk is not only to introduce the concept of MCPT, but to inspire others to explore using the ParallelAccelerator functionality within Numba for significant speed advantages.
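
      As a rough illustration of the idea (not the speaker's code), here is a hedged sketch of MCPT accelerated with Numba's ParallelAccelerator, using squared correlation as a stand-in information measure:

        import numpy as np
        from numba import njit, prange

        @njit
        def info_measure(x, y):
            # Squared Pearson correlation as a toy information measure.
            xm, ym = x - x.mean(), y - y.mean()
            denom = np.sqrt((xm ** 2).sum() * (ym ** 2).sum())
            return ((xm * ym).sum() / denom) ** 2

        @njit(parallel=True)
        def mcpt(x, y, n_perm):
            actual = info_measure(x, y)
            exceed = 0
            for i in prange(n_perm):  # permutations run across all cores
                if info_measure(x, np.random.permutation(y)) >= actual:
                    exceed += 1
            # Fraction of shuffled pairings at least as "informative" as the real one.
            return actual, (exceed + 1) / (n_perm + 1)

        x = np.random.randn(10_000)
        y = 0.1 * x + np.random.randn(10_000)
        print(mcpt(x, y, 1_000))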

    • Deep Learning with Just a Little Bit of Data

      Open Source Michael Bernico

      There’s no question that deep learning is changing the field of machine learning at an extremely rapid pace. Given enough data, deep learning can solve problems we couldn’t imagine just a few years ago. But what do we do when there isn’t enough data? Can we still apply deep learning when we only have hundreds, or thousands of data points? In this talk we will discuss doing deep learning with very little data. We will discuss the topic of transfer learning, which we find to be immensely useful for the business applications of deep learning. Finally, we will present some original research that shows just how far we can go with transfer learning on very small volumes of data.

    11:50AM-1:00PM

    Lunch (provided) and Sponsor Showcase

    1:00PM-1:50PM

    • Deploying Python and R to Spark and Hadoop

      Anaconda Daniel Rodriguez

      Anaconda Enterprise provides an easy yet powerful architecture that allows people to connect from interactive sessions and deployments to running Spark clusters from Python and R.  We will take a look at the Anaconda Enterprise 5 architecture for connecting to Hadoop/Spark clusters powered by Sparkmagic and Apache Livy (incubating), while taking a look at the benefits of this architecture and how it allows users to securely and easily connect to remote Hadoop/Spark clusters. We also will look at how Anaconda Enterprise enables users to do runtime distribution of custom Anaconda installers using Cloudera Parcels and Ambari Management Packs, allowing data scientists to ship Anaconda environments and leverage libraries from Anaconda.

    • Achoo: Using Machine Learning to Fight My Son’s Asthma

      Real World Tim Dobbins

      Achoo uses a Raspberry Pi to predict if Tim’s son will need his inhaler on any given day using weather, pollen, and air quality data. If the prediction for a given day is above a specified threshold, the Pi will email both Tim and the school nurse, notifying her that he may need preemptive treatment. The system is designed to be language-agnostic with regard to the predictive models used. The backend is built with Python/Flask.

    • DUKE: Dataset Understanding through Knowledge-base Embeddings

      Open Source David Sullivan

      "DUKE: Dataset Understanding through Knowledge-base Embeddings" produces abstractive descriptions of datasets based on word2vec model trained on wikipedia paired with a curated ontology. For those familiar with word2vec, you can think of DUKE as essentially "dataset2vec". This talk will discuss the technology behind DUKE, how DUKE can be used to improve the data science and data engineering process, and how the audience can download and use the software.

    2:00PM-2:50PM

    • Architecting AE5 Deployments

      Anaconda Kris Overholt

      We’ll discuss top-of-mind administrative considerations for implementing Anaconda Enterprise in your environments and how to move between multiple IT environments, such as testing, QA, and production. Discover best practices for network, security, and performance that allow Anaconda Enterprise to connect to existing infrastructure, security, and compute/storage resources.

    • Accelerating Scientific Workloads with Numba

      Real World Siu Kwan Lam

      This talk introduces Numba and demonstrates how its core features (e.g. JIT compilation, automatic parallelization, GPU programming) can speed up scientific workloads with minimal code changes. We also will discuss the upcoming features of Numba.

    • Getting Started with Anaconda Distribution

      Open Source Albert DeFusco

      What is it that has made the Anaconda Distribution so successful, and how can I make the best use of it? This talk will take you on a tour of the Anaconda Distribution and show you how to do powerful open source data science using the tools and libraries included in Anaconda.

    2:50PM-3:10PM

    Afternoon Break and Sponsor Showcase

    3:10PM-4:00PM

    • Enterprise Package Governance

      Anaconda Duane Lawrence

      While the benefits of open source are clear to data scientists, enterprise IT administrators have some concerns. Just who exactly authored these open source packages? How can we be sure that they are secure? When data science teams create their own packages, how can these internal tools be shared and governed securely? This talk will walk through best practices for enterprise package governance. IT admins will learn how to securely manage open source packages, strategies for whitelisting and blacklisting, and how to easily share internal packages securely.

    • Deep Learning Using Python on Anaconda Distribution to Identify Distracted Driver Behavior

      Real World Sripriya Sundararaman

      I will present the experiences our team has had using Deep Learning on the Anaconda framework. Our solution involves building CNN models on 2D and 3D images of drivers to identify and classify distracted driving behavior. The talk will discuss the challenges, the algorithms we used, and Python solutions for collecting data from the 3D camera sensors and 2D camera, preprocessing data, creating our models, and predicting outcomes.

    • Production-Grade Packaging with Anaconda

      Open Source Mahmoud Hashemi

      Anaconda always has been a powerful platform for data analysts and scientists across the Python world. The same reasons that make it work for those groups also apply to engineers building and shipping scalable services: easy access to prebuilt packages, including system packages not managed by pip, and other packages not conveniently provided by the operating system. This talk will cover using conda and conda envs in real-world industrial settings, what makes conda special for software engineers, and the challenges and goals of packaging. Mahmoud will provide real-world examples using Anaconda to build an OS package (RPM) and Docker images.

    4:10PM-5:00PM

    • Using Advanced Analytics to Develop a Risk-Based Approach to Asset Management

      Anaconda Will Collins

      In this talk, Will is going to illustrate how, by embracing data science and open source tools within the Anaconda platform, National Grid is building a risk-based framework to effectively manage its Electricity Transmission assets. Attendees will learn how National Grid's Analytics team has used Python to drive efficiencies within the business, developing tools that allow engineers to streamline their processes.

    • Data Engineering for Data Scientists

      Real World Max Humber

      When models and data applications are pushed to production, they become brittle black boxes that can and will break. In this talk you’ll learn how to one-up your data science workflow with a little engineering! Or more specifically, how to improve the reliability and quality of your data applications... all so that your models won’t break (or at least won’t break as often)! Examples for this session will be in Python 3.6+ and will rely on: logging to allow us to debug and diagnose things while they’re running, Click to develop “beautiful” command line interfaces with minimal boilerplate, and Pytest to write short, elegant, and maintainable tests.

    • Convolutional Neural Networks (CNNs): A Game-Changer for Computer Vision

      Open Source Tassos Sarbanes

      This talk will present a brief history of the evolution and revolution of computer vision and image processing. We’ll review important studies including Larry Roberts’s Block World, the Summer Vision Project at MIT, Vision by David Marr, Explaining Visual Science by David Lowe, Normalized Cut by Shi & Malik, and Face Detection by Viola & Jones. The focus of the talk will be the introduction of convolutional neural networks (CNNs) and their huge impact on the computer vision space. Code examples based on Jupyter Notebooks will be presented using the PASCAL Visual Object Classes and ImageNet (WordNet) datasets. We’ll also cover general topics in machine learning and deep learning related to visual recognition, such as object detection, action classification, and image captioning.

    Day Three

    Tuesday, April 10

    8:00AM-9:00AM

    Breakfast/Registration

    4th Floor Lobby

    9:00AM-9:50AM

    • Keynote Address

      John Kim, President of HomeAway

      Did you know that 800 million jobs will disappear in the next 13 years? Many people feel unprepared for the speed and magnitude of the changes that are ahead. In this keynote, Love in the Age of Machine Learning, John will explore how our humanity and the ability to love will be a competitive advantage and shape our future.

    10:00AM-10:50AM

    • conda: Tips & Tricks

      Anaconda Kale Franz

      Anaconda solved one of the most headache-inducing problems in data science—overcoming the dependency nightmare—through the power of conda. Powering Anaconda Distribution at its core, our open source, cross-platform package and environment manager is beloved by data scientists because it enables them to easily install and manage their favorite packages from Python and other languages. In addition, data scientists can create virtual conda environments, in which they are free to install Python, pip, and any number of other packages. This means a data scientist can easily flip between projects using Python 2 and Python 3, for example. Kale will demonstrate these key features and many more.

    • Building Better Badass Cars

      Real World Peter Buschbacher

      Cars are incredibly difficult to manufacture. The fracture between IT and Business has forced a lot of analytical development down the drain in the past. However, with current capabilities for data extraction, analysis, and computation, vehicle production is being continuously improved upon. This talk will focus on how analysis, questioning, and foundational data science have helped plant managers across the globe bring solutions to difficult problems in the manufacturing sphere.

    • Building a Data Science Team using Open Source Data Science

      Open Source Katrina Riehl

      Open source data science technologies have changed the face of building and operating a data science organization. In this talk, Katrina will explore how and why open source technologies are necessary for the success of businesses hoping to use data science and machine learning to power innovation. She will discuss how HomeAway.com is using tools like Anaconda, conda, and other Python-powered open source libraries to change how they look at their market and stay competitive. She will also discuss her journey in making Python a first-class citizen in a traditionally Java-based organization while growing a data science team from the ground up.

    11:00AM-11:50AM

    • Scalable Machine Learning with Dask

      Anaconda Tom Augspurger

      Scikit-Learn, NumPy, and pandas form a great toolkit for single-machine, in-memory analytics. Scaling them to larger datasets can be difficult, as you have to adjust your workflow to use chunking or incremental learners. Dask provides NumPy- and pandas-like data containers for manipulating larger than memory datasets, and dask-ml provides estimators and utilities for modeling larger than memory datasets. These tools scale your usual workflow out to larger datasets. We’ll discuss some of the challenges data scientists run into when scaling out to larger datasets. We’ll then focus on demonstrations of how dask and dask-ml solve those challenges. We’ll see examples of how dask can expose a cluster of machines to scikit-learn’s built-in parallelization framework. We’ll see how dask-ml can train estimators on large datasets.
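
      As a flavor of that last point, here is a hedged sketch (assuming a dask.distributed cluster and the dask joblib backend) of scikit-learn's own parallelism being dispatched to Dask workers:

        import joblib
        from dask.distributed import Client
        from sklearn.datasets import make_classification
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.model_selection import GridSearchCV

        client = Client()  # local cluster here; pass a scheduler address in production

        X, y = make_classification(n_samples=10_000, random_state=0)
        search = GridSearchCV(RandomForestClassifier(random_state=0),
                              {"n_estimators": [50, 100], "max_depth": [4, 8]})

        # scikit-learn's built-in parallelization now runs across the cluster.
        with joblib.parallel_backend("dask"):
            search.fit(X, y)
        print(search.best_params_)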

    • Machine Learning Crash Course

      Real World Samuel Taylor

      Machine learning is surrounded by so much hype it can seem like magic. Learn the math behind the magic in this whirlwind tour of machine learning. After spending a few minutes learning about machine learning theory, we'll jump right into practice with three different use cases: teaching a computer sign language (supervised learning); predicting hourly energy load in the state of Texas (time series/forecasting); and using machine learning to find your next job (recommender systems—content-based filtering). With each use case, we'll discover new techniques applicable in real-world machine learning problems.

    • GPU-Accelerating UDFs in PySpark with Numba and PyGDF

      Open Source Joshua Patterson & Keith Kraus

      With advances in computer hardware such as 10 gigabit network cards, infiniband, and solid state drives all becoming commodity offerings, the new bottleneck in big data technologies is very commonly the processing power of the CPU. In order to meet the computational demand desired by users, enterprises have had to resort to extreme scale out approaches just to get the processing power they need. One of the most well known technologies in this space, Apache Spark, has numerous enterprises publicly talking about the challenges in running multiple 1000+ node clusters to give their users the processing power they need. This talk is based on work completed by NVIDIA’s Applied Solutions Engineering team. Attendees will learn how they were able to GPU-accelerate UDFs in PySpark using open source technologies such as Numba and PyGDF, the lessons they learned in the process, and how they were able to accelerate workloads in a fraction of the hardware footprint.
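
      For a taste of the building block involved (an illustrative sketch, not NVIDIA's code), Numba can compile a NumPy-style elementwise UDF into a CUDA kernel; wiring it into Spark is omitted here:

        import numpy as np
        from numba import vectorize

        # Compile an elementwise UDF for the GPU (requires an NVIDIA GPU + CUDA toolkit).
        @vectorize(["float32(float32, float32)"], target="cuda")
        def adjusted_total(price, qty):
            return price * qty * 1.08  # hypothetical 8% adjustment

        price = np.random.rand(1_000_000).astype(np.float32)
        qty = np.random.randint(1, 10, 1_000_000).astype(np.float32)
        print(adjusted_total(price, qty)[:5])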

    11:50AM-1:00PM

    Lunch (provided) and Sponsor Showcase

    1:00PM-1:50PM

    • Access All of Your Data with Anaconda

      Anaconda Albert DeFusco

      What do you do when you have data on Hadoop, SQL databases, CSV files, JSON files, SAS, and online? Learn how to utilize Anaconda to read data from multiple sources to build models and visualizations.

    • Learning From Our Learners

      Real World Jonathan Cornelissen

      DataCamp is building the future of data science education. We’ll take a quick look at our learning interfaces and introduce our learning philosophy. Moreover, we’ll have a look at interesting data from DataCamp students (now over 2.3 million) and how we use that data to improve the learning experience. We’ll conclude the presentation with a deep dive into a new product we’re launching in collaboration with an amazing partner.

    • Real-Time Processing with Dask

      Open Source Matt Rocklin

      Dask is a tool for parallel and distributed processing in Python often known for parallelizing subsets of libraries like NumPy, Pandas, and Scikit-Learn. However, Dask also includes a low-level real-time task scheduler capable of asynchronous computation. This talk describes how to leverage the internal engine of Dask to build responsive distributed systems that react to real-world events with computation in a resilient and scalable manner. We will start with a foundational futures API, and build on that to include async-await functionality and streaming dataframes.
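
      A hedged sketch of the foundational futures API the talk starts from (a local cluster stands in for a real deployment):

        from dask.distributed import Client, as_completed

        def score(event):
            return event * 2  # stand-in for real per-event computation

        client = Client()  # local cluster for illustration

        futures = [client.submit(score, e) for e in range(10)]
        for fut in as_completed(futures):   # results stream back as they finish
            result = fut.result()
            if result > 10:                 # react to events in real time
                print("alert:", result)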

    2:00PM-2:50PM

    • Model Management with Anaconda Enterprise

      Anaconda Michael Grant

      In this talk, we present an approach that leverages best practices of source code management combined with Anaconda's conda package management system to deliver a model management workflow sufficiently flexible to cover a wide range of applications. Our aim is to cover all but the most lightweight, loosely governed models. The features of our approach include: manageable separation of training data, model infrastructure, and model design; leveraging source control for model versioning, approval processes, and experimental models; building conda packages for reliable, reproducible model deployment; and using Anaconda repositories to ensure historical model archiving.

    • What You Gonna Do With All That Malware: Malware Analysis and Machine Learning When You Can’t Fit It All On One Server

      Real World Austin West & Drew Bonasera

      MultiScanner is an open source malware analysis framework that assists the user in evaluating a set of files by automatically running a suite of tools and aggregating the output. The true power of this system is that it stores all the outputs from all of an analyst’s malware analysis tools in one highly performant, searchable, and scalable data store.  This talk will focus on one such analytic known as Exe-MANA. Exe-MANA is a deep neural network written entirely in Python for detecting if a Portable Executable file is malicious or benign using only static analysis. Exe-MANA is a great example of how easy it is to prototype data science techniques in Python with little to no experience in data science. Austin and Drew will go over the basic process for building Exe-MANA, and how they leverage MultiScanner to help speed up this process and continue the training process as they get new data.

    • DeepFashion: Building a REST API to Detect Clothing Styles

      Open Source Paige Bailey

      In this hands-on tutorial, we’ll be using image recognition to take an existing deep learning model and adapt it to our own specialized domain (namely: guessing whether articles of clothing are preppy, sporty, punk, etc.). Instead of using a more data-intensive classifier, like a Residual Network, we’ll be using deep transfer learning to overcome our data scarcity problem and to build on top of an existing model. Once our transfer learning model has been trained, we’ll pack it up into a dockerized container (specifying inputs and outputs, as well as a score.py file), and then call it as a web service. We will also discuss a #DataOps process for refreshing the model as trends change over time. By the end of this talk, you (or at least your model!) will know how to select the perfect outfit for any occasion.
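
      A minimal sketch of the transfer-learning step described above, assuming a small labeled set of style images; the base network, input size, and class names are illustrative, not the presenter's actual model:

        from tensorflow.keras import layers, models
        from tensorflow.keras.applications import VGG16

        base = VGG16(weights="imagenet", include_top=False, pooling="avg",
                     input_shape=(224, 224, 3))
        base.trainable = False  # freeze pretrained features; train only the new head

        model = models.Sequential([
            base,
            layers.Dense(128, activation="relu"),
            layers.Dense(4, activation="softmax"),  # e.g., preppy/sporty/punk/other
        ])
        model.compile(optimizer="adam", loss="categorical_crossentropy",
                      metrics=["accuracy"])
        # model.fit(train_images, train_labels, epochs=5)  # hypothetical small dataset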

    2:50PM-3:10PM

    Afternoon Break and Sponsor Showcase

    3:10PM-4:00PM

    • PyViz: Dashboards for Visualizing 1 Billion Datapoints in 30 lines of Python in Your Browser on Your Laptop

      Anaconda James Bednar

      James will present an overall workflow for building interactive dashboards visualizing billions of data points interactively in a Jupyter notebook, with graphical widgets allowing control over data selection, filtering, and display options—all using only a few dozen lines of code!
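
      The aggregation trick at the heart of such dashboards looks roughly like this (a datashader-only sketch with synthetic points, not the talk's dashboard code):

        import datashader as ds
        import datashader.transfer_functions as tf
        import numpy as np
        import pandas as pd

        n = 10_000_000
        points = pd.DataFrame({"x": np.random.standard_normal(n),
                               "y": np.random.standard_normal(n)})

        canvas = ds.Canvas(plot_width=600, plot_height=600)
        agg = canvas.points(points, "x", "y")     # bin millions of points into a grid
        tf.shade(agg, how="log").to_pil().save("points.png")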

    • How to Make Your Data Scientists Happy - a Use-case Backed Approach for Enabling Data Science in the Enterprise

      Real World Hussain Sultan & Tim Horan

      Enabling data scientists within an enterprise requires a well-thought out approach from an organization, technology, and business results perspective. In this talk, Tim and Hussain will share common pitfalls to data science enablement in the enterprise and provide their recommendations to avoid them. Taking an example, actionable use case from the financial services industry, they will focus on how Anaconda plays a pivotal role in setting up big data infrastructure, integrating data science experimentation and production environments, and deploying insights to production. Along the way, they will highlight opportunities for leveraging open source and unleashing data science teams while meeting regulatory and compliance challenges.

    • Jumpstart Writing Continuous Applications with Structured Streaming Python APIs in Apache Spark

      Open Source Jules Damji

      We are in the midst of a Big Data Zeitgeist in which data comes at us fast, in myriad forms and formats at intermittent intervals or in a continuous stream, and we need to respond to streaming data immediately. This need has created a notion of writing a streaming application that reacts and interacts with data in real-time. We call this a continuous application. In this talk we will explore the concepts and motivations behind continuous applications and how Structured Streaming Python APIs in Apache Spark 2.x enables writing them. We also will examine the programming model behind Structured Streaming and the APIs that support them. Through a short demo and code examples, Jules will demonstrate how to write an end-to-end Structured Streaming application that reacts and interacts with both real-time and historic data to perform advanced analytics using Spark SQL, DataFrames, and Datasets APIs.

    4:10PM-5:00PM

    • How to Optimize Physical Assets Through ML-Powered Predictive Maintenance

      Anaconda Sourav Dey & Rajendra Koppula

      With decreasing sensor, communication, storage and compute costs, it is now possible to collect vast amounts of data from physical assets like mechanical equipment, heavy earth moving equipment, and factory assembly lines. With recent advances, companies can now apply Machine Learning to all that data to optimize physical assets with the goal of achieving zero downtime. Using a real world case study, Sourav will demonstrate: 1) how to sample a time series dataset to generate training and validation datasets; 2) how to deal with non-uniformly sampled time series and other data quality issues; 3) how to use Dask to parallelize compute intensive feature creation to handle the volume and velocity of sensor data; 4) how to apply classic ML techniques like Random Forest trees; and 5) results and outcomes.

    • Apache Arrow: A Cross-Language Development Platform for In-Memory Data

      Real World Wes McKinney

      In this talk, Wes will discuss the ongoing community efforts since 2016 building Apache Arrow, a new open source project for high-performance analytics and data interoperability. Wes will explain the rationale for the project and the development that has occurred over the last two-and-a-half years. He will cover some downstream use cases in Apache Spark and other open source data processing projects, and also will look at the future roadmap for Arrow as it relates to data science and machine learning applications.
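
      For a concrete sense of what Arrow interoperability looks like from Python, here is a hedged pyarrow sketch round-tripping a pandas DataFrame through Arrow's columnar format:

        import pandas as pd
        import pyarrow as pa
        import pyarrow.feather as feather

        df = pd.DataFrame({"id": [1, 2, 3], "value": [0.1, 0.2, 0.3]})

        table = pa.Table.from_pandas(df)  # language-agnostic columnar layout
        feather.write_feather(table, "values.feather")
        print(feather.read_table("values.feather").to_pandas().equals(df))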

    • Building an Enterprise Analytics Competency Center (ACC)

      Open Source Ann Manchella & Jim Ogle

      In this presentation, walk through PNC’s two-year journey from creating an Enterprise Data Management Group to the successful redevelopment of a forecasting model in Python. Travel with PNC on their expedition through platform ownership nuances, empowering and training LOB users in Python, their engagement model with the lines-of-defense teams, the services offered by the Analytics team, the challenges of open source libraries and packages in a cybersecurity-challenged banking industry, and much more.

    5:00PM-6:30PM

    Break

    6:30PM-10:30PM

    Shuttles Running between JW Marriott and Fair Market

    7:00PM-10:00PM

    • AnacondaCON Carne Offsite Party @ Fair Market

      Located just east of downtown, Fair Market is the ideal setting for our AnacondaCON Carne offsite party, where you can network with your peers and experience some of the best food Austin has to offer. Popular food trucks Veracruz and Slab will be serving up Austin classics, and if you've saved any room, create your own s'mores for dessert! Local favorite DJ Chino Casino will be with us all night long to set the scene.

    Day Four

    Wednesday, April 11

    9:00AM-10:00AM

    Breakfast/Registration

    4th Floor Lobby

    10:00AM-10:50AM

    • Anaconda Distribution Roadmap

      Anaconda Crystal Soja

      We will cover the six parts of the Anaconda Distribution and how they are connected. These include the Anaconda and Miniconda installers, repo.anaconda.com, anaconda.org, Anaconda Navigator, conda, and conda-build. We will then delve into the upcoming plans and release cadences for each aspect. We will also cover additional themes like signed packages and an improved conda/pip/wheels user experience that will impact multiple parts of the Anaconda Distribution.

    • IoT Predictive Maintenance using Recurrent Neural Networks

      Real World Justin Brandenburg

      The idea behind predictive maintenance is that the failure patterns of various types of equipment are predictable. If an organization can accurately predict when a piece of hardware will fail, and replace that component before it fails, it can achieve much higher levels of operational efficiency. With many devices now including sensor data and other components that send diagnosis reports, predictive maintenance using big data is increasingly accurate and effective. Given this, how can we enhance our data monitoring to predict the next event? This talk will present an actual use case in the IoT Industry 4.0 space. Justin will present an entire workflow of data ingestion, bulk ETL, data exploration, model training, testing, and deployment in a real-time streaming architecture that can scale. He will demonstrate how he used Anaconda Python 3.5 and PySpark 2.1.0 to wrangle data and train a recurrent neural network to predict whether the next event in a real-time stream indicated that maintenance was required.
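
      The modeling step alone might look like this hedged Keras sketch (window size, sensor count, and labels are illustrative, not from Justin's pipeline):

        import numpy as np
        from tensorflow.keras import layers, models

        timesteps, sensors = 50, 8  # window of 50 readings from 8 sensors
        X = np.random.randn(1000, timesteps, sensors).astype("float32")
        y = np.random.randint(0, 2, 1000)  # 1 = maintenance was required next

        model = models.Sequential([
            layers.LSTM(32, input_shape=(timesteps, sensors)),
            layers.Dense(1, activation="sigmoid"),
        ])
        model.compile(optimizer="adam", loss="binary_crossentropy",
                      metrics=["accuracy"])
        model.fit(X, y, epochs=3, batch_size=64, validation_split=0.2)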

    • Accelerating Deep Learning with GPUs

      Open Source Stan Seibert

      GPU acceleration has become critical for deep learning as models have become more complex and datasets have grown in size. Although not initially designed for deep learning, GPUs were recognized early on as having an architecture well-adapted to speeding up the massively parallel array calculations at the heart of deep learning. Speedups of 10x or more during the training process are often seen using GPUs, and many models can be scaled up to use multiple GPUs. GPU manufacturers, like NVIDIA, are starting to release GPUs with deep learning-specific features to further speed up model training and improve the throughput of deployed models. Installing and deploying GPU accelerated code can be challenging, so Anaconda has curated the most popular deep learning frameworks and packaged them with GPU acceleration in the Anaconda Distribution. There they can be combined with your favorite Python packages—including Pandas, Dask, and Jupyter—to power data science experiments and production deployments.

    11:00AM-11:50AM

    • conda Deep Dive

      Anaconda Kale Franz

      In this talk, we'll dive deep into the guts of what makes conda so special. We'll look at the full life cycle of a conda package, the details of how a package is constructed, and the details of how a package gets installed into an environment. We’ll walk through the three stages of core conda operations, including the conda solver, and also how packages are transactionally installed into environments. This talk will give you a richer understanding of exactly what conda does, why conda does it, and how.

    • Setting Big Data on Fire: the FireCARES and NFORS projects

      Real World Craig Weinschenk

      Local government decision-makers often alter fire department resources faster than fire service leaders can evaluate the potential impact. These decisions can leave a community without sufficient resources to respond to emergency calls safely, efficiently, and effectively. The Fire Community Assessment/Response Evaluation System (FireCARES) provides fire departments the ability to add a technical basis to what historically has been an anecdotal discussion regarding community hazards and risks as well as the impact of changes to fire department resource levels. To accomplish this task, FireCARES provides three scores for each community based on the available data: the Community Risk Score, the Fire Department Performance Score, and the Safe Grade. These scores are generated from exploiting an expansive, multi-layered data set combining fire incidents, outcomes, and community risk characteristics. Fire incident data is not without flaws as it primarily relies on firefighters for data entry. Additionally, on the national level there is a two-year data lag. To overcome this obstacle, we have built the National Fire Operations Reporting System (NFORS), a real-time data analysis tool which leverages modern data practices while removing firefighters from data entry.

    • Beyond Neural Networks: What You're Missing Out On in TensorFlow

      Open Source Joseph Nelson

      Google's open source machine learning library, TensorFlow, is a well-known tool for building neural networks and is available via conda-forge. But ever since the 1.3 release (Fall 2017), Google has introduced many underutilized features that simplify the data science workflow. In this session, Joseph will introduce the high-level benefits of TensorFlow and walk through live examples exploring: (1) the Datasets API for seamlessly reading in datasets larger than memory; (2) TensorFlow Estimators, simple pre-packaged machine learning models comparable to those found in sklearn; and (3) TensorFlow Eager Execution Mode, which enables simpler debugging. Along the way, Joseph will solve real-world data science problems and highlight opportunities for open source contribution.
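
      For example, the Datasets API batches and prefetches input so a dataset never has to fit in memory at once; a minimal sketch with in-memory arrays standing in for large files:

        import numpy as np
        import tensorflow as tf

        features = np.random.randn(10_000, 4).astype("float32")
        labels = np.random.randint(0, 2, 10_000)

        dataset = (tf.data.Dataset.from_tensor_slices((features, labels))
                   .shuffle(1024)
                   .batch(256)
                   .prefetch(1))  # overlap input preparation with training

        for x, y in dataset.take(1):
            print(x.shape, y.shape)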

    11:50AM-1:00PM

    Lunch

    1:00PM-1:50PM

    • If You’re Not Doing A/B Testing with Advanced Analytics, You’re Doing It Wrong

      Anaconda Krissy Tripp

      By leveraging applied statistical methods such as hazard models, ANOVA, clustering, and regression, we can maximize our insights and the business impacts of A/B Testing. This talk will show you how advanced statistical methods can evolve every step of the testing process. We’ll use real-world business problems, A/B test situations, and Anaconda to illustrate how A/B testing and data science teams can collaborate to deliver amazing customer experiences, uncover novel consumer insights, and scale the process to enterprise levels.

    • Causal Inference in Tech

      Real World Jenny Lin

      This session deals with how to conscientiously approach causal inference in large, messy data sets common in tech, in the absence of an experiment (or when experimental setup was not ideal). In the real world, correlation is sometimes not enough basis for a million-dollar business decision. That's where causal inference comes in. Causal inference establishes a causal link between effect X and outcome Y and is often necessary for making critical and expensive business choices. Many pitfalls exist that render "simple" causal analyses entirely misleading and potentially costly. Here, Jenny will discuss some of the approaches taken at Yelp in determining causality when faced with a common question across tech firms: how do we know that our implementation of feature X caused an effect on metric Y and what was the size of the effect? Factors to correct for when extrapolating causality include: selection bias into the comparison groups, time trends in the outcome feature, time period mismatches across observations, addressing multicollinearity, clustering standard errors, and more! Jenny will walk through a stylized example of a causal inference problem she ran into at Yelp and showcase how one can easily arrive at a very misleading conclusion when not correcting for the aforementioned issues.
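
      Two of those corrections, time-trend controls and cluster-robust standard errors, fit in a few lines of statsmodels; a hedged sketch with hypothetical column names:

        import numpy as np
        import pandas as pd
        import statsmodels.formula.api as smf

        rng = np.random.default_rng(0)
        df = pd.DataFrame({
            "metric_y": rng.normal(size=500),
            "feature_x": rng.integers(0, 2, 500),     # treatment indicator
            "time_trend": np.arange(500),             # control for time trends
            "business_id": rng.integers(0, 50, 500),  # cluster identifier
        })

        fit = smf.ols("metric_y ~ feature_x + time_trend", data=df).fit(
            cov_type="cluster", cov_kwds={"groups": df["business_id"]})
        print(fit.summary())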

    • Conda, Docker & Kubernetes: The Cloud-Native Future of Data Science

      Open Source Mathew Lodge

      The days of deploying Java code to Hadoop and Spark data lakes for data science and ML are numbered. Learn how you can deploy your Python and R data science apps on a Kubernetes-managed container cluster and just access the data lake over your modern network.

    2:00PM-2:50PM

    • Using Machine Learning to Drive Sales: An Introduction to Anaconda Enterprise

      Anaconda Gus Cavanaugh

      It’s April 15, 1912. Reports are streaming into New York that the RMS Titanic just sank. You are in the mortuary business in New York City. You know the White Star Line is paying for all funeral services for the deceased, but you only have a partial list of passengers with identified outcomes (perished/survived). Your competitors all have access to the same initial reports. Can you predict which of the remaining passengers on the manifest are most likely to have perished? If so, you can then contact their families first (before the other funeral homes) and sell them your services. This presentation will cover how to use the passenger data with identified outcomes (perished/survived) to predict the outcome of the remaining passengers using Anaconda Enterprise. You’ll learn how to approach this problem with simple pivot tables and then build a predictive model using machine learning—all readily available and easy to use inside of Anaconda Enterprise.
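
      The two steps described, a pivot table and then a simple model, might look like this hedged pandas/scikit-learn sketch (the file name and column names are illustrative):

        import pandas as pd
        from sklearn.ensemble import RandomForestClassifier

        df = pd.read_csv("titanic_manifest.csv")  # hypothetical partial manifest
        known = df[df["survived"].notna()]
        unknown = df[df["survived"].isna()]

        # Step 1: a pivot table of survival rates by sex and class.
        print(pd.pivot_table(known, values="survived", index="sex", columns="pclass"))

        # Step 2: a model trained on known outcomes scores the rest of the manifest.
        cols = ["pclass", "age", "fare"]
        X = known[cols].fillna(known[cols].median())
        clf = RandomForestClassifier(random_state=0).fit(X, known["survived"].astype(int))
        p_perish = 1 - clf.predict_proba(unknown[cols].fillna(X.median()))[:, 1]
        print(unknown.assign(p_perish=p_perish).nlargest(5, "p_perish"))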

    • Building a GPU-Focused CI Solution

      Real World Mike Wendt

      As the number of GPU-accelerated applications has multiplied, the need for better development tools and services has increased as well. Chief among such services is continuous integration (CI), which dramatically improves and speeds up the development lifecycle through automated builds and integration testing. CI for GPU-accelerated applications comes with its own set of challenges, but the rewards can be enormous. Join NVIDIA’s team as they walk through how they implemented CI by leaning on open source technologies such as conda, Docker, and Jenkins, the lessons they learned in the process, and how other such systems should be built in the future.

    • Solving for Data Access in Data Science

      Open Source Jacques Nadeau

      In this talk, we'll provide an overview of Dremio and Arrow, and outline how projects like Pandas are utilizing Arrow to achieve high performance data processing and interoperability across systems. We'll then show how companies can utilize Arrow to enable users to access and analyze data across disparate data sources without having to physically consolidate it into a centralized data repository. Finally, we'll discuss the 12-month roadmap of Apache Arrow, including Arrow Kernels, an exciting area of development in the Arrow community.

    3:00PM-3:50PM

    • Closing Keynote

      David Yeager, Expert in Grit and Growth Mindset

      After three days filled with data science and machine learning, we will conclude our program with human learning: Dr. David Yeager, Assistant Professor of Developmental Psychology at the University of Texas at Austin, is a renowned expert on grit and how growth mindsets can motivate people to achieve excellence. In this keynote, Dr. Yeager will outline key insights from the new science of motivation and learning.

  • View by Track

    Anaconda  |   Real World  |   Open Source

    Day Two

    Monday, April 9

    11:00AM-11:50AM

    • Quick and Easy TensorFlow on AE5

      Anaconda Michael Grant

      In this talk, we will show you how to make the most of TensorFlow under Anaconda Enterprise 5. Creating AE5 projects with TensorFlow is a snap, thanks to the Anaconda packaging ecosystem. We'll show you how to take an à la carte approach to its capabilities—for instance, by feeding data directly from your Pandas-based pipelines into a TensorFlow model. And of course, we'll give you an overview of the learning methods available in the package, and demonstrate several of them in operation on AE5.

    1:00PM-1:50PM

    • Deploying Python and R to Spark and Hadoop

      Anaconda Daniel Rodriguez

      Anaconda Enterprise provides an easy yet powerful architecture that allows people to connect from interactive sessions and deployments to running Spark clusters from Python and R. We will take a look at the Anaconda Enterprise 5 architecture for connecting to Hadoop/Spark clusters powered by Sparkmagic and Apache Livy (incubating), while taking a look at the benefits of this architecture and how it allows users to securely and easily connect to remote Hadoop/Spark clusters. We also will look at how Anaconda Enterprise enables users to do runtime distribution of custom Anaconda installers using Cloudera Parcels and Ambari Management Packs, allowing data scientists to ship Anaconda environments and leverage libraries from Anaconda.

    2:00PM-2:50PM

    • Architecting AE5 Deployments

      Anaconda Kris Overholt

      We’ll discuss top-of-mind administrative considerations for implementing Anaconda Enterprise in your environments and how to move between multiple IT environments, such as testing, QA, and production. Discover best practices for network, security, and performance that allow Anaconda Enterprise to connect to existing infrastructure, security, and compute/storage resources.

    3:10PM-4:00PM

    • Enterprise Package Governance

      Anaconda Duane Lawrence

      While the benefits of open source are clear to data scientists, enterprise IT administrators have some concerns. Just who exactly authored these open source packages? How can we be sure that they are secure? When data science teams create their own packages, how can these internal tools be shared and governed securely? This talk will walk through best practices for enterprise package governance. IT admins will learn how to securely manage open source packages, strategies for whitelisting and blacklisting, and how to easily share internal packages securely.

    4:10PM-5:00PM

    • Using Advanced Analytics to Develop a Risk-Based Approach to Asset Management

      Anaconda Will Collins

      In this talk, Will is going to illustrate how, by embracing data science and open source tools within the Anaconda platform, National Grid is building a risk-based framework to effectively manage its Electricity Transmission assets. Attendees will learn how National Grid's Analytics team has used Python to drive efficiencies within the business, developing tools that allow engineers to streamline their processes.

    Day Three

    Tuesday, April 10

    10:00AM-10:50AM

    • conda: Tips & Tricks

      Anaconda Kale Franz

      Anaconda solved one of the most headache-inducing problems in data science—overcoming the dependency nightmare—through the power of conda. Powering Anaconda Distribution at its core, our open source, cross-platform package and environment manager is beloved by data scientists because it enables them to easily install and manage their favorite packages from Python and other languages. In addition, data scientists can create virtual conda environments, in which they are free to install Python, pip, and any number of other packages. This means a data scientist can easily flip between projects using Python 2 and Python 3, for example. Kale will demonstrate these key features and many more.

    11:00AM-11:50AM

    • Scalable Machine Learning with Dask

      Anaconda Tom Augspurger

      Scikit-Learn, NumPy, and pandas form a great toolkit for single-machine, in-memory analytics. Scaling them to larger datasets can be difficult, as you have to adjust your workflow to use chunking or incremental learners. Dask provides NumPy- and pandas-like data containers for manipulating larger than memory datasets, and dask-ml provides estimators and utilities for modeling larger than memory datasets. These tools scale your usual workflow out to larger datasets. We’ll discuss some of the challenges data scientists run into when scaling out to larger datasets. We’ll then focus on demonstrations of how dask and dask-ml solve those challenges. We’ll see examples of how dask can expose a cluster of machines to scikit-learn’s built-in parallelization framework. We’ll see how dask-ml can train estimators on large datasets.

    1:00PM-1:50PM

    • Access All of Your Data with Anaconda

      Anaconda Albert DeFusco

      What do you do when you have data on Hadoop, SQL databases, CSV files, JSON files, SAS, and online? Learn how to utilize Anaconda to read data from multiple sources to build models and visualizations.

    2:00PM-2:50PM

    • Model Management with Anaconda Enterprise

      Anaconda Michael Grant

      In this talk, we present an approach that leverages best practices of source code management combined with Anaconda's conda package management system to deliver a model management workflow sufficiently flexible to cover a wide range of applications. Our aim is to cover all but the most lightweight, loosely governed models. The features of our approach include: manageable separation of training data, model infrastructure, and model design; leveraging source control for model versioning, approval processes, and experimental models; building conda packages for reliable, reproducible model deployment; and using Anaconda repositories to ensure historical model archiving.

    3:10PM-4:00PM

    • PyViz: Dashboards for Visualizing 1 Billion Datapoints in 30 lines of Python in Your Browser on Your Laptop

      Anaconda James Bednar

      James will present an overall workflow for building interactive dashboards visualizing billions of data points interactively in a Jupyter notebook, with graphical widgets allowing control over data selection, filtering, and display options, all using only a few dozen lines of code.

    4:10PM-5:00PM

    • How to Optimize Physical Assets Through ML-Powered Predictive Maintenance

      Anaconda Sourav Dey & Rajendra Koppula

      With decreasing sensor, communication, storage and compute costs, it is now possible to collect vast amounts of data from physical assets like mechanical equipment, heavy earth moving equipment, and factory assembly lines. With recent advances, companies can now apply Machine Learning to all that data to optimize physical assets with the goal of achieving zero downtime. Using a real world case study, Sourav will demonstrate: 1) how to sample a time series dataset to generate training and validation datasets; 2) how to deal with non-uniformly sampled time series and other data quality issues; 3) how to use Dask to parallelize compute intensive feature creation to handle the volume and velocity of sensor data; 4) how to apply classic ML techniques like Random Forest trees; and 5) results and outcomes.

    Day Four

    Wednesday, April 11

    10:00AM-10:50AM

    • Anaconda Distribution Roadmap

      Anaconda Crystal Soja

      We will cover the six parts of the Anaconda Distribution and how they are connected. These include the Anaconda and Miniconda installers, repo.anaconda.com, anaconda.org, Anaconda Navigator, conda, and conda-build. We will then delve into the upcoming plans and release cadences for each aspect. We will also cover additional themes like signed packages and an improved conda/pip/wheels user experience that will impact multiple parts of the Anaconda Distribution.

    11:00AM-11:50AM

    • conda Deep Dive

      Anaconda Kale Franz

      In this talk, we'll dive deep into the guts of what makes conda so special. We'll look at the full life cycle of a conda package, the details of how a package is constructed, and the details of how a package gets installed into an environment. We’ll walk through the three stages of core conda operations, including the conda solver, and also how packages are transactionally installed into environments. This talk will give you a richer understanding of exactly what conda does, why conda does it, and how.

    1:00PM-1:50PM

    • If You’re Not Doing A/B Testing with Advanced Analytics, You’re Doing It Wrong

      Anaconda Krissy Tripp

      How often are we throwing away insights because checking a p-value is easy? Our A/B Tests offer a wealth of data that we’re not taking advantage of when all we do is evaluate a p-value. By leveraging applied statistical methods such as hazard models, ANOVA, clustering, and regression, we can maximize our insights and the business impacts of A/B Testing. This talk will show you how advanced statistical methods can evolve every step of the testing process. As customer demands evolve, so too should our methods of optimization. Innovative enterprises are beyond using A/B tests to evaluate website color schemes. Rather, they are using them to gain valuable consumer behavior knowledge in order to inform big bets and deliver personalized experiences. We’ll use real-world business problems, A/B test situations, and Anaconda to illustrate how A/B testing and data science teams can collaborate to deliver amazing customer experiences, uncover novel consumer insights, and scale the process to enterprise levels.
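
      As one concrete example of moving past a single p-value, an ANOVA across variants plus a covariate-adjusted regression (synthetic data, hypothetical column names):

        import numpy as np
        import pandas as pd
        import statsmodels.formula.api as smf
        from scipy import stats

        rng = np.random.default_rng(1)
        df = pd.DataFrame({"variant": np.repeat(["A", "B", "C"], 300),
                           "tenure": rng.integers(1, 36, 900)})
        df["revenue"] = rng.normal(10, 2, 900) + (df["variant"] == "B") * 0.5

        # One-way ANOVA across the three variants.
        print(stats.f_oneway(*[g["revenue"] for _, g in df.groupby("variant")]))

        # A regression recovers per-variant lift while adjusting for a covariate.
        print(smf.ols("revenue ~ C(variant) + tenure", data=df).fit().params)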

    2:00PM-2:50PM

    • Fraud Prevention in Financial Services: Advanced Anaconda Enterprise

      Anaconda Gus Cavanaugh

      Fraud is an expensive problem for financial services companies. In this talk, we’ll review credit card transactions and use machine learning to predict which transactions are fraudulent. Our sample data is highly skewed—less than 1% of the transactions in our training data are fraudulent. As a machine learning challenge, this talk will demonstrate how to approach classification problems with highly unbalanced data. We will explore the data using DataFrames via the pandas library, after which we’ll build a machine learning model using scikit-learn. We will analyze the results of our model by computing the Area Under the Precision-Recall Curve (AUPRC). Through this data science project, attendees will learn how Anaconda Enterprise makes reproducibility, collaboration, and deployment simple for data science teams. This talk will discuss advanced features of Anaconda Enterprise for security and administration.
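
      A hedged sketch of the evaluation described, with synthetic data skewed to roughly 1% positives standing in for real transactions:

        from sklearn.datasets import make_classification
        from sklearn.linear_model import LogisticRegression
        from sklearn.metrics import average_precision_score
        from sklearn.model_selection import train_test_split

        X, y = make_classification(n_samples=50_000, weights=[0.99],
                                   flip_y=0, random_state=0)
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

        clf = LogisticRegression(class_weight="balanced", max_iter=500).fit(X_tr, y_tr)
        proba = clf.predict_proba(X_te)[:, 1]
        # average_precision_score approximates the area under the PR curve (AUPRC).
        print("AUPRC:", average_precision_score(y_te, proba))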

    Day Two

    Monday, April 9

    11:00AM-11:50AM

    • Fast Feature Evaluation with MCPT and Numba

      Real World David Patschke

      One of the challenges when working with high-dimensional data is quickly being able to find the independent variables that most strongly influence the dependent variable. Unfortunately, the higher the dimensionality, the more likely that any influential independent variable selected may be influential by random luck. Enter Monte Carlo Permutation Testing (MCPT). By permuting the dependent variable many times, calculating information measures on these permutations, and comparing these measures to the actual information measure, a practitioner can be more confident that a selected variable truly will be informative. Until recently, the ability to execute something like this efficiently within Python was rather challenging. However, with the recently added ParallelAccelerator functionality within Numba, this can be executed in a single memory space at a blisteringly fast pace all in native Python. The goal of this talk is not only to introduce the concept of MCPT, but to inspire others to explore using the ParallelAccelerator functionality within Numba for significant speed advantages.

    1:00PM-1:50PM

    • Achoo: Using Machine Learning to Fight My Son’s Asthma

      Real World Tim Dobbins

      Achoo uses a Raspberry Pi to predict if Tim’s son will need his inhaler on any given day using weather, pollen, and air quality data. If the prediction for a given day is above a specified threshold, the Pi will email both Tim and the school nurse, notifying her that he may need preemptive treatment. The system is designed to be language-agnostic with regard to the predictive models used. The backend is built with Python/Flask.

    2:00PM-2:50PM

    • Accelerating Scientific Workloads with Numba

      Real World Siu Kwan Lam

      This talk introduces Numba and demonstrates how its core features (e.g. JIT compilation, automatic parallelization, GPU programming) can speed up scientific workloads with minimal code changes. We also will discuss the upcoming features of Numba.

    3:10PM-4:00PM

    • Deep Learning Using Python on Anaconda Distribution to Identify Distracted Driver Behavior

      Real World Sripriya Sundararaman

      I will present the experiences our team has had using Deep Learning on the Anaconda framework. Our solution involves building CNN models on 2D and 3D images of drivers to identify and classify distracted driving behavior. The talk will discuss the challenges, the algorithms we used, and Python solutions for collecting data from the 3D camera sensors and 2D camera, preprocessing data, creating our models, and predicting outcomes.

    4:10PM-5:00PM

    • Data Engineering for Data Scientists

      Real World Max Humber

      When models and data applications are pushed to production, they become brittle black boxes that can and will break. In this talk you’ll learn how to one-up your data science workflow with a little engineering! Or more specifically, how to improve the reliability and quality of your data applications... all so that your models won’t break (or at least won’t break as often)! Examples for this session will be in Python 3.6+ and will rely on: logging to allow us to debug and diagnose things while they’re running, Click to develop “beautiful” command line interfaces with minimal boilerplate, and Pytest to write short, elegant, and maintainable tests.

    Day Three

    Tuesday, April 10

    10:00AM-10:50AM

    • Building Better Badass Cars

      Real World Peter Buschbacher

      Cars are incredibly difficult to manufacture. The fracture between IT and Business has forced a lot of analytical development down the drain in the past. However, with current capabilities for data extraction, analysis, and computation, vehicle production is being continuously improved upon. This talk will focus on how analysis, questioning, and foundational data science have helped plant managers across the globe bring solutions to difficult problems in the manufacturing sphere.

    11:00AM-11:50AM

    • Machine Learning Crash Course

      Real World Samuel Taylor

      Machine learning is surrounded by so much hype it can seem like magic. Learn the math behind the magic in this whirlwind tour of machine learning. After spending a few minutes learning about machine learning theory, we'll jump right into practice with three different use cases: teaching a computer sign language (supervised learning); predicting hourly energy load in the state of Texas (time series/forecasting); and using machine learning to find your next job (recommender systems—content-based filtering). With each use case, we'll discover new techniques applicable in real-world machine learning problems.

    1:00PM-1:50PM

    • Learning From Our Learners

      Real World Jonathan Cornelissen

      DataCamp is building the future of data science education. We’ll take a quick look at our learning interfaces and introduce our learning philosophy. Moreover, we’ll have a look at interesting data from DataCamp students (now over 2.3 million) and how we use that data to improve the learning experience. We’ll conclude the presentation with a deep dive into a new product we’re launching in collaboration with an amazing partner.

    2:00PM-2:50PM

    • What You Gonna Do With All That Malware: Malware Analysis and Machine Learning When You Can’t Fit It All On One Server

      Real World Austin West & Drew Bonasera

      MultiScanner is an open source malware analysis framework that assists the user in evaluating a set of files by automatically running a suite of tools and aggregating the output. The true power of this system is that it stores all the outputs from all of an analyst’s malware analysis tools in one highly performant, searchable, and scalable data store. This talk will focus on one such analytic known as Exe-MANA. Exe-MANA is a deep neural network written entirely in Python for detecting if a Portable Executable file is malicious or benign using only static analysis. Exe-MANA is a great example of how easy it is to prototype data science techniques in Python with little to no experience in data science. Austin and Drew will go over the basic process for building Exe-MANA, and how they leverage MultiScanner to help speed up this process and continue the training process as they get new data.

    3:10PM-4:00PM

    • How to Make Your Data Scientists Happy - a Use-case Backed Approach for Enabling Data Science in the Enterprise

      Real World Hussain Sultan & Tim Horan

      Enabling data scientists within an enterprise requires a well-thought out approach from an organization, technology, and business results perspective. In this talk, Tim and Hussain will share common pitfalls to data science enablement in the enterprise and provide their recommendations to avoid them. Taking an example, actionable use case from the financial services industry, they will focus on how Anaconda plays a pivotal role in setting up big data infrastructure, integrating data science experimentation and production environments, and deploying insights to production. Along the way, they will highlight opportunities for leveraging open source and unleashing data science teams while meeting regulatory and compliance challenges.

    4:10PM-5:00PM

    • Apache Arrow: A Cross-Language Development Platform for In-Memory Data

      Real World Wes McKinney

      In this talk, Wes will discuss the ongoing community efforts since 2016 building Apache Arrow, a new open source project for high-performance analytics and data interoperability. Wes will explain the rationale for the project and the development that has occurred over the last two-and-a-half years. He will cover some downstream use cases in Apache Spark and other open source data processing projects, and also will look at the future roadmap for Arrow as it relates to data science and machine learning applications.

    Day Four

    Wednesday, April 11

    10:00AM

    10:50AM

    • IoT Predictive Maintenance using Recurrent Neural Networks

      Real World Justin Brandenburg

      The idea behind predictive maintenance is that the failure patterns of various types of equipment are predictable. If an organization can accurately predict when a piece of hardware will fail, and replace that component before it fails, it can achieve much higher levels of operational efficiency. With many devices now equipped with sensors and other components that send diagnostic reports, predictive maintenance using big data is increasingly accurate and effective. How, then, can we enhance our data monitoring to predict the next event? This talk will present an actual use case in the IoT/Industry 4.0 space. Justin will present an entire workflow of data ingestion, bulk ETL, data exploration, model training, testing, and deployment in a real-time streaming architecture that can scale. He will demonstrate how he used Anaconda Python 3.5 and PySpark 2.1.0 to wrangle data and train a recurrent neural network to predict whether the next event in a real-time stream indicated that maintenance was required.
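
      A minimal sketch of the recurrent-network piece, assuming fixed-length windows of sensor readings and a binary maintenance label (shapes and data are illustrative, not Justin's actual pipeline):

        import numpy as np
        from keras.models import Sequential
        from keras.layers import LSTM, Dense

        timesteps, n_sensors = 50, 8
        X = np.random.rand(500, timesteps, n_sensors)  # windows of sensor readings
        y = np.random.randint(0, 2, 500)               # 1 = maintenance required

        model = Sequential([
            LSTM(32, input_shape=(timesteps, n_sensors)),
            Dense(1, activation='sigmoid'),    # probability next event needs service
        ])
        model.compile(optimizer='adam', loss='binary_crossentropy')
        model.fit(X, y, epochs=3, batch_size=64)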

    11:00AM

    11:50AM

    • Setting Big Data on Fire: the FireCARES and NFORS projects

      Real World Craig Weinschenk

      Local government decision-makers often alter fire department resources faster than fire service leaders can evaluate the potential impact. These decisions can leave a community without sufficient resources to respond to emergency calls safely, efficiently, and effectively. The Fire Community Assessment/Response Evaluation System (FireCARES) provides fire departments the ability to add a technical basis to what historically has been an anecdotal discussion regarding community hazards and risks as well as the impact of changes to fire department resource levels. To accomplish this task, FireCARES provides three scores for each community based on the available data: the Community Risk Score, the Fire Department Performance Score, and the Safe Grade. These scores are generated by exploiting an expansive, multi-layered data set combining fire incidents, outcomes, and community risk characteristics. Fire incident data is not without flaws, as it relies primarily on firefighters for data entry. Additionally, at the national level there is a two-year data lag. To overcome these obstacles, we have built the National Fire Operations Reporting System (NFORS), a real-time data analysis tool that leverages modern data practices while removing firefighters from data entry.

    1:00PM

    1:50PM

    • Causal Inference in Tech

      Real World Jenny Lin

      This session deals with how to conscientiously approach causal inference in the large, messy data sets common in tech, in the absence of an experiment (or when the experimental setup was not ideal). In the real world, correlation is sometimes not enough basis for a million-dollar business decision. That's where causal inference comes in. Causal inference establishes a causal link between an action X and an outcome Y and is often necessary for making critical and expensive business choices. Many pitfalls exist that render "simple" causal analyses entirely misleading and potentially costly. Here, Jenny will discuss some of the approaches taken at Yelp in determining causality when faced with a question common across tech firms: how do we know that our implementation of feature X caused an effect on metric Y, and what was the size of the effect? Factors to correct for when extrapolating causality include selection bias into the comparison groups, time trends in the outcome feature, time-period mismatches across observations, multicollinearity, the need to cluster standard errors, and more! Jenny will walk through a stylized example of a causal inference problem she ran into at Yelp and showcase how easily one can arrive at a very misleading conclusion when not correcting for the aforementioned issues.
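
      One of the corrections above, cluster-robust standard errors, can be sketched with statsmodels on synthetic data; the column names and effect sizes here are hypothetical:

        # Panel with repeated observations per business; errors are clustered
        # by business so that within-business correlation does not overstate
        # the precision of the treatment estimate.
        import numpy as np
        import pandas as pd
        import statsmodels.formula.api as smf

        rng = np.random.RandomState(0)
        df = pd.DataFrame({
            'business_id': np.repeat(np.arange(100), 4),
            'treated': np.repeat(rng.randint(0, 2, 100), 4),
            'time_period': np.tile(np.arange(4), 100),
        })
        df['metric_y'] = (2 * df['treated'] + 0.5 * df['time_period']
                          + rng.normal(size=400))

        result = smf.ols('metric_y ~ treated + time_period', data=df).fit(
            cov_type='cluster', cov_kwds={'groups': df['business_id']})
        print(result.summary())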

    2:00PM

    2:50PM

    • Building a GPU-Focused CI Solution

      Real World Mike Wendt

      As the number of GPU-accelerated applications has multiplied, the need for better development tools and services has increased as well. Chief among such services is continuous integration (CI), which dramatically improves and speeds up the development lifecycle through automated builds and integration testing. CI for GPU-accelerated applications comes with its own set of challenges, but the rewards can be enormous. Join NVIDIA's team as they walk through how they implemented CI by leaning on open source technologies such as Conda, Docker, and Jenkins, the lessons they learned in the process, and how other such systems should be built in the future.

    Day Two

    Monday, April 9

    11:00AM

    11:50AM

    • Deep Learning with Just a Little Bit of Data

      Open Source Michael Bernico

      There’s no question that deep learning is changing the field of machine learning at an extremely rapid pace. Given enough data, deep learning can solve problems we couldn’t imagine just a few years ago. But what do we do when there isn’t enough data? Can we still apply deep learning when we have only hundreds, or thousands, of data points? In this talk we will discuss doing deep learning with very little data. We will discuss transfer learning, which we find to be immensely useful for the business applications of deep learning. Finally, we will present some original research that shows just how far we can go with transfer learning on very small volumes of data.
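
      A common form of the transfer-learning recipe, sketched in Keras (the layer sizes and binary task are illustrative, not the speaker's research setup): freeze an ImageNet-trained convolutional base and train only a small head on the new, small dataset.

        from keras.applications import VGG16
        from keras.models import Model
        from keras.layers import Dense, GlobalAveragePooling2D

        base = VGG16(weights='imagenet', include_top=False,
                     input_shape=(224, 224, 3))
        for layer in base.layers:
            layer.trainable = False            # keep the pretrained features intact

        x = GlobalAveragePooling2D()(base.output)
        x = Dense(64, activation='relu')(x)
        out = Dense(1, activation='sigmoid')(x)  # e.g., a binary business task

        model = Model(inputs=base.input, outputs=out)
        model.compile(optimizer='adam', loss='binary_crossentropy')
        # model.fit(...) on the small labeled dataset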

    1:00PM

    1:50PM

    • DUKE: Dataset Understanding through Knowledge-base Embeddings

      Open Source David Sullivan

      "DUKE: Dataset Understanding through Knowledge-base Embeddings" produces abstractive descriptions of datasets based on word2vec model trained on wikipedia paired with a curated ontology. For those familiar with word2vec, you can think of DUKE as essentially "dataset2vec". This talk will discuss the technology behind DUKE, how DUKE can be used to improve the data science and data engineering process, and how the audience can download and use the software.

    2:00PM

    2:50PM

    • Getting Started with Anaconda Distribution

      Open Source Albert DeFusco

      What is it that has made the Anaconda Distribution so successful, and how can I make the best use of it? This talk will take you on a tour of the Anaconda Distribution and show you how to do powerful open source data science using the tools and libraries included in Anaconda.

    3:10PM

    4:00PM

    • Production-Grade Packaging with Anaconda

      Open Source Mahmoud Hashemi

      Anaconda always has been a powerful platform for data analysts and scientists across the Python world. The same reasons that make it work for those groups also apply to engineers building and shipping scalable services: easy access to prebuilt packages, including system packages not managed by pip, and other packages not conveniently provided by the operating system. This talk will cover using conda and conda envs in real-world industrial settings, what makes conda special for software engineers, and the challenges and goals of packaging. Mahmoud will provide real-world examples using Anaconda to build an OS package (RPM) and Docker images.

    4:10PM

    5:00PM

    • Convolutional Neural Networks (CNNs): A Game-Changer for Computer Vision

      Open Source Tassos Sarbanes

      Many call visual data the “dark matter” of the internet. Many disciplines surround computer vision, such as biology, physics, psychology, mathematics, and computer science. This talk will present a brief history of the evolution and revolution of computer vision and image processing. We’ll review important studies, including Larry Roberts’s Block World, the Summer Vision Project at MIT, Vision by David Marr, Explaining Visual Science by David Lowe, Normalized Cut by Shi & Malik, and Face Detection by Viola & Jones. The focus of the talk will be the introduction of convolutional neural networks (CNNs) and their huge impact on the computer vision space. Code examples in Jupyter notebooks will be presented using the PASCAL Visual Object Classes and ImageNet (WordNet) datasets. We’ll also cover general topics in machine learning and deep learning related to visual recognition, such as object detection, action classification, and image captioning.
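
      For reference, a minimal CNN in Keras of the kind the talk introduces; the architecture is illustrative, not from the speaker's notebooks:

        from keras.models import Sequential
        from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

        model = Sequential([
            Conv2D(32, (3, 3), activation='relu', input_shape=(64, 64, 3)),
            MaxPooling2D((2, 2)),              # downsample, keep salient features
            Conv2D(64, (3, 3), activation='relu'),
            MaxPooling2D((2, 2)),
            Flatten(),
            Dense(128, activation='relu'),
            Dense(20, activation='softmax'),   # e.g., 20 PASCAL VOC classes
        ])
        model.compile(optimizer='adam', loss='categorical_crossentropy')
        model.summary()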

    Day Three

    Tuesday, April 10

    10:00AM

    10:50AM

    • Building a Data Science Team using Open Source Data Science

      Open Source Katrina Riehl

      Open source data science technologies have changed the face of building and operating a data science organization. In this talk, Katrina will explore how and why open source technologies are necessary for the success of businesses hoping to use data science and machine learning to power innovation. She will discuss how HomeAway.com is using tools like Anaconda, conda, and other Python-powered open source libraries to change how they look at their market and stay competitive. She will also discuss her journey in making Python a first-class citizen in a traditionally Java-based organization while growing a data science team from the ground up.

    11:00AM

    11:50AM

    • GPU-Accelerating UDFs in PySpark with Numba and PyGDF

      Open Source Joshua Patterson & Keith Kraus

      With advances in computer hardware such as 10-gigabit network cards, InfiniBand, and solid-state drives all becoming commodity offerings, the new bottleneck in big data technologies is very commonly the processing power of the CPU. To meet the computational demand of their users, enterprises have had to resort to extreme scale-out approaches just to get the processing power they need. One of the best-known technologies in this space, Apache Spark, has numerous enterprises publicly talking about the challenges of running multiple 1000+ node clusters to give their users the processing power they need. This talk is based on work completed by NVIDIA’s Applied Solutions Engineering team. Attendees will learn how they were able to GPU-accelerate UDFs in PySpark using open source technologies such as Numba and PyGDF, the lessons they learned in the process, and how they were able to accelerate workloads in a fraction of the hardware footprint.
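
      The core trick can be sketched with Numba alone: compile a NumPy-style UDF for the GPU. The PySpark/PyGDF glue that keeps columns on the device is omitted here, and the column semantics are hypothetical:

        import numpy as np
        from numba import vectorize

        @vectorize(['float32(float32, float32)'], target='cuda')
        def fare_per_mile(fare, distance):
            return fare / distance             # runs elementwise on the GPU

        fares = np.random.rand(1000000).astype(np.float32)
        miles = np.random.rand(1000000).astype(np.float32) + 0.1
        result = fare_per_mile(fares, miles)   # arrays move to/from the device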

    1:00PM

    1:50PM

    • Real-Time Processing with Dask

      Open Source Matt Rocklin

      Dask is a tool for parallel and distributed processing in Python often known for parallelizing subsets of libraries like NumPy, Pandas, and Scikit-Learn. However, Dask also includes a low-level real-time task scheduler capable of asynchronous computation. This talk describes how to leverage the internal engine of Dask to build responsive distributed systems that react to real-world events with computation in a resilient and scalable manner. We will start with a foundational futures API, and build on that to include async-await functionality and streaming dataframes.
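
      A taste of the futures API the talk builds on, assuming a local dask.distributed cluster:

        from dask.distributed import Client, as_completed

        client = Client()                      # local scheduler and workers

        def process(record):
            return record * 2                  # stand-in for real event handling

        futures = [client.submit(process, i) for i in range(10)]
        for future in as_completed(futures):   # react as each result lands
            print(future.result())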

    2:00PM

    2:50PM

    • DeepFashion: Building a REST API to Detect Clothing Styles

      Open Source Paige Bailey

      In this hands-on tutorial, we’ll use image recognition to take an existing deep learning model and adapt it to our own specialized domain (namely: guessing whether articles of clothing are preppy, sporty, punk, etc.). Instead of using a more data-intensive classifier, like a residual network, we’ll use deep transfer learning to overcome our data scarcity problem and build on top of an existing model. Once our transfer learning model has been trained, we’ll package it in a Docker container (specifying inputs and outputs, as well as a score.py file), and then call it as a web service. We will also discuss a #DataOps process for refreshing the model as trends change over time. By the end of this talk, you (or at least your model!) will know how to select the perfect outfit for any occasion.
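
      The web-service step might look roughly like this Flask sketch; the route, names, and the trivial classify stand-in are hypothetical, not the tutorial's actual score.py:

        from flask import Flask, request, jsonify

        app = Flask(__name__)

        def classify(image_bytes):
            # placeholder for the trained transfer-learning model's predict()
            return 'preppy'

        @app.route('/score', methods=['POST'])
        def score():
            image_bytes = request.files['image'].read()
            return jsonify({'style': classify(image_bytes)})

        if __name__ == '__main__':
            app.run(host='0.0.0.0', port=5000)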

    3:10PM

    4:00PM

    • Jumpstart Writing Continuous Applications with Structured Streaming Python APIs in Apache Spark

      Open Source Jules Damji

      We are in the midst of a Big Data Zeitgeist, in which data comes at us fast, in myriad forms and formats, at intermittent intervals or in a continuous stream, and we need to respond to it immediately. This need has created the notion of a streaming application that reacts to and interacts with data in real time. We call this a continuous application. In this talk we will explore the concepts and motivations behind continuous applications and how the Structured Streaming Python APIs in Apache Spark 2.x enable writing them. We also will examine the programming model behind Structured Streaming and the APIs that support it. Through a short demo and code examples, Jules will demonstrate how to write an end-to-end Structured Streaming application that reacts to and interacts with both real-time and historical data to perform advanced analytics using the Spark SQL, DataFrames, and Datasets APIs.
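
      The canonical starting point is a running word count over a socket source, a minimal continuous application in Spark 2.x:

        from pyspark.sql import SparkSession
        from pyspark.sql.functions import explode, split

        spark = SparkSession.builder.appName('wordcount').getOrCreate()

        lines = (spark.readStream.format('socket')
                 .option('host', 'localhost').option('port', 9999).load())
        words = lines.select(explode(split(lines.value, ' ')).alias('word'))
        counts = words.groupBy('word').count()  # updated as new lines arrive

        query = (counts.writeStream.outputMode('complete')
                 .format('console').start())
        query.awaitTermination()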

    4:10PM

    5:00PM

    • Building an Enterprise Analytics Competency Center (ACC)

      Open Source Ann Manchella & Jim Ogle

      In this presentation, walk through PNC’s two-year journey from creating an Enterprise Data Management Group to successfully redeveloping a forecasting model in Python. Travel with PNC through the nuances of platform ownership, empowering and training line-of-business users in Python, their engagement model with the lines-of-defense teams, the services offered by the analytics team, the challenges of adopting open source libraries and packages in a cybersecurity-sensitive banking industry, and much more.

    Day Four

    Wednesday, April 11

    10:00AM

    10:50AM

    • Accelerating Deep Learning with GPUs

      Open Source Stan Seibert

      GPU acceleration has become critical for deep learning as models have become more complex and datasets have grown in size. Although not initially designed for deep learning, GPUs were recognized early on as having an architecture well adapted to speeding up the massively parallel array calculations at the heart of deep learning. Speedups of 10x or more during training are often seen using GPUs, and many models can be scaled up to use multiple GPUs. GPU manufacturers, like NVIDIA, are starting to release GPUs with deep learning-specific features to further speed up model training and improve the throughput of deployed models. Installing and deploying GPU-accelerated code can be challenging, so Anaconda has curated the most popular deep learning frameworks and packaged them with GPU acceleration in the Anaconda Distribution. There they can be combined with your favorite Python packages, including Pandas, Dask, and Jupyter, to power data science experiments and production deployments.

    11:00AM

    11:50AM

    • Beyond Neural Networks: What You're Missing Out On in TensorFlow

      Open Source Joseph Nelson

      Google's open source machine learning library, TensorFlow, is a well-known tool for building neural networks and is available via conda-forge. But ever since the 1.3 release (Fall 2017), Google has introduced many underutilized features that simplify the data science workflow. In this session, Joseph will introduce the high-level benefits of TensorFlow and walk through live examples exploring: (1) the Datasets API for seamlessly reading in datasets larger than memory; (2) TensorFlow Estimators, simple pre-packaged machine learning models comparable to those found in sklearn; and (3) TensorFlow Eager Execution Mode, which enables simpler debugging. Along the way, Joseph will solve real-world data science problems and highlight opportunities for open source contribution.
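
      For orientation, here is a sketch of two of those features using the TensorFlow 1.x APIs of the time; the data and model are toy stand-ins:

        import numpy as np
        import tensorflow as tf

        X = np.random.rand(1000, 4).astype(np.float32)
        y = np.random.randint(0, 2, 1000)

        def input_fn():
            # Datasets API: stream batches lazily, so data larger than
            # memory can be read the same way.
            ds = tf.data.Dataset.from_tensor_slices(({'x': X}, y))
            return ds.shuffle(1000).batch(32).repeat(5)

        # A canned Estimator, comparable to an sklearn classifier.
        feature_cols = [tf.feature_column.numeric_column('x', shape=[4])]
        model = tf.estimator.DNNClassifier(hidden_units=[16, 8],
                                           feature_columns=feature_cols)
        model.train(input_fn=input_fn)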

    1:00PM

    1:50PM

    • Conda, Docker & Kubernetes: The Cloud-Native Future of Data Science

      Open Source Mathew Lodge

      The days of deploying Java code to Hadoop and Spark data lakes for data science and ML are numbered. Learn how you can deploy your Python and R data science apps on a Kubernetes-managed container cluster and just access the data lake over your modern network.

    2:00PM

    2:50PM

    • Solving for Data Access in Data Science

      Open Source Jacques Nadeau

      Nobody likes getting data ready for their models, but it is an essential step at the beginning of most data science projects. Data access is a massive challenge because data lives in many different formats, in many different storage systems, and each analytical job has its own needs in terms of preparation and pre-processing. In this talk we’ll take a look at three essential open source projects that work together to simplify and accelerate access to data in any format at any scale.
      Dremio is a new open source project for self-service data that’s like Google Docs for datasets. Users can easily search, sample, reshape, and blend datasets through a browser, then access them with their favorite tools, such as Python, R, and Tableau. Dremio builds on Apache Arrow, an open source, columnar, in-memory data representation that enables analytical systems and data sources to exchange and process data in real time, simplifying and accelerating data access without requiring all data to be copied into one location.
      Arrow has emerged as a popular way to handle in-memory data for analytical purposes. In the last year, Arrow has been embedded into a broad range of open source (and commercial) technologies, including GPU databases, machine learning libraries and tools, execution engines, and visualization frameworks (e.g., Anaconda, Dremio, Graphistry, H2O, MapD, Pandas, R, Spark).
      In this talk, we provide an overview of Dremio and Arrow, and outline how projects like Pandas are utilizing Arrow to achieve high-performance data processing and interoperability across systems. For example, we demonstrate a 50x speedup in PySpark (Spark-Pandas interoperability). We then show how companies can utilize Arrow to enable users to access and analyze data across disparate data sources without having to physically consolidate it into a centralized data repository. For example, we demonstrate a join between Parquet files on S3, Oracle tables, and Elasticsearch indices. Finally, we discuss the 12-month roadmap of Apache Arrow, including Arrow Kernels, an exciting area of development in the Arrow community.
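
      The Pandas integration mentioned above can be seen in miniature with pyarrow: move a DataFrame into Arrow's columnar format and back.

        import pandas as pd
        import pyarrow as pa

        df = pd.DataFrame({'city': ['Austin', 'Tokyo'], 'visits': [10, 3]})
        table = pa.Table.from_pandas(df)   # pandas -> Arrow columnar memory
        print(table.schema)
        roundtrip = table.to_pandas()      # Arrow -> pandas, cheap conversion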