記事

Teradata Does Open Source! Introduction to the R and Python Packages for Vantage

In part two of this three-part series, you’ll learn how to use Teradata's R and Python packages, tdplyr and teradataml, to run machine learning and predictive analytics in Vantage at scale.

Tim Miller

2020年2月9日 4 分で読める

In part 1 of this blog series, we introduced the two approaches that R and Python programmers can use to leverage the Teradata Vantage platform. We also touched on why this is important and how both techniques work at a high level. In this blog we focus in on the client-side languages and packages for R and Python – tdplyr and teradataml.

When R and Python are running external to Vantage, Teradata provides open source packages and native language SQL drivers that provide interfaces to the functionality available in Vantage. In addition, the packages are built upon well-known existing R and Python packages – chances are you are using these packages today, making your transition to Vantage as seamless as possible.

With this approach, users add the native language SQL driver along with one of two open source libraries from Teradata – tdplyr for R and teradataml for Python – to their R or Python installations. Both packages provide common functionality in the area of database context, connection and management. Most importantly, these packages make R Data Frames or pandas DataFrames appear locally to the programmer but are virtually pointing to tables or views in Vantage. What this means is that data is not moved back and forth between the client and Vantage, only metadata or small data samples.

In addition, both packages provide R or Python interfaces for Vantage Machine Learning, Graph and Advanced SQL Engine functions. This provides the R or Python programmer with 100’s of algorithms that run directly on the Vantage platform with the performance and scalability required to solve use cases process with exponentially growing data volumes. Let’s take a look at these open source packages.

Teradata SQL Drivers

The Teradata SQL Driver for R/Python enable applications written in R/Python to connect to Vantage. Based upon a common goSQL Driver, the Teradata SQL Drivers for R/Python are lighter weight than their ODBC or JDBC counterparts and make executing SQL from R/Python easier than ever before. Additionally, these drivers support Teradata’s FastLoad protocol for moving bulk data locally to the Vantage platform in parallel.

Teradata Package for R - tdplyr

As the name indicates, tdplyr is based upon the well-known R package dplyr, arguably the most widely used R package for data manipulation and preparation. R users can select variables based on their names, pick rows based on variable values, add new variables that are functions of existing variables, reduce multiple values down to a single summary or change the ordering of the rows using dplyr’s verbs. These functions all combine naturally with grouping, lending itself well to a SQL implementation; thus, dplyr’s companion package dbplyr, which abstracts these verbs to work against database tables, using the exact same R code that manipulates local Data Frames, is also used by tdplyr.

The look and feel are very similar to that of interacting with a regular Data Frame in R. Instead of a Data Frame, a tibble is used to represent data in the form of a table, view, or query in Vantage. The tdplyr and dplyr functions that access or manipulate a remote tibble are translated to equivalent SQL to be executed in Vantage through the Teradata SQL Driver for R connection. Only a subset of data and/or metadata is ever retrieved from Vantage unless explicitly requested – for example, copying a remote tibble to a local R Data Frame will result in bulk data movement.

In addition, tdplyr provides R interfaces to Vantage’s advanced analytic functions that are almost identical to their R counterparts – although the function names might differ, the arguments are as close to the R function’s signature as possible.

This short video, Using R and Python with Teradata Vantage: Part 2 tdplyr demo, gives you a step-by-step demonstration on how to use tdplyr.

Teradata Package for Python - teradataml

For Python users familiar with the Pandas Python package, the teradataml package builds on the concept and syntax of the pandas DataFrame object by creating the teradataml DataFrame object. A teradataml DataFrame is a reference to a database object on the Python client, representing a table, view, or query in the Vantage Advance SQL Engine.

The look and feel of a teradataml DataFrame is like a pandas DataFrame in Python, and the teradataml library provides an API to access and manipulate a teradataml DataFrame. These functions generate SQL requests that are executed in Vantage through the Teradata SQL Driver for Python connection. The teradataml package uses teradatasqlalchemy, an implementation of SQLAlchemy’ s Dialect interface, to provide enhanced support for data exploration and preparation.

SQLAlchemy considers the database to be a relational algebra engine, not just a collection of tables. Its object-relation mapper (ORM) maps Python classes and specific pandas DataFrame constructs to the database in such a way that rows can be selected from not only tables but also other select statements. Any of these objects can be composed into a larger structure using its so-called complimentarily oriented approach; “instead of hiding away SQL and object relational details behind a wall of automation, all processes are fully exposed within a series of composable, transparent tools,”* giving full control to the Python developer.

As with tdplyr, only a subset of data or metadata is ever retrieved from Vantage, unless the user explicitly requests data to be transferred to the client. For example, copying a teradataml DataFrame to a local pandas Data Frame will result in bulk data movement. Also, teradataml provides Python interfaces to Vantage’s advanced analytic functions, with signatures as close to their Python counterparts as possible.

This short video, Using R and Python with Teradata Vantage: Part 3 teradataml demo, gives you a step-by-step demonstration on how to use teradataml.

This is part two of a three-part blog series. Look for the third blog where we take a deeper dive into the second approach for utilizing R and Python in Vantage – the server-side languages and packages.

Start Optimizing Your Data Science Process

Take advantage of Vantage’s massively parallel platform (MPP) for performance and scalability while using R and Python. If you’re on a previous version of Teradata, and curious about upgrading to Vantage, contact us today.

*https://www.sqlalchemy.org – SQLAlchemy’s Philosophy