Introduction to Spark SQL with Python


  • Provider: Datacamp
  • Level: All Levels
  • Format: video
  • Language: English
  • Duration: 4 hours

About

Learn how to manipulate data and create machine learning feature sets in Spark using SQL in Python.

Description

If you're familiar with SQL and have heard great things about Apache Spark, then this course is for you! Apache Spark is a computing framework for processing big data. Spark SQL is a component of Apache Spark that works with tabular data. Window functions are an advanced feature of SQL that take Spark to a new level of usefulness.

You will use Spark SQL to analyze time series. You will extract the most common sequences of words from a text document. You will create feature sets from natural language text and use them to predict the last word in a sentence using logistic regression. Spark combines the power of distributed computing with the ease of use of Python and SQL.

The course uses a natural language text dataset that is easy to understand. Sentences are sequences of words, and window functions are well suited to manipulating sequence data. The same techniques taught here can be applied to sequences of song identifiers, video IDs, or podcast IDs. Exercises include discovering frequent word sequences and converting word sequences into machine learning feature set data for training a text classifier.
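To give a feel for how a window function handles sequence data, here is a minimal sketch that counts adjacent word pairs with `LEAD`. It is not taken from the course: the table name `words`, its columns, and the toy sentences are all hypothetical, and it runs the query on Python's built-in `sqlite3` module (SQLite 3.25 or later) rather than on Spark, so that it works without a Spark installation.

```python
import sqlite3

# Hypothetical toy data: (sentence id, word position, word).
rows = [
    (1, 1, "the"), (1, 2, "quick"), (1, 3, "fox"),
    (2, 1, "the"), (2, 2, "quick"), (2, 3, "dog"),
]

# In-memory SQLite database standing in for a Spark table (an assumption
# made so the example is self-contained).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (sentence_id INTEGER, pos INTEGER, word TEXT)")
conn.executemany("INSERT INTO words VALUES (?, ?, ?)", rows)

# LEAD looks one row ahead within each sentence, ordered by position,
# which pairs every word with the word that follows it.
query = """
SELECT word, next_word, COUNT(*) AS n
FROM (
    SELECT word,
           LEAD(word) OVER (PARTITION BY sentence_id ORDER BY pos) AS next_word
    FROM words
) AS pairs
WHERE next_word IS NOT NULL
GROUP BY word, next_word
ORDER BY n DESC, word, next_word
"""

pair_counts = conn.execute(query).fetchall()
for row in pair_counts:
    print(row)
```

The query uses only standard SQL window syntax, so with PySpark the same text could plausibly be passed to `spark.sql(query)` after registering a DataFrame under the name `words` with `createOrReplaceTempView`. That step is not shown or tested here.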