Course review: Getting Started with Data Analytics on AWS
A Coursera course by Amazon Web Services
As a data scientist, I want to learn more about the cloud. Getting Started with Data Analytics on AWS is a very short, 5-hour course on Amazon Web Services (AWS) with a focus on data analytics, so I gave it a shot. This is what I learned.
Step 1 - Setting up S3 object storage service
The course gave a brief overview of AWS from a data analytics perspective. It started with setting up S3, a cloud object storage service where you can create separate buckets for different purposes. This is how Amazon explains S3 [1]:
Amazon Simple Storage Service (Amazon S3) is an object storage service that offers industry-leading scalability, data availability, security, and performance. Customers of all sizes and industries can use Amazon S3 to store and protect any amount of data for a range of use cases, such as data lakes, websites, mobile applications, backup and restore, archive, enterprise applications, IoT devices, and big data analytics. Amazon S3 provides management features so that you can optimize, organize, and configure access to your data to meet your specific business, organizational, and compliance requirements.
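The course does this step in the browser console, but the same bucket can be created from Python. Below is a minimal sketch, assuming boto3 is installed and AWS credentials are configured. The helper names and the bucket name are made up for illustration and are not from the course. The one quirk worth showing is that us-east-1 must not be passed as a LocationConstraint.

```python
def create_bucket_kwargs(bucket_name: str, region: str) -> dict:
    """Build the keyword arguments for the S3 CreateBucket call."""
    kwargs = {"Bucket": bucket_name}
    # us-east-1 is the default region and is rejected as a LocationConstraint,
    # so the configuration block is only added for other regions.
    if region != "us-east-1":
        kwargs["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return kwargs


def create_bucket(bucket_name: str, region: str) -> None:
    """Create the bucket. Needs boto3 and valid AWS credentials."""
    import boto3  # imported lazily so the helper above works without boto3

    s3 = boto3.client("s3", region_name=region)
    s3.create_bucket(**create_bucket_kwargs(bucket_name, region))


# Hypothetical bucket name; bucket names must be globally unique.
print(create_bucket_kwargs("my-analytics-logs-example", "eu-north-1"))
```

Calling create_bucket("my-analytics-logs-example", "eu-north-1") would then create the bucket for real, so only run it against an account you are allowed to change.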
Step 2 - Enable CloudTrail for logging
The next step was to set up logging for your company’s AWS account by creating a trail in CloudTrail [2] that delivers its log files to an S3 bucket.
AWS CloudTrail is an AWS service that helps you enable governance, compliance, and operational and risk auditing of your AWS account. Actions taken by a user, role, or an AWS service are recorded as events in CloudTrail.
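To find the log files later, it helps to know where CloudTrail puts them. As far as I understand the AWS documentation, a trail writes its files under a key prefix of the form AWSLogs/<account-id>/CloudTrail/<region>/<yyyy>/<mm>/<dd>/ inside the bucket. Here is a small sketch that builds that prefix for a given day. The function name and the account ID are made up, and a trail configured with its own extra prefix would need that prepended.

```python
from datetime import date


def cloudtrail_prefix(account_id: str, region: str, day: date) -> str:
    """Return the S3 key prefix under which one day's CloudTrail logs are stored.

    Assumes the default layout with no custom trail prefix.
    """
    return f"AWSLogs/{account_id}/CloudTrail/{region}/{day:%Y/%m/%d}/"


# Hypothetical 12-digit account ID, for illustration only.
prefix = cloudtrail_prefix("123456789012", "eu-west-1", date(2021, 3, 7))
print(prefix)
```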
Step 3 - Setting up Athena to parse all log files
CloudTrail generates lots of compressed (gzip) JSON files with log data, which can then be used for analytics. To parse these files easily, the instructor shows how to set up Amazon Athena [3].
Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run.
My takeaway is that Athena creates a “data warehouse-like environment” where we don’t need to collect, unzip and parse all these JSON files before we can run SQL queries on their content. Quite nice! The data stays only in the files generated by CloudTrail, so nothing is duplicated in S3.
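For comparison, this is roughly what the manual route looks like for a single file. It is a minimal sketch using only the Python standard library, and it assumes the usual CloudTrail file layout of one JSON object with a top-level "Records" list. The sample events are invented, and the function names are my own.

```python
import gzip
import json
from collections import Counter


def read_cloudtrail_events(gz_bytes: bytes) -> list:
    """Decompress one CloudTrail log file and return its list of events."""
    payload = json.loads(gzip.decompress(gz_bytes).decode("utf-8"))
    return payload.get("Records", [])


def count_event_names(events: list) -> Counter:
    """Count how often each API call (eventName) appears."""
    return Counter(event.get("eventName", "unknown") for event in events)


# Invented sample data standing in for a real .json.gz file from the bucket.
sample = {
    "Records": [
        {"eventName": "ListBuckets", "eventSource": "s3.amazonaws.com"},
        {"eventName": "ListBuckets", "eventSource": "s3.amazonaws.com"},
        {"eventName": "ConsoleLogin", "eventSource": "signin.amazonaws.com"},
    ]
}
sample_gz = gzip.compress(json.dumps(sample).encode("utf-8"))

counts = count_event_names(read_cloudtrail_events(sample_gz))
print(counts.most_common())
```

Doing this for thousands of files means downloading, unzipping and aggregating every one of them yourself, which is exactly the work Athena takes off your hands.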
As the instructor shows, the output from our queries can be saved to a new bucket in S3.
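The course runs the queries in the Athena console, but the same thing can be scripted. The sketch below assumes boto3 is installed, credentials are configured, and a CloudTrail table already exists in Athena. The table name cloudtrail_logs, the database name and the results bucket are placeholders of mine, not names from the course.

```python
# Example query: the ten most frequent API calls in the CloudTrail logs.
# "cloudtrail_logs" is a hypothetical table name.
TOP_EVENTS_SQL = """
SELECT eventname, COUNT(*) AS calls
FROM cloudtrail_logs
GROUP BY eventname
ORDER BY calls DESC
LIMIT 10
"""


def athena_query_kwargs(sql: str, database: str, output_location: str) -> dict:
    """Build the keyword arguments for Athena's StartQueryExecution call."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        # Athena writes the result files to this S3 location.
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def start_query(sql: str, database: str, output_location: str) -> str:
    """Submit the query and return its execution ID. Needs boto3 and credentials."""
    import boto3  # imported lazily so the helper above works without boto3

    athena = boto3.client("athena")
    response = athena.start_query_execution(
        **athena_query_kwargs(sql, database, output_location)
    )
    return response["QueryExecutionId"]


# Placeholder database and bucket names.
request = athena_query_kwargs(TOP_EVENTS_SQL, "default", "s3://my-athena-results-example/")
print(request["ResultConfiguration"]["OutputLocation"])
```

The query runs asynchronously, so the returned ID is what you would use to check the status and fetch the results once it has finished.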
Step 4 - Visualize data in QuickSight
To visualize the data the way you would in Power BI, Tableau or QlikView, you can open Amazon QuickSight and connect it directly to Amazon Athena. According to Amazon, QuickSight is... [4]
… a fast, cloud-powered business intelligence service that delivers insights to everyone in your organization. As a fully managed service, Amazon QuickSight lets you easily create and publish interactive dashboards that include machine learning (ML) insights.
If you found this interesting, there is a four-week follow-up course on Coursera called Designing Data Lakes in AWS. There you will learn how to ingest data into the AWS Cloud from your own systems.
Facts about Getting Started with Data Analytics on AWS
It’s free apart from the final quiz and certificate. If you want those, you need to pay $39. I didn’t.
No installation needed, everything is done in your browser. You should set up an AWS account during the course in order to be able to follow the step by step instructions.
Time required to complete the course is about 5 hours.
[1] https://docs.aws.amazon.com/AmazonS3/latest/userguide/Welcome.html
[2] https://docs.aws.amazon.com/awscloudtrail/latest/userguide/cloudtrail-user-guide.html
[3] https://docs.aws.amazon.com/managedservices/latest/userguide/athena.html
[4] https://docs.aws.amazon.com/managedservices/latest/userguide/quicksight.html