APAN 5400 - Managing Data - Assignment 9 Apache Spark Distributed Application


  • Assignment title: APAN 5400 - Assignment 9 Apache Spark Distributed Application
  • Course: Columbia University APAN 5400 Managing Data
  • Turnaround: 2 days

Develop an Apache Spark application against the Crunchbase Open Data Map organizations dataset, per the provided specifications, using PySpark in Google Colab.

Details

Use the Week11_ClassExercise.ipynb (this file was sent to you in an announcement) as a reference:

  • Create a new notebook in Google Colab
  • Upload the crunchbase_odm_orgs.csv file (sent to you in an announcement) to the “Files” section of your Colab notebook (this may take a few minutes)
  • Read the Crunchbase Orgs dataset into a Spark DataFrame

Implement PySpark code using DataFrames, RDDs, or Spark UDF functions:

  1. Find all entities whose name starts with the letter “F” (e.g. Facebook):
    • print the count and show() the resulting Spark DataFrame
  2. Find all entities located in New York City:
    • print the count and show() the resulting Spark DataFrame
  3. Add a “Blog” column to the DataFrame with the row entries set to 1 if the “domain” field contains “blogspot.com”, and 0 otherwise.
    • show() only the records with the “Blog” field marked as 1
  4. Find all entities whose names are palindromes (a name that reads the same forward and backward, e.g. madam):
    • print the count and show() the resulting Spark DataFrame


Assessment

Please see the attached rubric for detailed assessment criteria.

Submission

To complete your submission,

  1. Please submit a PDF file or Word Document.
  2. Click the blue Submit Assignment button at the top of this page.
  3. Click the Choose File button, and locate your submission.
  4. Feel free to include a comment with your submission.
  5. Finally, click the blue Submit Assignment button.



Author: IT神助攻
Copyright: Unless otherwise stated, all posts on this blog are licensed under CC BY 4.0. Please credit IT神助攻 when reposting!