APAN 5400 - Managing Data - Assignment 9 Apache Spark Distributed Application


  • Assignment title: APAN 5400 - Assignment 9 Apache Spark Distributed Application
  • Course: Columbia University APAN 5400 Managing Data
  • Turnaround: 2 days

Develop an Apache Spark application against the Crunchbase Open Data Map organizations dataset, per the provided specifications, using PySpark in Google Colab.

Details

Use the Week11_ClassExercise.ipynb (this file was sent to you in an announcement) as a reference:

  • Create a new notebook in Google Colab
  • Upload the crunchbase_odm_orgs.csv file (sent to you in an announcement) to the “Files” section of your Colab notebook (this may take a few minutes)
  • Read the Crunchbase Orgs dataset into a Spark DataFrame

Implement PySpark code using DataFrames, RDDs, or Spark UDF functions:

  1. Find all entities whose name starts with the letter “F” (e.g. Facebook):
    • print the count and show() the resulting Spark DataFrame
  2. Find all entities located in New York City:
    • print the count and show() the resulting Spark DataFrame
  3. Add a “Blog” column to the DataFrame with the row entries set to 1 if the “domain” field contains “blogspot.com”, and 0 otherwise.
    • show() only the records with the “Blog” field marked as 1
  4. Find all entities whose names are palindromes (a name that reads the same forward and backward, e.g. madam):
    • print the count and show() the resulting Spark DataFrame


Assessment

Please see the attached rubric for detailed assessment criteria.

Submission

To complete your submission,

  1. Please submit a PDF file or Word Document.
  2. Click the blue Submit Assignment button at the top of this page.
  3. Click the Choose File button, and locate your submission.
  4. Feel free to include a comment with your submission.
  5. Finally, click the blue Submit Assignment button.



Author: IT神助攻
Copyright: Unless otherwise stated, all posts on this blog are licensed under CC BY 4.0. Please credit IT神助攻 when reposting!