Twitter Data
Token type   Month    count   Hash Tag Name
hashtag      200910   2       babylove
hashtag      200911   2       babylove
hashtag      200912   90      babylove
hashtag      200812   100     mycoolwife
hashtag      200901   201     mycoolwife
hashtag      200910   1       mycoolwife
hashtag      200912   500     mycoolwife
hashtag      200905   23      abc
hashtag      200907   1000    abc
[Spark RDD] Given two months x and y, where y > x, find the hashtag name whose number of tweets increased the most from month x to month y. Ignore the tweets in the months between x and y; just compare the number of tweets at month x and at month y. Report the hashtag name and the number of tweets in months x and y. Ignore any hashtag names that had no tweets in either month x or y. You can assume that the combination of hashtag and month is unique, so the same hashtag and month combination cannot occur more than once. Print the result to the terminal output using println.

For the above small example data set:
Input: x = 200910, y = 200912
Output: hashtagName: mycoolwife, countX: 1, countY: 500

For this subtask you can specify the months x and y as arguments to the script. This is required to test on the full-sized data. For example:
$ bash build_and_run.sh 200901 200902
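To make the comparison rule concrete, here is a minimal sketch in plain Scala collections (no Spark) that works through the sample table by hand. The object name SampleCheck, the helper biggestIncrease and the (month, count, hashtagName) tuple layout are illustrative assumptions, not part of the assignment; it is only meant as a reference to check an RDD solution against.

```scala
object SampleCheck {
  // (month, count, hashtagName) rows copied from the sample table above.
  val sample: Seq[(String, Long, String)] = Seq(
    ("200910", 2L, "babylove"),
    ("200911", 2L, "babylove"),
    ("200912", 90L, "babylove"),
    ("200812", 100L, "mycoolwife"),
    ("200901", 201L, "mycoolwife"),
    ("200910", 1L, "mycoolwife"),
    ("200912", 500L, "mycoolwife"),
    ("200905", 23L, "abc"),
    ("200907", 1000L, "abc")
  )

  // Hypothetical helper: returns (hashtagName, countX, countY) for the hashtag
  // with the largest countY - countX, or None if no hashtag has tweets in both months.
  def biggestIncrease(rows: Seq[(String, Long, String)],
                      x: String,
                      y: String): Option[(String, Long, Long)] = {
    // Counts per hashtag in month x and in month y (zero counts are ignored).
    val atX = rows.collect { case (m, c, name) if m == x && c > 0 => name -> c }.toMap
    val atY = rows.collect { case (m, c, name) if m == y && c > 0 => name -> c }.toMap
    // Keep only hashtags that appear in both months.
    val both = atX.keySet.intersect(atY.keySet).toSeq.map(n => (n, atX(n), atY(n)))
    if (both.isEmpty) None else Some(both.maxBy { case (_, cx, cy) => cy - cx })
  }

  def main(args: Array[String]): Unit = {
    biggestIncrease(sample, "200910", "200912").foreach { case (name, cx, cy) =>
      println(s"hashtagName: $name, countX: $cx, countY: $cy")
    }
  }
}
```

For x = 200910 and y = 200912, babylove goes from 2 to 90 (an increase of 88) and mycoolwife from 1 to 500 (an increase of 499), while abc has no tweets in either month and is ignored, so mycoolwife is reported.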
import org.apache.spark.sql._
import org.apache.spark.sql.types._
import org.apache.spark.SparkContext

object Main {
  def solution(sc: SparkContext, x: String, y: String) {
    // Load each line of the input data
    val twitterLines = sc.textFile("Assignment_Data/twitter-small.tsv")
    // Split each line of the input data into an array of strings
    val twitterdata = twitterLines.map(_.split("\t"))

    println("Months: x = " + x + ", y = " + y)

    // TODO: *** Put your solution here ***
  }

  // Do not edit the main function
  def main(args: Array[String]) {
    // Set log level
    import org.apache.log4j.{Logger, Level}
    Logger.getLogger("org").setLevel(Level.WARN)
    Logger.getLogger("akka").setLevel(Level.WARN)
    // Check command line arguments
    if(args.length != 2) {
      println("Expected two command line arguments: <month x> and <month y>")
    }
    // Initialise Spark
    val spark = SparkSession.builder
      .appName("Task2c")
      .master("local[4]")
      .config("spark.hadoop.validateOutputSpecs", "false")
      .config("spark.default.parallelism", 1)
      .getOrCreate()
    // Run solution code
    solution(spark.sparkContext, args(0), args(1))
    // Stop Spark
    spark.stop()
  }
}
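One possible way to fill in the TODO is sketched below using only RDD operations. It assumes the TSV columns appear in the same order as the sample table (token type, month, count, hashtag name), which the question does not state explicitly; the object name HashtagGrowthSketch and the helper biggestIncrease are hypothetical. Treat it as a sketch under those assumptions rather than the expected answer.

```scala
import org.apache.spark.rdd.RDD

object HashtagGrowthSketch {
  // Assumed column order (from the sample table):
  //   row(0) = token type, row(1) = month, row(2) = count, row(3) = hashtag name
  def biggestIncrease(rows: RDD[Array[String]],
                      x: String,
                      y: String): Option[(String, Long, Long)] = {
    // (hashtagName, count) pairs for month x and for month y, dropping zero counts.
    val countsX = rows.filter(r => r(1) == x).map(r => (r(3), r(2).toLong)).filter(_._2 > 0)
    val countsY = rows.filter(r => r(1) == y).map(r => (r(3), r(2).toLong)).filter(_._2 > 0)

    // The inner join keeps only hashtags that have tweets in both months.
    val joined = countsX.join(countsY).map { case (name, (cx, cy)) => (name, cx, cy) }

    // Pick the row with the largest increase countY - countX.
    if (joined.isEmpty()) None
    else Some(joined.reduce((a, b) => if (b._3 - b._2 > a._3 - a._2) b else a))
  }
}
```

Inside solution, the helper could then be called on the twitterdata RDD from the skeleton:

```scala
HashtagGrowthSketch.biggestIncrease(twitterdata, x, y) match {
  case Some((name, cx, cy)) =>
    println(s"hashtagName: $name, countX: $cx, countY: $cy")
  case None =>
    println("No hashtag has tweets in both months")
}
```

The task does not say how to break ties between hashtags with the same increase, so this sketch simply returns one of them.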