Twitter Data

Post by **answerhappygod** » Fri May 20, 2022 5:27 pm

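| Token type | Month  | count | Hash Tag Name |
|------------|--------|-------|---------------|
| hashtag    | 200910 | 2     | babylove      |
| hashtag    | 200911 | 2     | babylove      |
| hashtag    | 200912 | 90    | babylove      |
| hashtag    | 200812 | 100   | mycoolwife    |
| hashtag    | 200901 | 201   | mycoolwife    |
| hashtag    | 200910 | 1     | mycoolwife    |
| hashtag    | 200912 | 500   | mycoolwife    |
| hashtag    | 200905 | 23    | abc           |
| hashtag    | 200907 | 1000  | abc           |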

[Spark RDD] Given two months x and y, where y > x, find the hashtag name whose number of tweets increased the most from month x to month y. Ignore the tweets in the months between x and y; just compare the number of tweets in month x with the number in month y. Report the hashtag name and the number of tweets in months x and y. Ignore any hashtag names that had no tweets in either month x or month y. You can assume that the combination of hashtag and month is unique, so the same hashtag and month combination cannot occur more than once. Print the result to the terminal output using println.

For the above small example data set:

Input: x = 200910, y = 200912

Output: hashtagName: mycoolwife, countX: 1, countY: 500

For this subtask you can specify the months x and y as arguments to the script. This is required to test on the full-sized data. For example:

    $ bash build_and_run.sh 200901 200902

    import org.apache.spark.sql._
    import org.apache.spark.sql.types._
    import org.apache.spark.SparkContext

    object Main {
      def solution(sc: SparkContext, x: String, y: String) {
        // Load each line of the input data
        val twitterLines = sc.textFile("Assignment_Data/twitter-small.tsv")
        // Split each line of the input data into an array of strings
        val twitterdata = twitterLines.map(_.split("\t"))

        println("Months: x = " + x + ", y = " + y)

        // TODO: *** Put your solution here ***
      }

      // Do not edit the main function
      def main(args: Array[String]) {
        // Set log level
        import org.apache.log4j.{Logger, Level}
        Logger.getLogger("org").setLevel(Level.WARN)
        Logger.getLogger("akka").setLevel(Level.WARN)
        // Check command line arguments
        if(args.length != 2) {
          println("Expected two command line arguments: <month x> and <month y>")
        }
        // Initialise Spark
        val spark = SparkSession.builder
          .appName("Task2c")
          .master("local[4]")
          .config("spark.hadoop.validateOutputSpecs", "false")
          .config("spark.default.parallelism", 1)
          .getOrCreate()
        // Run solution code
        solution(spark.sparkContext, args(0), args(1))
        // Stop Spark
        spark.stop()
      }
    }
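One possible way to fill in the TODO is sketched below. It is not the assignment's reference solution. The object name `HashtagIncrease` and the function `biggestIncrease` are hypothetical. The sketch assumes the TSV columns are in the order shown in the sample table (token type, month, count, hashtag name) and that counts fit in a `Long`. It builds one (hashtag, count) pair RDD per month, inner-joins them so that hashtags missing from either month drop out, and reduces to the pair with the largest difference. Ties are broken arbitrarily.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object HashtagIncrease {
  // Assumed column order (taken from the sample table, not confirmed by the post):
  // 0 = token type, 1 = month, 2 = count, 3 = hashtag name.
  def biggestIncrease(rows: RDD[Array[String]],
                      x: String,
                      y: String): Option[(String, Long, Long)] = {
    // (hashtagName, count) pairs for a single month
    def countsFor(month: String): RDD[(String, Long)] =
      rows
        .filter(f => f.length >= 4 && f(1) == month)
        .map(f => (f(3), f(2).toLong))

    // Inner join: only hashtags with a row in BOTH months survive
    val joined = countsFor(x)
      .join(countsFor(y))
      .map { case (name, (countX, countY)) => (name, countX, countY) }

    // reduce throws on an empty RDD, so guard against "no common hashtag"
    if (joined.isEmpty()) None
    else Some(joined.reduce((a, b) => if (b._3 - b._2 > a._3 - a._2) b else a))
  }
}

// Possible use inside solution(sc, x, y), after twitterdata is defined:
//   HashtagIncrease.biggestIncrease(twitterdata, x, y) match {
//     case Some((name, cx, cy)) =>
//       println("hashtagName: " + name + ", countX: " + cx + ", countY: " + cy)
//     case None =>
//       println("No hashtag has tweets in both months")
//   }
```

The join-based approach keeps everything in RDD operations, which matches the [Spark RDD] tag on the task. Since each hashtag and month combination is stated to be unique, no per-key aggregation is needed before the join.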