
Spark, December 14, 2018

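Spark example 2: analyzing the fee-detail CSV exported from Amazon Associates

This example computes the statistics you care about and saves part of that data as CSV. It shows how to read and write CSV files in Spark 2.0+ and how to run the statistics with Spark SQL.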

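import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{DoubleType, IntegerType, StringType, StructField, StructType}

object AmazonFeeSQL {
  def main(arg: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("AmazonFeeSQL").master("local[*]").getOrCreate()

    // Define a schema for the data being read
    val taxiSchema = StructType(Array(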
      StructField("Category", StringType, true),
      StructField("Name", StringType, true),
      StructField("ASIN", StringType, true),
      StructField("Seller", StringType, true),
      StructField("Tracking ID", StringType, true),
      StructField("Date Shipped", StringType, true),
      StructField("Price", StringType, true),
      StructField("Items Shipped", IntegerType, true),
      StructField("Returns", IntegerType, true),
      StructField("Revenue", DoubleType, true),
      StructField("Ad Fees", DoubleType, true),
      StructField("Device Type Group", StringType, true),
      StructField("Direct", StringType, true)
    ))
    val path = "E:\\newcode\\MyFirstProject\\data\\amazon\\fee"
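    // option("header", "true") skips the title row on the first line
    val data = spark.read.option("header", "true").schema(taxiSchema).csv(path)
    // data.show()
    // createOrReplaceTempView does not fail if the view already exists in this session
    data.createOrReplaceTempView("amazon_fee")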
    val df = data.toDF()

    // Categories in descending order of popularity (row count per category)
    val resultRdd = df.sqlContext.sql("select Category, count(Category) as cateNum from amazon_fee GROUP BY Category order by cateNum DESC")
    resultRdd.show()

    // Products in the most popular category (Home in this data set)
    val top1Rdd = df.sqlContext.sql("select * from amazon_fee WHERE Category = 'Home'")
    top1Rdd.show()

    // Products ordered by revenue, highest first
    val earnTopRdd = df.sqlContext.sql("SELECT * FROM amazon_fee ORDER BY Revenue DESC")
    earnTopRdd.show()

    // Products returned most often
    val returnTopRdd = df.sqlContext.sql("SELECT * FROM amazon_fee WHERE Returns > 0  ORDER BY Returns DESC")
    returnTopRdd.show()

    // Count products per price range. Price is read as a string, so it is cast to DOUBLE explicitly.
    // Ranges are half-open so that every price from 0 up to 100 falls into exactly one range.
    val priceRangeRdd = df.sqlContext.sql(
      """SELECT price_range, count(*) AS number FROM (
        |  SELECT CASE
        |    WHEN p >= 0 AND p < 5 THEN '0-5'
        |    WHEN p >= 5 AND p < 10 THEN '005-10'
        |    WHEN p >= 10 AND p < 15 THEN '010-15'
        |    WHEN p >= 15 AND p < 20 THEN '015-20'
        |    WHEN p >= 20 AND p < 25 THEN '020-25'
        |    WHEN p >= 25 AND p < 50 THEN '025-50'
        |    WHEN p >= 50 AND p < 100 THEN '050-100'
        |    ELSE '100+' END AS price_range
        |  FROM (SELECT CAST(Price AS DOUBLE) AS p FROM amazon_fee) AS prices
        |) AS price_summaries
        |GROUP BY price_range ORDER BY price_range""".stripMargin)
    priceRangeRdd.show()


    // Products in the two most purchased categories
    val top3Rdd = df.sqlContext.sql("SELECT * FROM amazon_fee WHERE Category = 'Home' OR Category = 'Toys & Games'")
    top3Rdd.show()
    // Spark 2.0+ has a built-in CSV writer, so no third-party format name is needed
    top3Rdd.write.csv("E:\\newcode\\MyFirstProject\\data\\Home_ToysGames_BestSeller")

    spark.stop()
  }
}
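
The price-range bucketing done by the CASE expression can be checked in isolation, without Spark. The sketch below is only an illustration: `PriceBuckets`, `priceRange` and the sample prices are hypothetical names and data, not part of the original job, and it assumes half-open ranges (for example 5 <= price < 10) so that every price from 0 up to 100 falls into exactly one range.

```scala
object PriceBuckets {
  // Hypothetical helper mirroring the SQL CASE expression, using half-open ranges [low, high).
  // Negative prices fall through to "100+", just like the ELSE branch of the CASE.
  def priceRange(price: Double): String =
    if (price >= 0 && price < 5) "0-5"
    else if (price >= 5 && price < 10) "005-10"
    else if (price >= 10 && price < 15) "010-15"
    else if (price >= 15 && price < 20) "015-20"
    else if (price >= 20 && price < 25) "020-25"
    else if (price >= 25 && price < 50) "025-50"
    else if (price >= 50 && price < 100) "050-100"
    else "100+"

  def main(args: Array[String]): Unit = {
    // Made-up sample prices, for illustration only
    val prices = Seq(3.5, 4.99, 5.0, 10.0, 12.0, 99.99, 250.0)
    // Group by range label and count, like GROUP BY price_range
    val counts = prices.groupBy(priceRange).map { case (range, ps) => range -> ps.size }
    // Sort by label, like ORDER BY price_range
    counts.toSeq.sortBy(_._1).foreach { case (range, n) => println(s"$range\t$n") }
  }
}
```

The zero-padded labels such as '005-10' exist so that sorting the labels as strings keeps the ranges in ascending price order.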

Author: east

Spark, December 14, 2018

Spark example 1: computing the average age of a population of 100,000

Code that generates the sample data file for the 100,000 people:

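import java.io.{File, FileWriter}
import scala.util.Random

object SampleDataFileGenerator {

  def main(args: Array[String]): Unit = {
    val writer = new FileWriter(new File("d:\\sample_age_data.txt"), false)
    val rand = new Random()
    // One line per person: "<id> <age>", with the age drawn at random from 0 to 99.
    // 100000 matches the population of 100,000 named in the title.
    for (i <- 1 to 100000) {
      writer.write(s"$i ${rand.nextInt(100)}")
      writer.write(System.getProperty("line.separator"))
    }
    writer.flush()
    writer.close()
  }
}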



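Computing the average age with an RDD:
To compute the average age, first transform the RDD backed by the source file into an RDD that holds only the ages; next count its elements to get the number of people; then add up all the ages; finally, average age = total age / number of people.
For the first step, use the map operator to turn the source RDD into a new RDD that contains only age data: the function passed to map calls split on each line and keeps the second element of the resulting array, which is the age.
The second step applies the count operator to the RDD produced in the first step.
The third step applies the reduce operator with addition to sum all elements of the age-only RDD. A final division gives the average age.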

import org.apache.spark.{SparkConf, SparkContext}

object AvgAgeCalculator {
  def main(args: Array[String]): Unit = {
   /* if (args.length < 1){
      println("Usage:AvgAgeCalculator datafile")
      System.exit(1)
    }*/
    val conf = new SparkConf().setAppName("Spark Exercise:Average Age Calculator")
    conf.setMaster("local")
    val sc = new SparkContext(conf)
    val dataFile = sc.textFile("d:\\sample_age_data.txt", 5)
    val count = dataFile.count()
    val ageData = dataFile.map(line => line.split(" ")(1))
    // Sum with the RDD reduce operator instead of collecting every age to the driver first
    val totalAge = ageData.map(age => age.toInt).reduce((a, b) => a + b)
    println("Total Age:" + totalAge + ";Number of People:" + count )
    val avgAge : Double = totalAge.toDouble / count.toDouble
    println("Average Age is " + avgAge)
    sc.stop()
  }
}
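
The same three steps (map, count, reduce) can be sanity-checked without a cluster on a plain Scala collection, whose `map`, `size` and `reduce` mirror the RDD operators. This is only a sketch: `AvgAgeLocal`, `averageAge` and the sample lines are hypothetical, and it assumes a non-empty input in which every line has the form "<id> <age>".

```scala
object AvgAgeLocal {
  // Returns (total age, number of people, average age) for lines of the form "<id> <age>".
  // Assumes a non-empty input; reduce throws on an empty collection.
  def averageAge(lines: Seq[String]): (Long, Long, Double) = {
    val ages = lines.map(line => line.split(" ")(1).toLong) // step 1: keep only the age
    val count = ages.size.toLong                             // step 2: number of people
    val total = ages.reduce(_ + _)                           // step 3: sum of all ages
    (total, count, total.toDouble / count)
  }

  def main(args: Array[String]): Unit = {
    // Made-up sample lines, for illustration only
    val sample = Seq("1 20", "2 30", "3 40")
    val (total, count, avg) = averageAge(sample)
    println(s"Total Age: $total; Number of People: $count")
    println(s"Average Age is $avg")
  }
}
```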

Author: east

Spark, December 14, 2018

Reading and writing CSV files in Spark 2.0+

Spark 2.0+ can read and write CSV files conveniently without integrating a third-party library.

import org.apache.spark.sql.{DataFrame, SparkSession}

object SparkReadFile {
  def main(args: Array[String]): Unit = {
    val localpath = "E:\\input\\test.csv"
    val outpath = "E:\\output\\word"
    val spark = SparkSession.builder()
      .appName("SparkReadFile")
      .master("local")
      .getOrCreate()
    // Read the CSV file with the built-in csv source
    val data: DataFrame = spark.read
      .option("header", "false") // "true" if the first line of the CSV holds column names, otherwise "false"
      .option("inferSchema", "true") // infer each column's data type automatically
      .csv(localpath)
    // data.show()
    // Write the CSV file
    data.repartition(1).write
      .option("header", "false") // "true" to write column names as the first line, otherwise "false"
      .option("delimiter", ",") // "," is the default delimiter
      .csv(outpath)
    spark.stop()
  }
}

Author: east