
Big Data Overview

What Is Big Data?

E-commerce recommendation systems

  • How to store massive volumes of orders
  • How to compute over massive volumes of orders

Weather forecasting

  • How to store massive volumes of weather data
  • How to compute over massive volumes of weather data

The Core Problems

  • Data storage
    • Distributed file system: HDFS (Hadoop Distributed File System)
  • Data computation
    • Distributed computing: MapReduce and Spark (RDD, Resilient Distributed Datasets)

Data Warehouses and Big Data

Data Warehouse

A data warehouse is essentially a database (Oracle, MySQL, MS SQL Server) that is used almost exclusively for queries rather than transactional updates.

Building a Data Warehouse

  • Data sources
    • Structured data: RDBMS
    • Unstructured data: text, logs
  • ETL (a toy pipeline is sketched after this list)
    • Extract > Transform > Load
  • Data warehouse
    • Raw data
  • Data analysis
    • SQL
    • PL/SQL
    • JDBC programs
  • Data marts
    • HR data mart
    • Sales data mart
  • End users
    • HR systems
    • Sales systems
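
As a rough illustration of Extract > Transform > Load, here is a toy shell pipeline; the file name, column layout, and warehouse path are all invented for this sketch:

```bash
# Extract: read the raw CSV export from the source system.
# Transform: keep only completed orders and the three columns the warehouse needs.
# Load: append the cleaned rows to the warehouse's raw-data area.
awk -F',' '$4 == "COMPLETED" {print $1 "," $2 "," $3}' orders.csv \
  >> /data/warehouse/orders/raw.csv
```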

OLTP and OLAP

  • OLTP: Online Transaction Processing (update, insert, delete); the workload that traditional relational databases are built to handle
  • OLAP: Online Analytical Processing (select). A data warehouse is an OLAP application system; Hadoop and Spark can be seen as data warehouse solutions

Google's Core Ideas: The Three Papers

1. GFS

With Oracle already available, why do we still need GFS and HDFS?

  • A single database becomes a bottleneck, and scaling it up costs too much
  • Google's low-cost approach: scale out across cheap commodity machines

Google File System --> HDFS

  1. A distributed file system
  2. Solves the storage problem for big data
  3. In HDFS, the information about where data is kept (the metadata) is recorded, following the idea of an inverted index
    1. What is an index?
      1. Essentially a table of contents
      2. It lets you locate the corresponding data quickly
    2. What is an inverted index?
      1. Tokenize > deduplicate > sort (see the pipeline sketch after this list)
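
The tokenize/deduplicate/sort steps can be sketched as a shell pipeline (data.txt stands in for any text file; sort runs before uniq because uniq only collapses adjacent duplicates):

```bash
# Tokenize: split on whitespace, one word per line.
# Sort and deduplicate: the result is the sorted term list of an inverted index;
# a real index would also record, for each term, which documents contain it.
tr -s ' ' '\n' < data.txt | sort | uniq
```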

HDFS Architecture

  • NameNode: the name node; manages the file system namespace and the block metadata
  • SecondaryNameNode: the secondary name node; periodically merges the edit log into the filesystem image (it is not a hot standby)
  • DataNode: a data node; stores the actual data blocks

Hadoop Deployment Modes

  • Local (standalone) mode
  • Pseudo-distributed mode
  • Fully distributed mode

HDFS = NameNode + SecondaryNameNode + DataNode
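
A few common HDFS shell commands for working with this architecture (the paths are examples only):

```bash
hdfs dfs -mkdir -p /input       # create a directory in HDFS
hdfs dfs -put data.txt /input   # upload a local file
hdfs dfs -ls /input             # list a directory
hdfs dfs -cat /input/data.txt   # print a file's contents
hdfs dfsadmin -report           # cluster status, including live DataNodes
```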

2. MapReduce

PageRank (web page ranking) --> MapReduce
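
As a rough analogy for the programming model (not how Hadoop actually executes it), a shell pipeline can play all three roles: the map step emits one key per word, sort acts as the shuffle that groups identical keys, and the reduce step aggregates each group:

```bash
# map:     emit one word per line
# shuffle: sort brings identical words together
# reduce:  uniq -c counts each group, like wordcount's reducer
tr -s ' ' '\n' < data.txt | sort | uniq -c
```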

YARN = ResourceManager + NodeManager

Starting the Nodes
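
In pseudo-distributed mode, the stock start scripts (in Hadoop's sbin directory, assumed to be on PATH here) bring up all of the daemons listed above on one machine:

```bash
start-dfs.sh    # starts NameNode, SecondaryNameNode, DataNode
start-yarn.sh   # starts ResourceManager, NodeManager
jps             # list running Java processes to verify all five daemons are up
```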

Running Word Count

```bash
# Word count
hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar wordcount /input/data.txt /output/20191011/wc

# Partial log output
19/10/11 08:48:52 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
19/10/11 08:48:53 INFO input.FileInputFormat: Total input paths to process : 1
19/10/11 08:48:53 INFO mapreduce.JobSubmitter: number of splits:1
19/10/11 08:48:53 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1570754699917_0001
19/10/11 08:48:53 INFO impl.YarnClientImpl: Submitted application application_1570754699917_0001
19/10/11 08:48:53 INFO mapreduce.Job: The url to track the job: http://127.0.0.1:8088/proxy/application_1570754699917_0001/
19/10/11 08:48:53 INFO mapreduce.Job: Running job: job_1570754699917_0001
19/10/11 08:48:58 INFO mapreduce.Job: Job job_1570754699917_0001 running in uber mode : false
19/10/11 08:48:58 INFO mapreduce.Job:  map 0% reduce 0%
19/10/11 08:49:01 INFO mapreduce.Job:  map 100% reduce 0%
19/10/11 08:49:05 INFO mapreduce.Job:  map 100% reduce 100%
19/10/11 08:49:05 INFO mapreduce.Job: Job job_1570754699917_0001 completed successfully
19/10/11 08:49:05 INFO mapreduce.Job: Counters: 49
	File System Counters
		FILE: Number of bytes read=93
		FILE: Number of bytes written=211351
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=160
		HDFS: Number of bytes written=55
		HDFS: Number of read operations=6
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=2
	Job Counters 
		Launched map tasks=1
		Launched reduce tasks=1
		Data-local map tasks=1
		Total time spent by all maps in occupied slots (ms)=1349
		Total time spent by all reduces in occupied slots (ms)=1492
		Total time spent by all map tasks (ms)=1349
		Total time spent by all reduce tasks (ms)=1492
		Total vcore-seconds taken by all map tasks=1349
		Total vcore-seconds taken by all reduce tasks=1492
		Total megabyte-seconds taken by all map tasks=1381376
		Total megabyte-seconds taken by all reduce tasks=1527808
	Map-Reduce Framework
		Map input records=3
		Map output records=12
		Map output bytes=108
		Map output materialized bytes=93
		Input split bytes=101
		Combine input records=12
		Combine output records=8
		Reduce input groups=8
		Reduce shuffle bytes=93
		Reduce input records=8
		Reduce output records=8
		Spilled Records=16
		Shuffled Maps =1
		Failed Shuffles=0
		Merged Map outputs=1
		GC time elapsed (ms)=45
		CPU time spent (ms)=0
		Physical memory (bytes) snapshot=0
		Virtual memory (bytes) snapshot=0
		Total committed heap usage (bytes)=347602944
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGTH=0
		WRONG_MAP=0
		WRONG_REDUCE=0
	File Input Format Counters 
		Bytes Read=59
	File Output Format Counters 
		Bytes Written=55

# View the results
hdfs dfs -cat /output/20191011/wc/part-r-00000
Beijing	2
China	2
I	2
capital	1
is	1
love	2
of	1
the	1
```

3. BigTable

Big Table --> HBase

HBase = ZooKeeper + HMaster (the master node) + RegionServer (the worker nodes)
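
To get a feel for HBase's data model, here are a few commands run inside the HBase shell (the table name 'users' and the column family 'info' are invented for this sketch):

```bash
hbase shell                               # enter the HBase shell; the rest run inside it
create 'users', 'info'                    # create table 'users' with column family 'info'
put 'users', 'row1', 'info:name', 'Tom'   # write one cell
get 'users', 'row1'                       # read the row back
scan 'users'                              # scan the whole table
```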
