Overview of Big Data
What Is Big Data?
E-commerce recommendation systems
- How to store the huge volume of orders
- How to compute over the huge volume of orders
Weather forecasting
- How to store the huge volume of weather data
- How to compute over the huge volume of weather data
The Core Problems
- Data storage
  - Distributed file system
    - HDFS (Hadoop Distributed File System)
- Data computation
  - Distributed computation
    - MapReduce
    - Spark (RDD: Resilient Distributed Dataset)
Data Warehouses and Big Data
Data Warehouse
A data warehouse is essentially a database (Oracle, MySQL, MS SQL Server) that is generally used only for queries.
The process of building a data warehouse
- Data sources
  - Structured data: RDBMS
  - Unstructured data: text, logs
- ETL (see the Sqoop sketch after this list)
  - Extract > Transform > Load
- Data warehouse
  - Raw data
- Data analysis
  - SQL
  - PL/SQL
  - JDBC programs
- Data marts
  - HR data mart
  - Sales data mart
- End users
  - HR systems
  - Sales systems
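In practice the extract step is often a bulk-import tool run against the source RDBMS. Below is a minimal sketch using Sqoop; the host, database, credentials, table, and target directory are all hypothetical placeholders.

```bash
# Extract: pull a relational table into HDFS with Sqoop (all names are placeholders)
sqoop import \
  --connect jdbc:mysql://dbhost:3306/sales \
  --username etl_user \
  --password-file /user/etl/.password \
  --table orders \
  --target-dir /warehouse/raw/orders \
  -m 1
# Transform and Load would then run inside the warehouse, e.g. as SQL over the raw data
```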
OLTP and OLAP
- OLTP: Online Transaction Processing (update, insert, delete). This is the workload that traditional relational databases are built to handle.
- OLAP: Online Analytical Processing (select). A data warehouse is an OLAP application; Hadoop and Spark can be seen as one kind of data warehouse solution.
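A rough illustration of the two workloads, assuming a hypothetical operational database `sales` and warehouse database `dw` reachable through the mysql client:

```bash
# OLTP: many short transactional statements against the operational database
mysql sales -e "INSERT INTO orders (user_id, amount) VALUES (42, 99.90);"
mysql sales -e "UPDATE orders SET amount = 89.90 WHERE order_id = 1001;"

# OLAP: fewer, heavier analytical queries over historical data in the warehouse
mysql dw -e "SELECT region, SUM(amount) FROM orders_history GROUP BY region;"
```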
Google's Foundational Ideas: The Three Papers
1. GFS
Why do we still need HDFS/GFS when we already have Oracle?
- At this scale Oracle becomes a bottleneck, and the cost is too high
- Google's low-cost approach
Google File System --> HDFS
- A distributed file system
- Solves the storage problem of big data
- In HDFS, the metadata recording where data is stored uses an inverted index
- What is an index?
  - Essentially a table of contents
  - An index lets you find the corresponding data quickly
- What is an inverted index?
  - Tokenize > deduplicate > sort (see the shell sketch below)
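A toy version of tokenize > deduplicate > sort in the shell, assuming two small text files `doc1.txt` and `doc2.txt`:

```bash
# Build a tiny inverted index: word -> the documents that contain it
for f in doc1.txt doc2.txt; do
  tr -s ' ' '\n' < "$f" |   # tokenize: one word per line
    grep -v '^$' |          # drop empty tokens
    sort -u |               # deduplicate within this document
    sed "s/\$/ $f/"         # tag each word with its document id
done |
  sort |                    # sort: identical words end up adjacent
  awk '{p[$1] = p[$1] " " $2} END {for (w in p) print w ":" p[w]}' |
  sort
```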
The HDFS Architecture
- NameNode: the name node
- Secondary NameNode: the secondary name node
- DataNode: the data node
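On a running (pseudo-)cluster you can watch these roles from the command line; `data.txt` and the paths are illustrative:

```bash
hdfs dfsadmin -report           # the NameNode reports live DataNodes and their capacity
hdfs dfs -mkdir -p /input       # a pure namespace operation, handled by the NameNode
hdfs dfs -put data.txt /input   # the file's blocks are written out to DataNodes
hdfs dfs -ls /input             # listing only touches NameNode metadata
```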
Hadoop Deployment Modes
- Local (standalone) mode
- Pseudo-distributed mode
- Fully distributed mode
HDFS = NameNode + SecondaryNameNode + DataNode
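Pseudo-distributed mode runs all of these daemons on a single machine. A minimal `core-site.xml` along the lines of the standard single-node setup; `$HADOOP_HOME` and the port are the usual defaults and may differ on your install:

```bash
cat > $HADOOP_HOME/etc/hadoop/core-site.xml <<'EOF'
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
EOF
hdfs namenode -format   # format the NameNode once, before the first start
```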
2. MapReduce
PageRank (web page ranking) --> MapReduce
YARN = ResourceManager + NodeManager
Start the daemons
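On a pseudo-distributed install this typically means running the bundled start scripts and verifying with `jps`:

```bash
start-dfs.sh    # starts NameNode, SecondaryNameNode, DataNode
start-yarn.sh   # starts ResourceManager, NodeManager
jps             # all five daemons should now be listed
```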
Run the word count example
```bash
# Word count
hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar wordcount /input/data.txt /output/20191011/wc
# Partial log output
19/10/11 08:48:52 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
19/10/11 08:48:53 INFO input.FileInputFormat: Total input paths to process : 1
19/10/11 08:48:53 INFO mapreduce.JobSubmitter: number of splits:1
19/10/11 08:48:53 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1570754699917_0001
19/10/11 08:48:53 INFO impl.YarnClientImpl: Submitted application application_1570754699917_0001
19/10/11 08:48:53 INFO mapreduce.Job: The url to track the job: http://127.0.0.1:8088/proxy/application_1570754699917_0001/
19/10/11 08:48:53 INFO mapreduce.Job: Running job: job_1570754699917_0001
19/10/11 08:48:58 INFO mapreduce.Job: Job job_1570754699917_0001 running in uber mode : false
19/10/11 08:48:58 INFO mapreduce.Job: map 0% reduce 0%
19/10/11 08:49:01 INFO mapreduce.Job: map 100% reduce 0%
19/10/11 08:49:05 INFO mapreduce.Job: map 100% reduce 100%
19/10/11 08:49:05 INFO mapreduce.Job: Job job_1570754699917_0001 completed successfully
19/10/11 08:49:05 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=93
FILE: Number of bytes written=211351
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=160
HDFS: Number of bytes written=55
HDFS: Number of read operations=6
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=1
Launched reduce tasks=1
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=1349
Total time spent by all reduces in occupied slots (ms)=1492
Total time spent by all map tasks (ms)=1349
Total time spent by all reduce tasks (ms)=1492
Total vcore-seconds taken by all map tasks=1349
Total vcore-seconds taken by all reduce tasks=1492
Total megabyte-seconds taken by all map tasks=1381376
Total megabyte-seconds taken by all reduce tasks=1527808
Map-Reduce Framework
Map input records=3
Map output records=12
Map output bytes=108
Map output materialized bytes=93
Input split bytes=101
Combine input records=12
Combine output records=8
Reduce input groups=8
Reduce shuffle bytes=93
Reduce input records=8
Reduce output records=8
Spilled Records=16
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=45
CPU time spent (ms)=0
Physical memory (bytes) snapshot=0
Virtual memory (bytes) snapshot=0
Total committed heap usage (bytes)=347602944
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=59
File Output Format Counters
Bytes Written=55
# View the results
hdfs dfs -cat /output/20191011/wc/part-r-00000
Beijing 2
China 2
I 2
capital 1
is 1
love 2
of 1
the 1
```
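The phases in the log map neatly onto a plain shell pipeline, which is a useful mental model for what MapReduce does:

```bash
# Word count as a pipeline: map -> shuffle/sort -> reduce
tr -s ' ' '\n' < data.txt |   # map: emit one word per line
  sort |                      # shuffle: bring identical keys together
  uniq -c                     # reduce: count each group of identical words
```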
3. BigTable
Big Table --> HBase
HBase = ZooKeeper + HMaster (master node) + RegionServer (worker nodes)
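To get a first feel for the BigTable data model, a short HBase shell session; the table and column family names are hypothetical:

```bash
hbase shell <<'EOF'
create 'students', 'info'                  # a table with one column family
put 'students', 'r1', 'info:name', 'Tom'   # a cell: row key + family:qualifier -> value
put 'students', 'r1', 'info:age', '24'
scan 'students'                            # read the rows back
EOF
```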