We will use the following dataset in the Athena basics lab.
First, create the following four tables.
The four corresponding SQL statements are stored under Saved queries; open and run each of them in turn.
The SQL that creates the CSV-file tables is as follows:
CREATE EXTERNAL TABLE customers_csv (
card_id bigint,
customer_id bigint,
lastname string,
firstname string,
email string,
address string,
birthday string,
country string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://athena-workshop-145197526627/basics/csv/customers/'
TBLPROPERTIES (
'areColumnsQuoted'='false',
'classification'='csv',
'columnsOrdered'='true',
'compressionType'='none',
'delimiter'=',',
'skip.header.line.count'='1',
'typeOfData'='file');
CREATE EXTERNAL TABLE sales_csv(
card_id bigint,
customer_id bigint,
price string,
product_id bigint,
timestamp string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://athena-workshop-145197526627/basics/csv/sales/'
TBLPROPERTIES (
'areColumnsQuoted'='false',
'classification'='csv',
'columnsOrdered'='true',
'compressionType'='none',
'delimiter'=',',
'skip.header.line.count'='1',
'typeOfData'='file');
The script that creates the Parquet tables contains three SQL statements; you must first select only the CREATE EXTERNAL TABLE statement and run it:
CREATE EXTERNAL TABLE `customers_parquet`(
card_id bigint,
customer_id bigint,
lastname string,
firstname string,
email string,
address string,
birthday string)
PARTITIONED BY (
`country` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3://athena-workshop-145197526627/basics/parquet/customers/'
TBLPROPERTIES (
'classification'='parquet',
'compressionType'='none',
'partition_filtering.enabled'='true',
'typeOfData'='file');
MSCK REPAIR TABLE customers_parquet;
SHOW PARTITIONS customers_parquet;
Then select the text MSCK REPAIR TABLE customers_parquet and click Run.
Next, select the text SHOW PARTITIONS customers_parquet and click Run.
Repeat the same steps for the sales_parquet table.
By partitioning your data, you can limit the amount of data each query scans, which improves performance and reduces cost.
You can partition data by any key. A common practice is to partition by date or time, which often leads to a multi-level partitioning scheme.
Athena can use Apache Hive-style partitions, whose data paths contain key-value pairs joined by equals signs (for example, country=us/...
or year=2021/month=01/day=26/...
). The paths therefore include both the names of the partition keys and the value each path represents. To load new Hive partitions into a partitioned table, you can use the MSCK REPAIR TABLE
command, which works only with Hive-style partitions.
Athena can also use non-Hive-style partitioning schemes. For example, CloudTrail logs and Kinesis Data Firehose use separate path components for the date parts, such as data/2021/01/26/us/6fc7845e.json
. For data that is not Hive-compatible, you can add partitions manually with ALTER TABLE ADD PARTITION.
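To make the manual route concrete, here is a hedged sketch of registering one day's worth of data from a non-Hive-style layout like the Firehose example above. The table name `sales_raw` and the bucket path are illustrative placeholders, not part of this lab:

```sql
-- Hypothetical example: register a single non-Hive-style partition by hand.
-- The table must have been created with PARTITIONED BY (year string, month string, day string).
ALTER TABLE sales_raw ADD IF NOT EXISTS
  PARTITION (year = '2021', month = '01', day = '26')
  LOCATION 's3://my-bucket/data/2021/01/26/';
```

Each such statement maps one partition-key tuple to one S3 prefix, so for ongoing ingestion you would typically script this once per new prefix.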
Now that we have created the tables, we will run some queries to compare their performance.
The SQL is as follows:
/* Let's find the top 10 products for Australia from the CSV tables */
select c.country, s.product_id, count(timestamp) as total_transactions
from customers_csv c join sales_csv s on c.customer_id = s.customer_id
where country ='Australia'
group by c.country,s.product_id
order by 1,3 desc
limit 10;
/* Let's find the top 10 products for Australia from the Parquet tables */
select c.country, s.product_id, count(timestamp) as total_transactions
from customers_parquet c join sales_parquet s on c.customer_id = s.customer_id
where country ='Australia'
group by c.country,s.product_id
order by 1,3 desc
limit 10;
/* Let's find the top 10 biggest-spending customers from the sales CSV table */
select customer_id, sum(cast(price as decimal(6,2))) as total_sales
from sales_csv s
group by customer_id
order by 2 desc
limit 10;
/* Let's find the top 10 biggest-spending customers from the sales Parquet table */
select customer_id, sum(cast(price as decimal(6,2))) as total_sales
from sales_parquet s
group by customer_id
order by 2 desc
limit 10;
Run the queries in order by selecting one SQL statement block at a time and clicking the Run button. In the Query results section, note the Run time and Data scanned values.
We can see that querying the Parquet files scans less data than querying the CSV files. The Parquet queries took longer here only because the dataset is so small; when processing files with more columns and rows, Parquet is usually faster:
| Query | Table | Time Taken | Data Scanned |
|---|---|---|---|
| Top Ten Products by transaction count in Australia | customers_csv, sales_csv | 1.74 sec | 22.97 MB |
| Top Ten Products by transaction count in Australia | customers_parquet, sales_parquet | 3.07 sec | 5.35 MB |
| Top Ten Customers by Total Spend | sales_csv | 0.83 sec | 12.46 MB |
| Top Ten Customers by Total Spend | sales_parquet | 2.04 sec | 2.14 MB |
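The "Data Scanned" column translates directly into cost, since Athena bills per amount of data scanned. As a rough sketch, assuming the common US$5-per-TB rate (check your region's pricing; Athena also rounds very small queries up to a 10 MB minimum), you can estimate the difference with a quick query:

```sql
-- Rough cost sketch for the first pair of queries above, assuming $5/TB scanned.
SELECT
  22.97 / 1048576.0 * 5 AS csv_query_cost_usd,      -- roughly $0.00011
  5.35  / 1048576.0 * 5 AS parquet_query_cost_usd,  -- roughly $0.000026
  1 - 5.35 / 22.97      AS scan_reduction;          -- roughly 77% less data scanned
```

The absolute dollar amounts are negligible at this lab's scale, but the ~77% reduction in scanned data carries over proportionally to production-sized datasets.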