Creating and Querying Tables

We will use the following datasets in the Athena basics lab.

  1. Customers – a set of synthetic customer records (~96,000 rows)
  2. Sales – a set of synthetic sales records for those customers (~287,000 rows)

Create Tables and Run Queries

Let's start by creating the following four tables:

  1. sales_csv
  2. customers_csv
  3. customers_parquet
  4. sales_parquet

The four corresponding SQL statements are already stored under Saved queries; open them and run each one.

The SQL statements that create the CSV-backed tables are as follows:

CREATE EXTERNAL TABLE customers_csv (
  card_id bigint,
  customer_id bigint,
  lastname string,
  firstname string,
  email string,
  address string,
  birthday string,
  country string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://athena-workshop-145197526627/basics/csv/customers/'
TBLPROPERTIES (
'areColumnsQuoted'='false',
'classification'='csv',
'columnsOrdered'='true',
'compressionType'='none',
'delimiter'=',',
'skip.header.line.count'='1',
'typeOfData'='file');

CREATE EXTERNAL TABLE sales_csv(
card_id bigint,
customer_id bigint,
price string,
product_id bigint,
timestamp string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://athena-workshop-145197526627/basics/csv/sales/'
TBLPROPERTIES (
'areColumnsQuoted'='false',
'classification'='csv',
'columnsOrdered'='true',
'compressionType'='none',
'delimiter'=',',
'skip.header.line.count'='1',
'typeOfData'='file'); 
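
Before moving on, a quick row count is an easy sanity check; the totals should roughly match the dataset sizes listed at the top of this lab (this check is an addition of ours, not one of the saved queries):

-- Verify that the CSV tables read the data as expected
SELECT count(*) AS customer_rows FROM customers_csv;  -- expect roughly 96,000
SELECT count(*) AS sales_rows FROM sales_csv;         -- expect roughly 287,000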

The saved query that creates the Parquet tables contains three SQL statements, so you must first select only the CREATE EXTERNAL TABLE statement and then run it:

CREATE EXTERNAL TABLE `customers_parquet`(
card_id bigint,
customer_id bigint,
lastname string,
firstname string,
email string,
address string,
birthday string)
PARTITIONED BY (
`country` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3://athena-workshop-145197526627/basics/parquet/customers/'
TBLPROPERTIES (
'classification'='parquet',
'compressionType'='none',
'partition_filtering.enabled'='true',
'typeOfData'='file');

MSCK REPAIR TABLE customers_parquet;

SHOW PARTITIONS customers_parquet;

Then select the text MSCK REPAIR TABLE customers_parquet and click Run.

Select the text SHOW PARTITIONS customers_parquet and click Run.

Repeat the same steps for the sales_parquet table.
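
Once the partitions are loaded, filtering on the partition key lets Athena prune partitions so that only the matching country=.../ prefix in S3 is scanned. A minimal check you can run (our addition, reusing the country value that appears in the comparison queries later in this lab):

-- Only the country=Australia partition should be scanned
SELECT country, count(*) AS customers
FROM customers_parquet
WHERE country = 'Australia'
GROUP BY country;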

Data Partitioning in Athena

By partitioning your data, you can limit the amount of data each query scans, which improves performance and reduces cost.

You can partition data by any key. A common practice is to partition by date or time, which often leads to a multi-level partitioning scheme.

Athena can use Apache Hive style partitions, whose data paths contain key-value pairs joined by equals signs (for example, country=us/... or year=2021/month=01/day=26/...). The paths therefore include both the names of the partition keys and the values each path represents. To load new Hive partitions into a partitioned table, you can use the MSCK REPAIR TABLE command, which works only with Hive-style partitions.
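
As an illustration (hypothetical bucket, table, and column names, not part of this lab), a multi-level date scheme maps directly onto the S3 layout, and a single MSCK REPAIR TABLE call discovers all of it:

-- Hypothetical Hive-style layout:
--   s3://my-bucket/logs/year=2021/month=01/day=26/part-0000.parquet
--   s3://my-bucket/logs/year=2021/month=01/day=27/part-0000.parquet
CREATE EXTERNAL TABLE logs_parquet (
  message string)
PARTITIONED BY (year string, month string, day string)
STORED AS PARQUET
LOCATION 's3://my-bucket/logs/';

MSCK REPAIR TABLE logs_parquet;  -- registers every year=/month=/day= path under LOCATION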

Athena can also use non-Hive style partitioning schemes. For example, CloudTrail logs and Kinesis Data Firehose use separate path components for the date, such as data/2021/01/26/us/6fc7845e.json. For data that is not Hive compatible, you can add partitions manually with ALTER TABLE ADD PARTITION.
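
Because such paths carry no key=value pairs, each partition has to be mapped to its location explicitly. A sketch, assuming a hypothetical logs_json table partitioned by year, month, day, and country:

-- Map one non-Hive path to a partition by hand
ALTER TABLE logs_json ADD IF NOT EXISTS
PARTITION (year = '2021', month = '01', day = '26', country = 'us')
LOCATION 's3://my-bucket/data/2021/01/26/us/';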

Comparing Performance Between Tables

Now that we have created the tables, we will run some queries to compare their performance:

  1. Click the Saved Queries tab
  2. Click the query named Athena_Compare_Sales.

The SQL is as follows:

/* Let's find the top 10 products for Australia from the CSV tables */
select c.country, s.product_id, count(timestamp) as total_transactions
from customers_csv c join sales_csv s on c.customer_id = s.customer_id
where country ='Australia'
group by c.country,s.product_id
order by 1,3 desc
limit 10;

/* Let's find the top 10 products for Australia from the Parquet tables */
select c.country, s.product_id, count(timestamp) as total_transactions
from customers_parquet c join sales_parquet s on c.customer_id = s.customer_id
where country ='Australia'
group by c.country,s.product_id
order by 1,3 desc
limit 10;

/* Let's find the top 10 biggest-spending customers from the sales CSV table */
select customer_id, sum(cast(price as decimal(6,2))) as total_sales
from sales_csv s 
group by customer_id
order by 2 desc
limit 10;

/* Let's find the top 10 biggest-spending customers from the sales Parquet table */
select customer_id, sum(cast(price as decimal(6,2))) as total_sales
from sales_parquet s 
group by customer_id
order by 2 desc
limit 10;

Run the queries in order by selecting one SQL statement block at a time and clicking the Run button. In the Query results section, note the Run time and Data scanned values for each query.

We can see that querying the Parquet files scans much less data than querying the CSV files. The Parquet queries take longer here only because the dataset is so small; when processing files with more columns and rows, Parquet is usually faster as well:

| Query | Table | Time Taken | Data Scanned |
| --- | --- | --- | --- |
| Top Ten Products by transaction count in Australia | customers_csv, sales_csv | 1.74 sec | 22.97 MB |
| Top Ten Products by transaction count in Australia | customers_parquet, sales_parquet | 3.07 sec | 5.35 MB |
| Top Ten Customers By Total Spend | sales_csv | 0.83 sec | 12.46 MB |
| Top Ten Customers By Total Spend | sales_parquet | 2.04 sec | 2.14 MB |
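
If you want these Parquet savings for a dataset you only have as CSV, Athena's CTAS statement can rewrite one table in another format. A minimal sketch (the target table name and S3 output location are placeholders of ours, not part of this lab):

-- Rewrite the CSV sales data as Parquet via CTAS
CREATE TABLE sales_parquet_ctas
WITH (
  format = 'PARQUET',
  external_location = 's3://my-bucket/ctas/sales_parquet/'  -- hypothetical output location
) AS
SELECT * FROM sales_csv;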