Creating and Querying Tables

We will use the following datasets in the Athena basics lab.

  1. Customers – a set of synthetic customer records (~96,000 rows)
  2. Sales – a set of synthetic sales records for those customers (~287,000 rows)

Create Tables and Run Queries

Let's start by creating the following four tables:

  1. sales_csv
  2. customers_csv
  3. customers_parquet
  4. sales_parquet

The four corresponding SQL statements are already stored under Saved queries; open them and run each one.

The SQL statements that create the CSV-backed tables are as follows:

CREATE EXTERNAL TABLE customers_csv (
  card_id bigint,
  customer_id bigint,
  lastname string,
  firstname string,
  email string,
  address string,
  birthday string,
  country string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://athena-workshop-145197526627/basics/csv/customers/'
TBLPROPERTIES (
'areColumnsQuoted'='false',
'classification'='csv',
'columnsOrdered'='true',
'compressionType'='none',
'delimiter'=',',
'skip.header.line.count'='1',
'typeOfData'='file');

CREATE EXTERNAL TABLE sales_csv(
card_id bigint,
customer_id bigint,
price string,
product_id bigint,
timestamp string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://athena-workshop-145197526627/basics/csv/sales/'
TBLPROPERTIES (
'areColumnsQuoted'='false',
'classification'='csv',
'columnsOrdered'='true',
'compressionType'='none',
'delimiter'=',',
'skip.header.line.count'='1',
'typeOfData'='file'); 
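
Before moving on, a quick row count is an easy sanity check; the totals should roughly match the dataset sizes listed at the top of this lab (this check is an addition of ours, not one of the saved queries):

-- Verify that the CSV tables read the data as expected
SELECT count(*) AS customer_rows FROM customers_csv;  -- expect roughly 96,000
SELECT count(*) AS sales_rows FROM sales_csv;         -- expect roughly 287,000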

The saved query that creates the Parquet tables contains three SQL statements, so you must first select only the CREATE EXTERNAL TABLE statement and then run it:

CREATE EXTERNAL TABLE `customers_parquet`(
card_id bigint,
customer_id bigint,
lastname string,
firstname string,
email string,
address string,
birthday string)
PARTITIONED BY (
`country` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3://athena-workshop-145197526627/basics/parquet/customers/'
TBLPROPERTIES (
'classification'='parquet',
'compressionType'='none',
'partition_filtering.enabled'='true',
'typeOfData'='file');

MSCK REPAIR TABLE customers_parquet;

SHOW PARTITIONS customers_parquet;

Then select the text MSCK REPAIR TABLE customers_parquet and click Run.

Select the text SHOW PARTITIONS customers_parquet and click Run.

Repeat the same steps for the sales_parquet table.
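
Once the partitions are loaded, filtering on the partition key lets Athena prune partitions so that only the matching country=.../ prefix in S3 is scanned. A minimal check you can run (our addition, reusing the country value that appears in the comparison queries later in this lab):

-- Only the country=Australia partition should be scanned
SELECT country, count(*) AS customers
FROM customers_parquet
WHERE country = 'Australia'
GROUP BY country;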

Data Partitioning in Athena

By partitioning your data, you can limit the amount of data each query scans, which improves performance and reduces cost.

You can partition data by any key. A common practice is to partition by date or time, which often leads to a multi-level partitioning scheme.

Athena can use Apache Hive style partitions, whose data paths contain key-value pairs joined by equals signs (for example, country=us/... or year=2021/month=01/day=26/...). The paths therefore include both the names of the partition keys and the values each path represents. To load new Hive partitions into a partitioned table, you can use the MSCK REPAIR TABLE command, which works only with Hive-style partitions.
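
As an illustration (hypothetical bucket, table, and column names, not part of this lab), a multi-level date scheme maps directly onto the S3 layout, and a single MSCK REPAIR TABLE call discovers all of it:

-- Hypothetical Hive-style layout:
--   s3://my-bucket/logs/year=2021/month=01/day=26/part-0000.parquet
--   s3://my-bucket/logs/year=2021/month=01/day=27/part-0000.parquet
CREATE EXTERNAL TABLE logs_parquet (
  message string)
PARTITIONED BY (year string, month string, day string)
STORED AS PARQUET
LOCATION 's3://my-bucket/logs/';

MSCK REPAIR TABLE logs_parquet;  -- registers every year=/month=/day= path under LOCATION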

Athena can also use non-Hive style partitioning schemes. For example, CloudTrail logs and Kinesis Data Firehose use separate path components for the date, such as data/2021/01/26/us/6fc7845e.json. For data that is not Hive compatible, you can add partitions manually with ALTER TABLE ADD PARTITION.
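
Because such paths carry no key=value pairs, each partition has to be mapped to its location explicitly. A sketch, assuming a hypothetical logs_json table partitioned by year, month, day, and country:

-- Map one non-Hive path to a partition by hand
ALTER TABLE logs_json ADD IF NOT EXISTS
PARTITION (year = '2021', month = '01', day = '26', country = 'us')
LOCATION 's3://my-bucket/data/2021/01/26/us/';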

Comparing Performance Between Tables

Now that we have created the tables, we will run some queries to compare their performance:

  1. Click the Saved Queries tab
  2. Click the query named Athena_Compare_Sales.

The SQL is as follows:

/* Let's find the top 10 products for Australia from the CSV tables */
select c.country, s.product_id, count(timestamp) as total_transactions
from customers_csv c join sales_csv s on c.customer_id = s.customer_id
where country ='Australia'
group by c.country,s.product_id
order by 1,3 desc
limit 10;

/* Let's find the top 10 products for Australia from the Parquet tables */
select c.country, s.product_id, count(timestamp) as total_transactions
from customers_parquet c join sales_parquet s on c.customer_id = s.customer_id
where country ='Australia'
group by c.country,s.product_id
order by 1,3 desc
limit 10;

/* Let's find the top 10 biggest-spending customers from the sales CSV table */
select customer_id, sum(cast(price as decimal(6,2))) as total_sales
from sales_csv s 
group by customer_id
order by 2 desc
limit 10;

/* Let's find the top 10 biggest-spending customers from the sales Parquet table */
select customer_id, sum(cast(price as decimal(6,2))) as total_sales
from sales_parquet s 
group by customer_id
order by 2 desc
limit 10;

Run the queries in order by selecting one SQL statement block at a time and clicking the Run button. In the Query results section, note the Run time and Data scanned values for each query.

We can see that querying the Parquet files scans much less data than querying the CSV files. The Parquet queries take longer here only because the dataset is so small; when processing files with more columns and rows, Parquet is usually faster as well:

| Query | Table | Time Taken | Data Scanned |
| --- | --- | --- | --- |
| Top Ten Products by transaction count in Australia | customers_csv, sales_csv | 1.74 sec | 22.97 MB |
| Top Ten Products by transaction count in Australia | customers_parquet, sales_parquet | 3.07 sec | 5.35 MB |
| Top Ten Customers By Total Spend | sales_csv | 0.83 sec | 12.46 MB |
| Top Ten Customers By Total Spend | sales_parquet | 2.04 sec | 2.14 MB |
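
If you want these Parquet savings for a dataset you only have as CSV, Athena's CTAS statement can rewrite one table in another format. A minimal sketch (the target table name and S3 output location are placeholders of ours, not part of this lab):

-- Rewrite the CSV sales data as Parquet via CTAS
CREATE TABLE sales_parquet_ctas
WITH (
  format = 'PARQUET',
  external_location = 's3://my-bucket/ctas/sales_parquet/'  -- hypothetical output location
) AS
SELECT * FROM sales_csv;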