[Hadoop/Impala] File Descriptor & Data Block Count Problems

When operating Hadoop, you sometimes run into problems such as the small files problem or a steadily increasing data block count.

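The default HDFS block size is 128 MB, and performance degrades when many files are much smaller than that. Every file occupies at least one block no matter how small it is, so the number of blocks that the NameNode and the ApplicationMaster (AM) have to manage grows, and both HDFS lookups and MapReduce jobs slow down.

One possible fix is to raise the limit (threshold) in Cloudera Manager or in Hadoop itself, but that is probably best kept as a last resort. The approaches below are the ones I found while looking for alternatives.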

1. [Hive] Merging small files during Hive jobs

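When the average output file size of a job is smaller than hive.merge.smallfiles.avgsize, these settings make Hive merge the output into files of up to about 256 MB.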

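-- Merge small files at the end of map-only jobs
SET hive.merge.mapfiles = true;
-- Merge small files at the end of map-reduce jobs
SET hive.merge.mapredfiles = true;
-- Target size of the merged files, in bytes (about 256 MB)
SET hive.merge.size.per.task = 256000000;
-- Trigger the merge when the average output file size is below this (128 MB)
SET hive.merge.smallfiles.avgsize = 134217728;
-- Compress the output with Snappy
SET hive.exec.compress.output = true;
SET parquet.compression = snappy;
-- Rewrite the table onto itself so the merge settings are applied
INSERT OVERWRITE TABLE db_name.table_name SELECT * FROM db_name.table_name;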
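
To check whether the rewrite actually reduced the number of files, one option is to compare the table statistics before and after. This is only a sketch: it assumes Hive has gathered basic statistics for the table, because the numFiles and totalSize values it relies on are only present in that case.

```sql
-- Run in Hive before and after the INSERT OVERWRITE and compare the
-- numFiles and totalSize entries listed under "Table Parameters".
-- Assumption: basic table statistics are being gathered for this table.
DESCRIBE FORMATTED db_name.table_name;
```

If numFiles drops while totalSize stays roughly the same, the merge did its job.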

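2. [Impala] Finer partitions & the /* +SHUFFLE */ hint

1) Split the partitions at a finer grain (check the resulting file sizes afterwards, since partitions that are too fine can themselves produce small files)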

year/month -> year/month/day

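2) Use the /* +SHUFFLE */ hint on INSERT. For a partitioned table, only one node then writes to a given partition at a time, which lowers the total number of simultaneous writes and reduces the number of files.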
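
As an illustration of where the hint goes, here is a minimal sketch of a dynamic-partition insert in Impala. The table names sales_staging and sales_by_day and their columns are hypothetical and not from the original setup. The hint sits between the PARTITION clause and the SELECT keyword.

```sql
-- Hypothetical tables: sales_staging is an unpartitioned source table,
-- sales_by_day is a Parquet table partitioned by (year, month, day).
-- The partition key columns must come last in the select list.
INSERT INTO sales_by_day
  PARTITION (year, month, day)
  /* +SHUFFLE */
SELECT id, amount, year, month, day
FROM sales_staging;
```

With the hint, rows are redistributed by partition key before the write, so each partition ends up being written by a single node instead of by every node that happens to hold matching rows.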

3. set num_nodes=1

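An INSERT or CTAS statement normally creates at least one file on every data node involved. Setting num_nodes=1 makes a single node do the write, which reduces the number of files.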
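
A minimal sketch of how this could look in an impala-shell session. The table names sales_compact and sales_staging are hypothetical. NUM_NODES is a session-level query option, so it is set back to its default of 0 (all nodes) afterwards.

```sql
-- Run the following write on a single node only.
SET NUM_NODES=1;

-- Hypothetical tables; any INSERT or CTAS issued while NUM_NODES=1 is affected.
INSERT OVERWRITE TABLE sales_compact
SELECT * FROM sales_staging;

-- Restore the default so later queries are distributed again.
SET NUM_NODES=0;
```

Because all of the work lands on one host, this is probably only a good fit for modest data volumes. For large tables the single node can become a bottleneck.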

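4. Changing the table compression format

Compression ratios vary a great deal with the data file format, but Parquet is a common baseline because it is widely used by Spark and other platforms. Parquet itself defaults to uncompressed, so you can choose among several compression libraries. Snappy, which was developed by Google, and gzip are the most common choices.

The main difference is that Snappy does not achieve a high compression ratio but compresses and decompresses quickly, so MapReduce jobs tend to run faster. gzip achieves a higher compression ratio, but the MapReduce Total cumulative CPU time is worse than with Snappy. Snappy, gzip, and the like are libraries, so performance depends on which data file format they are combined with and on the characteristics of the data. I recommend testing with your own data. The Reference section links to several articles that compare the two libraries.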
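
One way to run such a test in Impala is to write the same data twice with different codecs and compare the results. This is a rough sketch, and the table names events_raw, events_snappy, and events_gzip are hypothetical. COMPRESSION_CODEC is the Impala query option that controls the codec used when Parquet files are written.

```sql
-- Write one copy of the data with Snappy compression.
SET COMPRESSION_CODEC=snappy;
CREATE TABLE events_snappy STORED AS PARQUET
AS SELECT * FROM events_raw;

-- Write a second copy of the same data with gzip compression.
SET COMPRESSION_CODEC=gzip;
CREATE TABLE events_gzip STORED AS PARQUET
AS SELECT * FROM events_raw;

-- Compare the #Files and Size columns of the two tables.
SHOW TABLE STATS events_snappy;
SHOW TABLE STATS events_gzip;
```

Size is only half of the comparison. Running a few representative queries against both tables gives a feel for the CPU cost of decompression.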

Reference

https://data-flair.training/forums/topic/what-is-small-file-problem-in-hadoop/

https://docs.cloudera.com/documentation/enterprise/6/6.3/topics/impala_parquet.html#parquet_compacting

https://docs.cloudera.com/best-practices/latest/impala-performance/topics/bp-impala-avoiding-small-files.html

(Differences between snappy and gzip / performance tests)

https://wikidocs.net/131503
https://www.adaltas.com/en/2021/03/22/performance-comparison-of-file-formats/
https://docs.cloudera.com/documentation/enterprise/6/6.3/topics/impala_parquet.html#parquet_compression

https://github.com/google/snappy
