A small tip
If you’re using Kafka to manage your data, it’s important to keep track of the disk usage of your topics to ensure that you have enough storage space. One way to estimate the disk usage of Kafka topics is by using the following command:
du -scb /var/lib/kafka/data/* | sed 's/.\/var\/lib\/kafka\// /g' | awk -F'-' '{print $1}' | awk '{print $2, $1}' | awk '{arr[$1]+=$2} END {for (i in arr) {printf("%s\t%.2f\n"),i,arr[i]/1024/1024/1024.0}}' | sort -k 2 -g -r
This command uses several Unix utilities to estimate the disk usage of Kafka topics and sort them based on their total disk usage, from highest to lowest. Here’s what each part of the command does:
du -scb /var/lib/kafka/data/*
: Reports the size of each entry in the Kafka data directory, which gives one line per partition directory. The -s flag summarizes each entry, -c appends a grand total line, and -b prints apparent sizes in bytes.
sed 's/.\/var\/lib\/kafka\// /g'
: Replaces the tab separator and the /var/lib/kafka/ prefix with a single space to shorten the output. The data/ component of the path is left in place, so each name is printed as data/<topic>.
awk -F'-' '{print $1}'
: Cuts each line at the first dash to drop the partition number from the directory name. Topic names that themselves contain a dash are also cut at that dash, so topics sharing the same text before their first dash are merged into one row.
awk '{print $2, $1}'
: Swaps the two fields so that the topic name comes first and the size second, ready for the next step.
awk '{arr[$1]+=$2} END {for (i in arr) {printf("%s\t%.2f\n"),i,arr[i]/1024/1024/1024.0}}'
: Sums the sizes of all partitions of each topic and prints each total in gibibytes (bytes divided by 1024 three times), with two decimal places. The grand total from du -c shows up here as a row named total.
sort -k 2 -g -r
: Sorts the output on the second field (total size) in descending numeric order.
By using this command, you can get a better understanding of how much disk space your Kafka topics are using and take steps to manage your data storage more effectively.
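Because the one-liner splits names on the first dash, a topic such as orders-events is reported under the name up to that dash. The following is a minimal sketch of a variant that strips only the trailing partition number instead. It assumes partition directories are named <topic>-<partition>, and it skips every other entry in the data directory. The function name topic_usage and the variable KAFKA_DATA_DIR are made up for this illustration and are not part of Kafka.

```shell
#!/usr/bin/env bash
# Sketch: turn `du -sb`-style lines ("<bytes><TAB><path>") read from stdin
# into per-topic totals in GiB, largest first.
topic_usage() {
  awk -F'\t' '
    {
      path = $2
      sub(/\/+$/, "", path)               # tolerate a trailing slash
      n = split(path, parts, "/")
      name = parts[n]                     # last path component only
      if (name !~ /-[0-9]+$/) next        # skip anything not named <topic>-<partition>
      sub(/-[0-9]+$/, "", name)           # drop only the partition suffix
      bytes[name] += $1
    }
    END {
      for (t in bytes) printf "%s\t%.2f\n", t, bytes[t] / 1024 / 1024 / 1024
    }
  ' | sort -t "$(printf '\t')" -k2,2 -n -r
}

# Hypothetical usage; KAFKA_DATA_DIR is an assumed variable that falls back
# to the path used in this post:
#   du -sb "${KAFKA_DATA_DIR:-/var/lib/kafka/data}"/* | topic_usage
```

As in the original command, -b measures apparent size. With GNU du, replacing -b with -B1 reports allocated blocks instead, which may be closer to what the filesystem actually charges when files are sparse.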