To help the system process the partitioned file as efficiently as possible, consider the following tips when setting up or using a partitioning key:
Gender is an example of a poor choice for a partitioning key because only two choices exist, male or female. With only two distinct key values, the data can be distributed to at most two nodes instead of being spread across all of the nodes. A query that uses gender as the partitioning key also forces the system to process too many records. Another field or combination of fields can narrow the scope of the query and make it much more efficient. A partitioning key based on gender is a poor choice whenever even distribution of data is wanted rather than distribution based on specific values.
When preparing to change a local file into a distributed file, you can use the HASH function to get an idea of how the data would be distributed. Because the HASH function can be used against local files and with any combination of columns, you can try different partitioning keys before actually changing the file to be distributed. For example, if you plan to use the ZIP code field of a file, you can run the HASH function against that field to see how many records hash to each partition number. This helps you choose your partitioning key fields or create the partition map in your node groups.
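The idea of trying out candidate keys before distributing a file can be illustrated with a minimal Python sketch. It does not use the system's HASH function: zlib.crc32 stands in for it, and the partition count of 16, the field names gender and zip, and the generated sample records are all assumptions made for illustration only.

```python
import zlib
from collections import Counter

# Assumed partition count for this sketch; the real number is set by the system.
NUM_PARTITIONS = 16


def partition_number(values, num_partitions=NUM_PARTITIONS):
    """Map a tuple of key values to a partition number.

    zlib.crc32 is only a stand-in for the system's HASH function.
    """
    key = "\x1f".join(str(v) for v in values).encode("utf-8")
    return zlib.crc32(key) % num_partitions


def preview_distribution(records, key_fields, num_partitions=NUM_PARTITIONS):
    """Count how many records would land in each partition for a candidate key."""
    return Counter(
        partition_number([rec[field] for field in key_fields], num_partitions)
        for rec in records
    )


# Hypothetical sample data: 10,000 records with a two-value field and a ZIP code.
records = [
    {"gender": "F" if i % 2 else "M", "zip": f"{55000 + i % 500:05d}"}
    for i in range(10000)
]

by_gender = preview_distribution(records, ["gender"])
by_zip = preview_distribution(records, ["zip"])

for name, counts in (("gender", by_gender), ("zip", by_zip)):
    print(f"{name}: {len(counts)} partitions used, "
          f"largest partition holds {max(counts.values())} records")
```

A key with only two distinct values can never occupy more than two partitions, no matter how many are available, whereas the ZIP code key spreads the same records over many more partitions. Comparing the counts for several candidate keys in this way mirrors what the HASH function lets you do against the local file itself.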