
In more than ten years as a backend developer optimizing MySQL databases, I have watched numerous applications run into performance problems because their developers did not understand primary keys. In this article, I explain the fundamentals and internals of MySQL primary keys that determine how your application performs.
The simple definition of a primary key is one or more columns that uniquely identify each row in a table. But a primary key represents much more than that. A primary key in MySQL is a unique, NOT NULL identifier for each row; a table can have at most one, and under InnoDB it doubles as the clustered index that organizes the table itself.
Creating a primary key therefore does more than constrain your data: it determines how MySQL physically stores and retrieves rows on disk.
The InnoDB storage engine has been the default since MySQL 5.5. It uses the primary key for both logical and physical data organization, a design called a "clustered index", and it is fundamental to understanding MySQL performance.
Within InnoDB, your table data is physically ordered on disk by primary key value. The performance implications of this arrangement are significant.
Consider this example:
CREATE TABLE customers
(
customer_id INT NOT NULL AUTO_INCREMENT,
name VARCHAR(100) NOT NULL,
email VARCHAR(100) NOT NULL,
PRIMARY KEY (customer_id)
);
MySQL places newly inserted rows on disk in customer_id order. This means: inserts with an ascending customer_id append neatly to the end of the index, lookups by customer_id go straight to the row, and range scans over customer_id read contiguous pages.
Let's dive deeper into how MySQL actually stores data.
All indexes in InnoDB, including the primary key, use a B+Tree data structure. A B+Tree is a balanced tree that supports fast searches, sequential access, and efficient insertions and deletions.
For a primary key index, the internal (non-leaf) pages hold key values that guide the search downward, while the leaf pages hold the actual row data. (This is a conceptual description; actual implementation details may vary.)
A primary key access lets MySQL locate a row with minimal disk reads: roughly O(log n) in the number of rows.
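To see why lookups stay cheap, here is a back-of-envelope model of B+Tree height. The figure of 1000 entries per internal page is an illustrative assumption, not an InnoDB constant:

```python
def btree_height(n_rows, fanout):
    """Number of page reads needed to reach a leaf in a B+Tree whose
    internal pages each hold `fanout` child pointers."""
    height, reach = 1, fanout
    while reach < n_rows:
        height += 1
        reach *= fanout
    return height

# Even huge tables stay shallow, which is why primary key lookups
# need only a handful of page reads:
for rows in (1_000, 1_000_000, 1_000_000_000):
    print(rows, btree_height(rows, 1000))  # heights 1, 2, 3
```

A billion-row table is still only about three pages deep, and the top levels are almost always cached in the buffer pool.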
InnoDB stores data in pages, 16 KB each by default. Each page holds multiple rows, depending on row size.
When you define a primary key, MySQL builds the clustered B+Tree keyed on it, enforces uniqueness and NOT NULL, and stores every row inside that tree.
To optimize large tables, it is crucial to understand the storage method that uses pages. If your rows are 4 KB each, you'll have about 4 rows per page. But if your rows are only 100 bytes, you can fit around 160 rows per page, dramatically reducing the number of disk I/O operations needed.
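The rows-per-page arithmetic can be sketched as follows; it ignores page headers and per-row overhead, so real counts are somewhat lower:

```python
import math

PAGE_SIZE = 16 * 1024  # InnoDB's default page size in bytes

def rows_per_page(row_bytes, page_bytes=PAGE_SIZE):
    # Ignores page headers and per-row overhead, so real counts are lower.
    return page_bytes // row_bytes

def pages_for(n_rows, row_bytes):
    # Pages (and thus potential disk reads) needed to hold n_rows.
    return math.ceil(n_rows / rows_per_page(row_bytes))

print(rows_per_page(4 * 1024))  # -> 4 rows per 16 KB page
print(rows_per_page(100))       # -> 163 rows (roughly 160 after overhead)
# For 1 million rows, narrow rows need far fewer pages to scan:
print(pages_for(1_000_000, 4 * 1024), pages_for(1_000_000, 100))
```

Narrow rows pack tens of times more densely, which translates directly into fewer I/O operations for scans.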
Many people mistakenly believe that primary keys are optional elements in MySQL. While you can create a table without explicitly defining a primary key, MySQL's InnoDB engine will create a hidden 6-byte primary key called a "Row ID" for tables without one.
Consider this table:
CREATE TABLE notes
(
content TEXT,
created_at TIMESTAMP
);
When no primary key is defined, InnoDB silently generates this hidden internal key. The approach has major disadvantages: the hidden row ID is invisible to your queries, and it is generated from a global counter shared by all such tables, which is a potential contention point.
Creating tables without primary keys is a mistake I have seen quite a few times. Such tables function at first, but they cause problems as they grow. Every table needs an explicitly defined primary key.
Auto-increment primary keys appear in almost every MySQL database:
CREATE TABLE orders
(
order_id INT NOT NULL AUTO_INCREMENT,
customer_id INT NOT NULL,
order_date DATETIME NOT NULL,
PRIMARY KEY (order_id)
);
The advantages are clear: values are compact (4 or 8 bytes), monotonically increasing (so inserts append to the end of the index), and generated automatically with no application logic.
Yet these keys have several hidden implications that deserve attention.
Auto-increment primary keys can become a bottleneck in systems with high concurrency: because new rows always go to the "end" of the table, that region becomes a hot spot. Before MySQL 8.0, the default auto-increment lock mode could hold a table-level AUTO-INC lock for the duration of certain insert statements, which could cause lock contention. MySQL 8.0's default interleaved lock mode is more efficient, releasing the lock as soon as the value is generated, yet it does not eliminate all contention risks.
Since all new rows are inserted at the rightmost page of the index, that page absorbs the bulk of the write I/O. Under many concurrent writes, page-level contention there degrades performance.
Some developers choose UUIDs as a substitute solution:
CREATE TABLE sessions
(
session_id CHAR(36) NOT NULL,
user_id INT NOT NULL,
created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (session_id)
);
UUIDs spread writes across the index, which reduces contention, but the random insertion points fragment the index and hurt range-scan performance. Secondary indexes also require more space, because each one stores a copy of the large UUID key.
We explored how MySQL's InnoDB storage engine physically organizes data based on primary keys. Now, let's dive deeper into how your primary key choice affects secondary indexes, query performance, and overall database efficiency.
Your primary key design influences every secondary index in the table, making it a major performance factor. Database performance rests fundamentally on this relationship.
InnoDB stores secondary indexes with a different structure from the primary key index. At the leaf level of the primary key index you find the actual row data; secondary indexes instead store the indexed columns along with the row's primary key value. Retrieving full row data through a secondary index therefore requires two separate lookups.
This two-step process is called a "bookmark lookup", and understanding it is essential for performance optimization. Consider this example:
CREATE TABLE products
(
product_id INT NOT NULL AUTO_INCREMENT,
sku VARCHAR(50) NOT NULL,
name VARCHAR(100) NOT NULL,
price DECIMAL(10, 2) NOT NULL,
PRIMARY KEY (product_id),
INDEX idx_sku (sku)
);
When you query by SKU:
SELECT * FROM products WHERE sku = 'ABC123';
MySQL: 1) searches the idx_sku index for 'ABC123'; 2) reads the product_id value stored in that index entry; 3) uses the product_id to locate the entire row in the primary key index.
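The two lookups can be modeled with plain dictionaries. This is a toy illustration with made-up rows, not InnoDB internals:

```python
# Secondary index maps the indexed column to the primary key; the
# clustered index maps the primary key to the full row.
clustered = {  # primary key -> full row, as in the clustered index
    1: {"product_id": 1, "sku": "ABC123", "name": "Widget", "price": 9.99},
    2: {"product_id": 2, "sku": "XYZ789", "name": "Gadget", "price": 19.99},
}
idx_sku = {"ABC123": 1, "XYZ789": 2}  # secondary index: sku -> primary key

def find_by_sku(sku):
    pk = idx_sku[sku]     # step 1: secondary index lookup yields the PK
    return clustered[pk]  # step 2: clustered index lookup yields the row

print(find_by_sku("ABC123")["name"])  # -> Widget
```

Every secondary-index read pays for both steps, which is why the size and shape of the primary key matter so much.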
This is why your primary key choice affects every secondary index: each index entry carries a copy of the primary key value, so a larger key inflates every index and every bookmark lookup.
Let's quantify this with an example. Imagine these two primary key choices:
-- Option 1: INT primary key (4 bytes)
CREATE TABLE orders
(
order_id INT NOT NULL AUTO_INCREMENT,
PRIMARY KEY (order_id),
...
);
-- Option 2: UUID primary key (16 bytes)
CREATE TABLE orders
(
order_id BINARY(16) NOT NULL,
PRIMARY KEY (order_id),
...
);
With 5 secondary indexes and 10 million rows, the 12-byte difference per key copy adds up to roughly 600 MB of extra index storage for the UUID option (12 bytes × 5 indexes × 10 million rows), before page overhead.
Choosing integer primary keys over UUIDs in high-performance systems can therefore significantly shrink the database and speed up read operations.
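As a rough sketch of that arithmetic (ignoring page and record overhead, so real on-disk sizes differ):

```python
def pk_copies_bytes(pk_bytes, n_secondary_indexes, n_rows):
    """Bytes consumed by primary-key copies across all secondary indexes."""
    return pk_bytes * n_secondary_indexes * n_rows

rows, indexes = 10_000_000, 5
int_cost = pk_copies_bytes(4, indexes, rows)    # INT primary key
uuid_cost = pk_copies_bytes(16, indexes, rows)  # BINARY(16) primary key
extra = uuid_cost - int_cost
print(extra)              # -> 600000000 bytes of extra key copies
print(extra / 1024 ** 2)  # roughly 572 MiB of additional index storage
```

The extra bytes are not just disk space: they dilute the buffer pool, so fewer index entries fit in memory.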
A composite primary key consists of multiple columns that together uniquely identify each row. These keys are powerful but require careful design.
CREATE TABLE order_items
(
order_id INT NOT NULL,
line_number INT NOT NULL,
product_id INT NOT NULL,
quantity INT NOT NULL,
price DECIMAL(10, 2) NOT NULL,
PRIMARY KEY (order_id, line_number)
);
The combination of order_id and line_number serves as the primary key in this example. This has several implications: rows are stored physically sorted first by order_id, then by line_number; all line items of one order sit together, so range queries on order_id perform exceptionally well; and identifying a single row requires both order_id and line_number.
Composite primary keys, designed properly, offer built-in ordering benefits. In time-series data, for example:
CREATE TABLE temperature_readings
(
sensor_id INT NOT NULL,
timestamp DATETIME NOT NULL,
temperature DECIMAL(5, 2) NOT NULL,
PRIMARY KEY (sensor_id, timestamp)
);
This structure means: readings from the same sensor are stored contiguously, already sorted by time, so a query for one sensor over a time range becomes a single sequential read.
Column order in a composite primary key is vital. InnoDB follows the "leftmost prefix rule": it can use the index efficiently only for a query's leading columns. Queries that include sensor_id in their filter conditions can use this index; queries on both sensor_id and timestamp can use it fully; but queries that filter only on timestamp values cannot use it and degrade to a scan.
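The leftmost prefix rule can be simulated with a sorted list of key tuples. This is a toy model of a composite index, not InnoDB's actual structure:

```python
from bisect import bisect_left

# Composite index modeled as sorted (sensor_id, timestamp) tuples:
# 3 sensors, 10 readings each.
index = sorted((s, t) for s in range(1, 4) for t in range(100, 110))

def prefix_range(sensor_id):
    """Efficient: binary search on the leftmost column alone."""
    lo = bisect_left(index, (sensor_id,))
    hi = bisect_left(index, (sensor_id + 1,))
    return index[lo:hi]

def by_timestamp_only(ts):
    """Inefficient: timestamp is not a leftmost prefix, so every
    entry must be examined (the equivalent of a full scan)."""
    return [e for e in index if e[1] == ts]

print(len(prefix_range(2)))         # 10 entries via two binary searches
print(len(by_timestamp_only(105)))  # 3 entries, but all 30 were scanned
</imports>```

The same data, the same index: only the column order decides whether a predicate gets a seek or a scan.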
On one system, reorganizing the primary key alone cut query times from seconds down to milliseconds, without modifying any other component.
A fundamental design choice in database development is between surrogate keys (artificial identifiers) and natural keys (real-world identifiers). First, the surrogate-key version:
CREATE TABLE countries
(
country_id INT NOT NULL AUTO_INCREMENT,
country_code CHAR(2) NOT NULL,
name VARCHAR(100) NOT NULL,
PRIMARY KEY (country_id),
UNIQUE INDEX (country_code)
);
Advantages: the key is compact and fixed-width, it never changes when business data changes, and joins on it are fast.
Now the natural-key version:
CREATE TABLE countries
(
country_code CHAR(2) NOT NULL,
name VARCHAR(100) NOT NULL,
PRIMARY KEY (country_code)
);
Advantages: no extra column or index, the key is meaningful to humans, and lookups by country code need no join.
After years of optimizing databases, I follow these guidelines. Use surrogate keys when the natural identifier can change, is wide or composite, or when the table participates in many joins. Consider natural keys when the identifier is short, stable, and universally recognized, such as an ISO country code.
Many MySQL optimization projects have led me to these practical patterns for different situations.
Standard auto-increment keys can bottleneck systems with heavy insert traffic. One option is interleaved sequences:
-- Each application server generates ids from a different offset
-- Server 1 produces: 1, 101, 201, ...
-- Server 2 produces: 2, 102, 202, ...
SET @@auto_increment_increment = 100;
SET @@auto_increment_offset = 1; -- set to 2 on server 2, and so on
INSERT INTO events (event_data) VALUES ('event data');
This avoids id collisions between writers (the classic use is multi-primary replication) while keeping ids roughly sequential.
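The interleaving can be sketched in a few lines. This models the behavior of the auto_increment_offset and auto_increment_increment settings; it is not MySQL code:

```python
from itertools import islice

def id_stream(offset, increment):
    """Ids a server would generate with auto_increment_offset=offset
    and auto_increment_increment=increment."""
    n = offset
    while True:
        yield n
        n += increment

server1 = list(islice(id_stream(1, 100), 5))
server2 = list(islice(id_stream(2, 100), 5))
print(server1)  # -> [1, 101, 201, 301, 401]
print(server2)  # -> [2, 102, 202, 302, 402]
print(set(server1) & set(server2))  # -> set(): the streams never collide
```

With an increment of 100, up to 100 writers can generate ids concurrently without ever producing a duplicate.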
CREATE TABLE events
(
day_partition DATE NOT NULL,
event_id BIGINT NOT NULL AUTO_INCREMENT,
event_data JSON,
PRIMARY KEY (day_partition, event_id),
UNIQUE KEY (event_id)
);
This approach: keeps each day's events clustered together, preserves global uniqueness of event_id through the secondary unique key, and makes old days easy to archive or drop.
In data warehousing and analytics, a well-designed composite key delivers significant performance improvements.
CREATE TABLE user_actions
(
user_id INT NOT NULL,
action_time DATETIME NOT NULL,
action_type TINYINT NOT NULL,
action_data JSON,
PRIMARY KEY (user_id, action_time)
);
Because all of a user's actions are stored together, presorted by time, user-history queries are extremely efficient.
For IoT sensors, monitoring systems, and other time-series workloads:
CREATE TABLE metrics
(
metric_type TINYINT NOT NULL,
device_id INT NOT NULL,
timestamp DATETIME(3) NOT NULL,
value FLOAT NOT NULL,
PRIMARY KEY (metric_type, device_id, timestamp)
);
This design: groups readings by metric type and device, stores each device's series contiguously in time order, and turns per-device time-range scans into sequential reads.
Over the years, I have seen several primary key anti-patterns that consistently cause performance problems.
Random UUIDs cause: random insertion points that trigger page splits and index fragmentation, bloated secondary indexes (each entry carries the large key), and poor buffer pool locality because recent rows are scattered across the whole index.
When using UUIDs, you should select ordered versions such as ULID or UUID v6/v7 because they maintain time-based ordering properties.
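To illustrate why ordered identifiers help, here is a toy ULID-style generator. It is illustrative only; use a maintained ULID or UUIDv7 library in production:

```python
import os
import time

def ordered_id(ts_ms=None):
    """ULID-style id: a 48-bit millisecond timestamp followed by 80
    random bits, hex-encoded. Later ids sort lexicographically later,
    so index inserts land near the end rather than at random points."""
    if ts_ms is None:
        ts_ms = time.time_ns() // 1_000_000
    return f"{ts_ms:012x}" + os.urandom(10).hex()

a = ordered_id(1_700_000_000_000)
b = ordered_id(1_700_000_000_001)  # one millisecond later
print(a < b)  # -> True: time order matches sort (and insertion) order
```

The random suffix still guarantees uniqueness within a millisecond, but the timestamp prefix restores the append-mostly insert pattern that B+Trees love.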
One system I encountered used (client_code, year, document_number) as its primary key for invoices. When a client code changed, cascading updates rewrote millions of rows across multiple tables.
String comparison is slower than numeric comparison. Converting VARCHAR customer codes to INT surrogate keys can dramatically improve join performance.
Every table needs an explicit primary key defined. The InnoDB system generates an invisible primary key when left to its own devices.
Partitioning splits a big table into smaller physical pieces that still behave as one logical table. Your primary key strategy must align exactly with your partitioning strategy.
MySQL requires every unique key, including the primary key, to contain all columns of the partitioning expression, because unique constraints must be verifiable within a single partition.
-- This works because order_date is part of the primary key
CREATE TABLE orders
(
order_id INT NOT NULL,
order_date DATE NOT NULL,
customer_id INT NOT NULL,
amount DECIMAL(10, 2) NOT NULL,
PRIMARY KEY (order_date, order_id)
) PARTITION BY RANGE (TO_DAYS(order_date)) (
PARTITION p_2022_q1 VALUES LESS THAN (TO_DAYS('2022-04-01')),
PARTITION p_2022_q2 VALUES LESS THAN (TO_DAYS('2022-07-01')),
PARTITION p_2022_q3 VALUES LESS THAN (TO_DAYS('2022-10-01')),
PARTITION p_2022_q4 VALUES LESS THAN (TO_DAYS('2023-01-01'))
);
In this design, order_date is part of the primary key and appears in the partitioning expression. During query execution, MySQL uses partition pruning to skip partitions that cannot contain matching rows; a primary key designed around the partitioning column enables efficient pruning.
-- This query can use partition pruning
SELECT *
FROM orders
WHERE order_date BETWEEN '2022-04-01' AND '2022-06-30'
AND customer_id = 1001;
-- This query cannot use partition pruning
SELECT *
FROM orders
WHERE order_id = 5000;
The first query eliminates three partitions from consideration and searches only p_2022_q2. The second query must scan every partition, because order_id alone does not identify which partition holds the data.
Throughout my career, I have seen partition designs that match the primary key and the query patterns take large analytical queries from minutes to sub-second.
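A toy model of the pruning decision, using hypothetical partition bounds that mirror the table above:

```python
import datetime as dt

# Partition bounds mirroring the RANGE partitioning on order_date:
# (name, inclusive start, exclusive end)
PARTITIONS = [
    ("p_2022_q1", dt.date(2022, 1, 1), dt.date(2022, 4, 1)),
    ("p_2022_q2", dt.date(2022, 4, 1), dt.date(2022, 7, 1)),
    ("p_2022_q3", dt.date(2022, 7, 1), dt.date(2022, 10, 1)),
    ("p_2022_q4", dt.date(2022, 10, 1), dt.date(2023, 1, 1)),
]

def prune(date_from, date_to):
    """Partitions that may hold rows with order_date in [date_from, date_to]."""
    return [name for name, start, end in PARTITIONS
            if date_from < end and date_to >= start]

print(prune(dt.date(2022, 4, 1), dt.date(2022, 6, 30)))  # -> ['p_2022_q2']
# A filter on order_id alone supplies no date bounds, so all four
# partitions would have to be scanned.
```

Pruning only works when the predicate constrains the partitioning column, which is exactly why the primary key and the partitioning expression should be designed together.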
Changing a primary key in production is one of the most delicate tasks in database management. Here is a battle-tested, shadow-table approach that minimizes downtime.
1. Create the new table with the desired key structure:
CREATE TABLE products_new
(
product_id INT NOT NULL AUTO_INCREMENT, -- New surrogate key
sku VARCHAR(50) NOT NULL, -- Old primary key
name VARCHAR(255) NOT NULL,
price DECIMAL(10, 2) NOT NULL,
PRIMARY KEY (product_id),
UNIQUE KEY (sku)
);
2. Copy the existing data:
INSERT INTO products_new (sku, name, price)
SELECT sku, name, price FROM products;
3. Keep the tables synchronized with triggers:
DELIMITER //
CREATE TRIGGER products_after_insert
AFTER INSERT
ON products
FOR EACH ROW
BEGIN
INSERT INTO products_new (sku, name, price)
VALUES (NEW.sku, NEW.name, NEW.price);
END //
CREATE TRIGGER products_after_update
AFTER UPDATE
ON products
FOR EACH ROW
BEGIN
UPDATE products_new
SET name = NEW.name,
price = NEW.price
WHERE sku = NEW.sku;
END //
CREATE TRIGGER products_after_delete
AFTER DELETE
ON products
FOR EACH ROW
BEGIN
DELETE
FROM products_new
WHERE sku = OLD.sku;
END //
DELIMITER ;
4. Update application code to use the new table structure
5. Swap the tables, then drop the triggers:
RENAME TABLE products TO products_old, products_new TO products;
DROP TRIGGER products_after_insert;
DROP TRIGGER products_after_update;
DROP TRIGGER products_after_delete;
6. Update dependent foreign keys and adjust application code as needed
This approach enables primary key modifications on active production systems without causing major disruptions.
We have now covered MySQL primary keys from the basics through advanced techniques, and these design principles should guide your decisions.
Throughout my career, time invested in primary key design has paid off substantially as systems have grown. A primary key is a constraint, but it is also the foundation that determines performance, scalability, and maintainability. I hope the principles and techniques in this article help you create primary keys that work well today and accommodate tomorrow's requirements.