Recently I ran into a bug: I tried to save a UTF-8 string through Rails into a MariaDB column encoded as "utf8", and got a strange error:
Incorrect string value: '\xF0\x9F\x98\x83 <…' for column 'summary' at row 1
My client is UTF-8, my server is UTF-8, and my database is UTF-8. Even the string I was trying to save, "😃 <…", is valid UTF-8.
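A minimal way to reproduce this (the table and column names here are made up to mirror the error above; the failure assumes strict SQL mode):

    -- MySQL's "utf8" allows at most three bytes per character
    CREATE TABLE articles (
      summary VARCHAR(255)
    ) CHARACTER SET utf8;

    -- "😃" is encoded in UTF-8 as four bytes (F0 9F 98 83),
    -- one byte more than MySQL's "utf8" accepts:
    INSERT INTO articles (summary) VALUES ('😃');
    -- ERROR 1366 (HY000): Incorrect string value: '\xF0\x9F\x98\x83' for column 'summary' at row 1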
The crux of the problem is that MySQL's "utf8" is not really UTF-8.
"Utf8" only supports up to three bytes per character, while the real UTF-8 is up to four bytes per character.
MySQL never fixed this bug. Instead, in 2010 they released a character set called "utf8mb4" that works around the problem.
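You can see the difference directly in MySQL or MariaDB; the Maxlen column is the maximum number of bytes per character (output abbreviated, and the details vary by version):

    SHOW CHARACTER SET LIKE 'utf8%';
    -- +---------+---------------+--------------------+--------+
    -- | Charset | Description   | Default collation  | Maxlen |
    -- +---------+---------------+--------------------+--------+
    -- | utf8    | UTF-8 Unicode | utf8_general_ci    |      3 |
    -- | utf8mb4 | UTF-8 Unicode | utf8mb4_general_ci |      4 |
    -- +---------+---------------+--------------------+--------+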
Of course, they never advertised the new character set (probably because the bug embarrassed them), so advice all over the web still tells developers to use "utf8". That advice is wrong.
The brief summary: every MySQL and MariaDB user currently using "utf8" should switch to "utf8mb4".
What is an encoding? What is UTF-8?
As we all know, computers store text as zeros and ones. The character "C", for example, is saved as "01000011". To display this character, the computer goes through two steps:
The computer reads "01000011" and gets the number 67, because 67 is encoded as "01000011".
The computer looks up 67 in the Unicode character set and finds "C".
Similarly, in the other direction:
My computer maps "C" to 67 in the Unicode character set.
My computer encodes 67 as "01000011" and sends it to the web server.
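Both directions of this mapping can be sketched with MySQL's own string functions (a small demonstration of the idea, not part of the original bug report):

    SELECT ORD('C');               -- 67: "C" maps to code point 67
    SELECT LPAD(BIN(67), 8, '0');  -- '01000011': 67 encoded as eight bits
    SELECT CHAR(67 USING ascii);   -- 'C': 67 decoded back to the character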
Almost every web application uses the Unicode character set, because there is no reason to use anything else.
The Unicode character set contains more than a million code points. The simplest encoding is UTF-32, which uses 32 bits for every character. That is the easiest scheme to implement, because computers have always treated 32-bit groups as numbers, and processing numbers is what computers do best. The problem is that it wastes a lot of space.
UTF-8 saves space. In UTF-8, the character "C" needs only 8 bits, while some uncommon characters, such as "😃", need 32 bits. Other characters use 16 or 24 bits. An article like this one, encoded as UTF-8, takes up only about a quarter of the space of the same article in UTF-32.
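You can see the variable width for yourself: in MySQL, LENGTH() counts bytes while CHAR_LENGTH() counts characters. A quick sketch (assuming a UTF-8 connection, so the literals arrive intact):

    SELECT LENGTH(CONVERT('C'  USING utf8mb4));   -- 1 byte  (8 bits)
    SELECT LENGTH(CONVERT('é'  USING utf8mb4));   -- 2 bytes (16 bits)
    SELECT LENGTH(CONVERT('€'  USING utf8mb4));   -- 3 bytes (24 bits)
    SELECT LENGTH(CONVERT('😃' USING utf8mb4));   -- 4 bytes (32 bits)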
MySQL's "utf8" character set is not compatible with other programs. Its so-called "" may really be a bunch of
Why did MySQL's developers cripple "utf8"? We can look for the answer in the commit logs.
MySQL has supported UTF-8 since version 4.1 in 2003, before today's UTF-8 standard (RFC 3629) was published.
The older UTF-8 standard (RFC 2279) allowed up to six bytes per character. On March 28, 2002, MySQL's developers implemented RFC 2279 for the first MySQL 4.1 preview.
Then, in September of the same year, they adjusted the MySQL source code, noting: "UTF8 now only supports 3 byte sequences at most".
Who committed that change, and why? There is no way to know. After the migration to Git (MySQL started out on BitKeeper), many committer names in the MySQL code base were lost, and nothing on the September 2002 mailing list explains the change.
But I can try to guess.
In 2002, MySQL made a decision: if users could guarantee that every row of a table used the same number of bytes, MySQL could get a big performance boost. To do that, users needed to declare their text columns as "CHAR". A "CHAR" column always holds the same number of characters: if you insert fewer characters than the declared number, MySQL pads the value with trailing spaces; if you insert more, it truncates the excess.
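Here is a sketch of that fixed-width behavior (the table name is hypothetical, and the truncation behavior depends on the SQL mode):

    CREATE TABLE fixed_demo (code CHAR(3)) CHARACTER SET utf8;

    INSERT INTO fixed_demo VALUES ('ab');      -- stored padded, as 'ab '
    -- With strict SQL mode off, a longer value is silently truncated:
    -- INSERT INTO fixed_demo VALUES ('abcd'); -- stored as 'abc', with a warning
    -- (with strict mode on, the same INSERT is rejected with an error)

    SELECT CONCAT('[', code, ']') FROM fixed_demo;  -- '[ab]': pad spaces are stripped on read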
When MySQL's developers first tried UTF-8, with six bytes per character, a CHAR(1) column reserved 6 bytes, a CHAR(2) column 12 bytes, and so on.
It should be said that their initial behavior was correct, but that version was never released. It was described in the documentation, though, and the documentation spread widely; everyone who knew UTF-8 agreed with what it said.
It seems clear, though, that the MySQL developers (or the company behind MySQL) worried that users would do these two things:
Define columns as CHAR (the CHAR format is a relic nowadays, but back then it made MySQL faster; it has not done so since 2005).
Set those CHAR columns to use the "utf8" encoding.
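In DDL, that combination looks something like this (the table and column names are mine):

    -- Under six-byte UTF-8, this column would reserve 12 bytes per row;
    -- capping "utf8" at three bytes per character cuts that to 6.
    CREATE TABLE lookup (code CHAR(2) CHARACTER SET utf8);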
My guess is that the MySQL developers wanted to help users who wanted both space savings and speed, but they botched the "utf8" encoding in the process.
So nobody won. Users who wanted space and speed got "utf8" CHAR columns that were actually bigger and slower than expected. And users who wanted correctness could not save characters like "😃" at all when using the "utf8" encoding.
Once the broken character set was published, MySQL could not fix it: doing so would force every user to rebuild every database. So in 2010, MySQL released "utf8mb4", which supports real UTF-8.
Why is this so maddening?
This problem drove me crazy for a whole week. I was fooled by "utf8", and it took me a long time to find the bug. But I can't be the only one: almost every article on the web treats "utf8" as the real UTF-8.
"Utf8" is only a proprietary character set. It has brought us new problems, but it has not been solved.
If you are using MySQL or MariaDB, do not use the "utf8" encoding; use "utf8mb4" instead. Here is a guide for converting the character encoding of an existing database from "utf8" to "utf8mb4":
https://mathiasbynens.be/notes/mysql-utf8mb4#utf8-to-utf8mb4
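The core of the conversion is a pair of statements like these (the database and table names are placeholders; see the guide above for caveats such as index length limits and connection settings):

    ALTER DATABASE my_app
      CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
    ALTER TABLE articles
      CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;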