When you’re building international websites, there’s always something new to learn, especially if one of the languages your website is available in uses a character set different from anything you’re used to. For jimdo.com, the greatest challenge so far is the Chinese version.
Jimdo lets you define tags for your website. You can separate the tags with whitespace, but it’s also possible to use commas, like this:
tag1, tag2, tag3
Chinese users are naturally far more used to UTF-8 characters than us Westerners, and, lo and behold, UTF-8 has its own special comma character with integrated whitespace, which Chinese users use quite frequently:
科学,思考,心情
As we’re using good ol’ regular expressions to split up the tag strings into single tags, one might think, “no problem, I’ll just add another character to the regex pattern”, like so:
$tags = preg_split("/[\s,;:,]+/", $input, -1, PREG_SPLIT_NO_EMPTY);
And eureka, it works! Or does it? Nope. UTF-8 encodes many characters as multiple bytes, and preg_split, like so many other current PHP functions, treats one byte as one character, so you may encounter strange side effects. Here’s an example using the above pattern on a random string with some German umlauts:
Splitting up 'Bääh Blöök Dübel', becomes:
Array
(
[0] => Bääh
[1] => Blöök
[2] => D�
[3] => bel
)
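The stray bytes have a simple explanation, which the following sketch makes visible (assuming the script itself is saved as UTF-8). The fullwidth comma is encoded as the three bytes EF BC 8C, and “ü” as the two bytes C3 BC. Without Unicode awareness the character class is just a set of single bytes, so the shared byte BC is treated as a separator right in the middle of the “ü”:

```php
<?php
// Assumption: this source file is saved as UTF-8.
$comma  = ',';  // U+FF0C FULLWIDTH COMMA
$umlaut = 'ü';   // U+00FC

echo bin2hex($comma), "\n";   // efbc8c
echo bin2hex($umlaut), "\n";  // c3bc

// No "u" modifier: the class contains the bytes EF, BC and 8C,
// and BC also happens to be the second byte of "ü".
$parts = preg_split("/[\s,;:,]+/", 'Dübel', -1, PREG_SPLIT_NO_EMPTY);

echo bin2hex($parts[0]), "\n";  // 44c3 ("D" plus half of "ü")
echo $parts[1], "\n";           // bel
```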
What to do? Simple: add the Unicode modifier, “u”, to the pattern:
$tags = preg_split("/[\s,;:,]+/u", $input, -1, PREG_SPLIT_NO_EMPTY);
Now preg_split correctly recognizes multibyte characters and yields the expected results:
Splitting up 'Bääh Blöök Dübel', becomes:
Array
(
[0] => Bääh
[1] => Blöök
[2] => Dübel
)
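For reference, here is a minimal sketch of how the whole thing could be wrapped up. The function name split_tags is made up for this example and is not Jimdo’s actual code, and it assumes both the input and the source file are UTF-8. One thing to keep in mind: with the “u” modifier, preg_split returns false when the input is not valid UTF-8, so the sketch falls back to an empty array in that case.

```php
<?php
/**
 * Hypothetical helper: split a raw tag string into individual tags.
 * Assumes $input is UTF-8 encoded.
 */
function split_tags(string $input): array
{
    // "u" makes PCRE read pattern and subject as UTF-8, so the
    // fullwidth comma (U+FF0C) counts as one character, not three bytes.
    $tags = preg_split('/[\s,;:,]+/u', $input, -1, PREG_SPLIT_NO_EMPTY);

    // preg_split() returns false on failure, e.g. for invalid UTF-8 input.
    return $tags === false ? [] : $tags;
}

print_r(split_tags('tag1, tag2, tag3'));  // tag1, tag2, tag3
print_r(split_tags('科学,思考,心情'));    // 科学, 思考, 心情
print_r(split_tags('Bääh Blöök Dübel'));  // Bääh, Blöök, Dübel
```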
Another lesson learned.


September 19th, 2007 at 18:30
First of all, there is no such thing as a “UTF-8 character”. UTF-8 is an encoding, not a character set. It should be “Unicode character”.
Secondly, an easier way of splitting on punctuation without regard to fullwidth or halfwidth forms is to use character properties:
$tags = preg_split("/[\s\p{P}]+/u", $input, -1, PREG_SPLIT_NO_EMPTY);
Here we split either on whitespace or any punctuation (of which the fullwidth comma is one).
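A quick sketch of this, assuming UTF-8 input and a UTF-8 source file (the variable names are only for illustration). Keep in mind that \p{P} matches every punctuation character, so tags containing hyphens or apostrophes get split as well:

```php
<?php
$pattern = '/[\s\p{P}]+/u';

// The fullwidth comma (U+FF0C) belongs to the punctuation category.
$cjk = preg_split($pattern, '科学,思考,心情', -1, PREG_SPLIT_NO_EMPTY);
print_r($cjk);         // 科学, 思考, 心情

// Caveat: the hyphen is punctuation too, so "hip-hop" is split in two.
$hyphenated = preg_split($pattern, 'hip-hop, jazz', -1, PREG_SPLIT_NO_EMPTY);
print_r($hyphenated);  // hip, hop, jazz
```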
September 20th, 2007 at 16:43
So, what’s the problem with commas here? In the second example you are splitting by spaces not commas!
September 20th, 2007 at 17:18
@Andrei: Sorry, I can imagine that by now you must have lost all patience with people confusing character sets with encodings and vice versa - next time I blog about something like this, I’ll pay more attention :-)
And thanks for the tip regarding the character class operator - didn’t know that one!
@Dennis: Actually there is no problem with commas as such - the article was meant to express my bewilderment with the fact that there is such a thing as a special unicode comma character with integrated whitespace, which I had never seen before. The rest is more a documentation of the oddities that can happen when you just try to add a unicode character to a regex without making the engine unicode-aware, so that others encountering similar problems can be warned :-)
September 20th, 2009 at 07:21
Useful post. Also note that ‘、’ can be used to separate some lists in Chinese, depending on the type of list. This is Unicode character U+3001 and should be treated as you already discussed with ‘,’.