
A comma is a comma is a comma… or is it?

When you’re building international websites, there’s always something new to learn, especially if one of the languages your website is available in uses a character set different from anything you’re used to. For jimdo.com, the greatest challenge so far is the Chinese version.

Jimdo lets you define tags for your website. You can separate the tags with whitespace, but it’s also possible to use commas, like this:

tag1, tag2, tag3

Chinese users naturally are way more used to using UTF-8 characters than us Westerners, and, lo and behold, UTF-8 has its own special comma character with integrated whitespace that is quite frequently used by Chinese users:

科学,思考,心情

As we’re using good ol’ regular expressions to split up the tag strings into single tags, one might think, “no problem, I’ll just add another character to the regex pattern”, like so:

$tags = preg_split('/[\s,;:,]+/', $input, -1, PREG_SPLIT_NO_EMPTY);

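And eureka, it works! Or does it? Nope. UTF-8 uses multiple bytes per character, and preg_split, like so many other current PHP functions, assumes one byte = one character, so you may encounter strange side effects. The special comma is three bytes long (0xEF 0xBC 0x8C), and the pattern ends up splitting on each of those bytes separately, including 0xBC, which is also the second byte of “ü” (0xC3 0xBC). Here’s an example using the above pattern on a random string with some German umlauts: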


Splitting up 'Bääh Blöök Dübel' gives:
Array
(
[0] => Bääh
[1] => Blöök
[2] => D�
[3] => bel
)
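
The broken result can be reproduced at the byte level. The following is only a minimal sketch: it writes the special comma (U+FF0C) and “ü” as explicit UTF-8 byte escapes so that it does not depend on the encoding of the source file, and the variable names are made up for illustration.

```php
<?php
// The fullwidth comma (U+FF0C) and "ü" (U+00FC) as explicit UTF-8 bytes.
$fullwidthComma = "\xEF\xBC\x8C";
$uUmlaut        = "\xC3\xBC";

echo bin2hex($fullwidthComma), "\n"; // efbc8c
echo bin2hex($uUmlaut), "\n";        // c3bc

// Without the "u" modifier the character class is a set of single bytes,
// so the 0xBC byte of the comma also matches the second byte of "ü".
$parts = preg_split('/[' . $fullwidthComma . ']+/', 'D' . $uUmlaut . 'bel', -1, PREG_SPLIT_NO_EMPTY);

echo bin2hex($parts[0]), "\n"; // 44c3 ("D" plus an orphaned lead byte)
echo $parts[1], "\n";          // bel
```

The orphaned 0xC3 byte is what shows up as “�” in the output above.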

What to do? Simple: add the Unicode modifier, “u”, to the pattern:

$tags = preg_split('/[\s,;:,]+/u', $input, -1, PREG_SPLIT_NO_EMPTY);

Now preg_split correctly recognizes multibyte characters and yields the expected results:


Splitting up 'Bääh Blöök Dübel' gives:
Array
(
[0] => Bääh
[1] => Blöök
[2] => Dübel
)
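
Wrapped into a small helper, the fixed pattern might look like the sketch below. The function name split_tags is hypothetical (it is not Jimdo’s actual code), and the sketch assumes the input string is UTF-8 encoded. It writes the special comma as the escape \x{FF0C}, so the pattern does not depend on the encoding of the source file, and it checks for the false that preg_split returns when the “u” modifier meets invalid UTF-8.

```php
<?php
/**
 * Split a raw tag string into single tags.
 * Hypothetical helper for illustration; assumes $input is UTF-8 encoded.
 */
function split_tags(string $input): array
{
    // \x{FF0C} is the fullwidth comma; the "u" modifier makes the engine
    // treat pattern and subject as UTF-8 instead of single bytes.
    $tags = preg_split('/[\s,;:\x{FF0C}]+/u', $input, -1, PREG_SPLIT_NO_EMPTY);

    // With "u", preg_split returns false if $input is not valid UTF-8.
    if ($tags === false) {
        throw new InvalidArgumentException('Tag string is not valid UTF-8.');
    }

    return $tags;
}

print_r(split_tags('科学,思考,心情'));
print_r(split_tags('Bääh Blöök Dübel'));
```

Throwing on invalid input is just one possible choice; depending on the application it may be preferable to reject the form submission or to fall back to an empty tag list.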

Another lesson learned.

9 Comments on “A comma is a comma is a comma… or is it?”

  1. Andrei Says:

    First of all, there is no such thing as a “UTF-8 character”. UTF-8 is an encoding, not a character set. It should be “Unicode character”.

    Secondly, an easier way of splitting on punctuation without regard to fullwidth or halfwidth forms is to use character properties:

    $tags = preg_split('/[\s\p{P}]+/u', $input, -1, PREG_SPLIT_NO_EMPTY);

    Here we split either on whitespace or any punctuation (of which the fullwidth comma is one).

  2. Dennis Says:

    So, what’s the problem with commas here? In the second example you are splitting by spaces not commas!

  3. Markus Wolff Says:

    @Andrei: Sorry, I can imagine that by now you must have lost all patience with people confusing character sets with encodings and vice versa - next time I blog about something like this, I’ll pay more attention :-)
    And thanks for the tip regarding the character class operator - didn’t know that one!

    @Dennis: Actually there is no problem with commas as such - the article was meant to express my bewilderment with the fact that there is such a thing as a special unicode comma character with integrated whitespace, which I had never seen before. The rest is more a documentation of the oddities that can happen when you just try to add a unicode character to a regex without making the engine unicode-aware, so that others encountering similar problems can be warned :-)

  4. Dalian Says:

    Useful post. Also note that ‘、’ can be used to separate some lists in Chinese, depending on the type of list. This is Unicode character U+3001 and should be treated as you already discussed with ‘,’.

  9. dengxixian Says:

    this is right!!!!!!!!!!!!!!
