Filesystem encoding and PHP

Note: This article was originally published at Planet PHP on 2 January 2011.

Many PHP applications save files to a local filesystem. Most of the time, for the bulk of readers here, you'll likely only ever store files with US-ASCII filenames, either because your filenames are simply based on database fields (as you should try in most cases), or simply because most of your users never need non-English characters.

When you do, though, it's important to know how operating systems cope with these characters. Unsurprisingly, all of them do this differently.

To illustrate the differences, I'm going to do some tests on Ubuntu, OS X 10.6.3, and Windows XP and 7.

Linux

In Linux, filenames are binary. Linux does not care what encoding your filenames use, and it will accept any byte besides 0x00 and the path separator /. This means filenames can contain newlines (\n), tabs (\t) or even a bell (ASCII code 07).

To illustrate this, I'm going to make a tiny file using a PHP script:

  <?php
  // \x07 is the bell control character, embedded directly in the filename
  file_put_contents("saved by the \x07.txt", "contents");
  ?>

After running this I simply get a question mark when viewing the file using 'ls', but when I auto-complete it, it expands to ^G (which is bell). In Nautilus, this is displayed:

If I run this script:

  <?php
  // List every file in the current directory whose name starts with "saved"
  print_r(glob('saved*'));
  ?>

The output is simply missing my bell character, and I get a short beep.

This doesn't mean it's a good idea to do this. Even though the underlying filesystem is binary-safe, applications that list filenames still have to decide on an encoding to display the characters to the user. You can't even show this character in a PHP page, and firewalls might even block it if you used it in a URL.
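As a rough sketch of what "deciding how to display" such a name could look like, the snippet below replaces control bytes with a visible escape before printing. The helper name `visible_filename` is hypothetical and not part of the original article; it only illustrates one possible approach.

```php
<?php
// Hypothetical helper (an assumption, not from the original article):
// replace ASCII control bytes in a raw filename with a visible \xNN escape
// so the name can be printed without ringing the bell or breaking layout.
function visible_filename(string $name): string
{
    return preg_replace_callback(
        '/[\x00-\x1F\x7F]/',                 // ASCII control bytes, matched bytewise
        function (array $m): string {
            return sprintf('\\x%02X', ord($m[0]));
        },
        $name
    );
}

echo visible_filename("saved by the \x07.txt"), "\n";
// prints: saved by the \x07.txt
```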

This also applies to the applications on your Linux machine. Most of them, such as Gnome Terminal and Nautilus, default to UTF-8. However, I believe PuTTY defaulted to ISO-8859-1 (latin1) for the longest time. A symptom of this is that non-ASCII characters look different when you read them in PuTTY vs. Nautilus.
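The mismatch can be simulated in PHP. The sketch below takes the UTF-8 bytes of a filename and interprets them the way a latin1 client would. It assumes the mbstring extension is available, and the filename is only an example.

```php
<?php
// UTF-8 bytes for a filename containing U+00FC, as a UTF-8 application stores it.
$utf8Name = "test_\xC3\xBC.txt";

// Simulate a client that assumes ISO-8859-1: treat each byte as one latin1
// character, then re-encode as UTF-8 so the result can be printed.
// Assumption: the mbstring extension is loaded.
$asLatin1 = mb_convert_encoding($utf8Name, 'UTF-8', 'ISO-8859-1');

echo $asLatin1, "\n";          // two characters (U+00C3, U+00BC) appear instead of one
echo bin2hex($asLatin1), "\n"; // hex dump of the misread name
```

The single character stored as two bytes shows up as two separate latin1 characters, which is the kind of difference you would see between the two programs.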

The other thing I wanted to test on Linux is how it behaves if I create a file in the file manager using a special character. For this example I'm using ü, because it's a bit ambiguous: there are multiple ways to encode it using Unicode (more on this later), and it also appears in ISO-8859-1.

Back to the test. I'm now creating a new file from the Nautilus interface, and want to see how it shows up for PHP. I'm creating a file called test_ü.txt and listing it with the following script:

  <?php
  // Take the first match and URL-encode it to reveal the raw bytes of the name
  list($file) = glob('test_*');
  echo urlencode($file) . "\n";
  ?>

Output:

  test_%C3%BC.txt

%C3%BC is the UTF-8 encoding of codepoint U+00FC, which is the most common way to encode ü. Great!
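For comparison, here is a small sketch that URL-encodes three byte sequences that can all end up representing ü (U+00FC) in a filename: the precomposed UTF-8 form seen above, the decomposed UTF-8 form (a plain u followed by the combining diaeresis U+0308), and the single-byte ISO-8859-1 form. The labels and the array name are only illustrative.

```php
<?php
// Three byte sequences that a user would all read as the same character.
$forms = [
    'UTF-8, precomposed U+00FC'   => "\xC3\xBC",
    'UTF-8, u + combining U+0308' => "u\xCC\x88",
    'ISO-8859-1'                  => "\xFC",
];

foreach ($forms as $label => $bytes) {
    // urlencode() works on raw bytes, so it exposes exactly what is stored.
    printf("%-28s %s\n", $label, urlencode($bytes));
}
// prints %C3%BC, u%CC%88 and %FC respectively
```

Since Linux compares filenames byte by byte, these would be three different files there.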

The last test is to create this file using ISO-8859-1/latin1 encoding. The latin1 representa

Truncated by Planet PHP, read more at the original (another 25001 bytes)