Vocademy

Peculiar Characters in Web (or other) Pages

Your web page looks fine, except that three funny characters appear somewhere on the page:

ï»¿

What's with that?

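These three characters are a UTF-8 byte order mark (BOM): the three bytes EF BB BF, displayed as if they were ordinary text. In UTF-16 and UTF-32 files, the mark tells the system reading the file which byte of a Unicode character is the low-order byte. In UTF-8 the byte order is fixed, so the mark only identifies the file as UTF-8. The mark belongs at the very start of a file. There, the system (operating system, web browser, etc.) reads and processes the mark but does not show it on the screen.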
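As a rough illustration of where the three characters come from, the following sketch prints the bytes of the BOM and then shows what they look like when each byte is treated as its own character. It assumes the page is being decoded as ISO-8859-1 or Windows-1252 (an assumption about the browser, not something established above), and the helper name latin1_bytes_as_utf8 is made up for this example.

```php
<?php
// The BOM is the code point U+FEFF. Encoded as UTF-8 it becomes three bytes.
$bom = "\u{FEFF}";          // the \u{...} escape needs PHP 7.0 or later
echo bin2hex($bom), "\n";   // prints efbbbf

// Hypothetical helper: treat every byte as one ISO-8859-1 character and
// re-encode it as UTF-8, so the result can be printed on a UTF-8 page.
function latin1_bytes_as_utf8(string $bytes): string
{
    $out = '';
    foreach (str_split($bytes) as $byte) {
        $code = ord($byte);
        $out .= $code < 0x80
            ? $byte                                                  // plain ASCII stays as it is
            : chr(0xC0 | ($code >> 6)) . chr(0x80 | ($code & 0x3F)); // two-byte UTF-8 sequence
    }
    return $out;
}

echo latin1_bytes_as_utf8($bom), "\n"; // prints ï»¿
```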

One place a problem occurs is when a PHP script includes an HTML file with the include statement. The included file's BOM then arrives in the middle of the page sent to the browser, so the browser displays it as characters instead of processing it. This can be hard to track down if you don't already know the cause, because HTML editors and text editors process the mark and do not show it. You will need a hex editor (an editor that displays and edits a file byte by byte) to see the mark. (I had to use escape sequences to print the characters above because my web page editor won't show them even if they are in the middle of a page.)
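If no hex editor is at hand, PHP itself can show the first bytes of a file. The following is only a minimal sketch: the helper name leading_bytes_hex is invented for this example, and the demonstration runs on a throwaway temporary file so that nothing on your site is touched.

```php
<?php
// Hypothetical helper: return the first $count bytes of a file as hex digits.
function leading_bytes_hex(string $path, int $count = 3): string
{
    $handle = fopen($path, 'rb');
    $bytes = fread($handle, $count);   // returns fewer bytes if the file is shorter
    fclose($handle);
    return bin2hex((string) $bytes);
}

// Demonstration on a temporary file that starts with a BOM.
$demoFile = tempnam(sys_get_temp_dir(), 'bom');
file_put_contents($demoFile, "\xEF\xBB\xBF<p>Hello</p>");
echo leading_bytes_hex($demoFile), "\n"; // prints efbbbf
unlink($demoFile);
```

A file that starts with efbbbf has the BOM. A file that starts with ordinary HTML shows something else, such as 3c for the < character.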

How do you get rid of it? Your HTML or text editor may have the option to exclude the BOM when saving a file. But what if you already have a bunch of files with a problematic BOM on your server? Must you load every one into the editor and resave it?

Here is a simple PHP script that strips the BOM from every .html file in the specified directory.

foreach (glob("<complete path to directory>/*.html") as $FileName) {

  $handle = fopen("$FileName", "r");
  $FileContents = fread($handle, filesize("$FileName"));
  fclose($handle);

  print "$FileName - ";

  if (substr($FileContents, 0, 3) === "\xEF\xBB\xBF") { // the three BOM bytes
    print "Y<br>";
    $FileContents = substr($FileContents, 3);

    $handle = fopen("$FileName", "w");
    fwrite($handle, $FileContents);
    fclose($handle);
  } else {
    print "N<br>";
  }
}

The script prints each file's path and name and indicates with a Y or N whether the BOM existed (and was removed).

How the script works

The first line sets up a loop that steps through every file in the directory whose name ends with .html.[1]

foreach (glob("<complete path to directory>/*.html") as $FileName) {

The "<complete path to directory>" parameter must be the full (absolute) path to the directory on your server, so it is unique to your server. It will look something like the following.

/home/fred/public_html/fredssite.com

To process another directory, change the path, save the script, and upload it again.
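One way to avoid editing and uploading the script for every directory is to list several directories in one place and gather the matching files from all of them. The sketch below is only an illustration of that idea: the helper name html_files_in is invented, and both paths are placeholders (the second one is hypothetical) that you would replace with directories on your own server.

```php
<?php
// Hypothetical helper: collect the .html files from several directories.
function html_files_in(array $directories): array
{
    $files = [];
    foreach ($directories as $directory) {
        // glob() can return false on error, so fall back to an empty list.
        $matches = glob(rtrim($directory, '/') . '/*.html') ?: [];
        $files = array_merge($files, $matches);
    }
    return $files;
}

// Placeholder paths; replace them with directories on your own server.
$directories = [
    '/home/fred/public_html/fredssite.com',
    '/home/fred/public_html/fredssite.com/articles',
];

foreach (html_files_in($directories) as $FileName) {
    print "$FileName<br>";
}
```

The loop here only prints the file names. The body of the original loop could be placed inside it unchanged, since it uses the same $FileName variable.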

The glob function returns the list of matching HTML files. On each pass through the loop, the complete path and file name of the next one are put into the $FileName variable.

The next three lines open the current file in read-only mode, then put the file contents into the $FileContents variable.

  $handle = fopen("$FileName", "r");
  $FileContents = fread($handle, filesize("$FileName"));
  fclose($handle);

The following line prints the path and file name followed by a space, a hyphen, and another space, ready for the Y or N that shows whether the BOM was found in the file.

print "$FileName - ";

Next, the script sets up an if construct that compares the first three bytes of the file with the BOM characters.

if (substr($FileContents, 0, 3) === "\xEF\xBB\xBF") { // the three BOM bytes

If the BOM is found, the next part of the script prints a Y on the screen, strips the first three bytes from the contents of $FileContents, then writes the contents back to the original file.

  print "Y<br>";
  $FileContents = substr($FileContents, 3);

  $handle = fopen("$FileName", "w");
  fwrite($handle, $FileContents);
  fclose($handle);

If the BOM is not found, the script prints N.

That's it.
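The same steps can also be wrapped in a small function. The following is a sketch, not a replacement for the script above: the function name strip_bom_from_file is invented for this example, and it swaps fopen, fread, and fwrite for file_get_contents and file_put_contents. One practical difference is that it never calls fread with a length of zero, which is what the original would do on an empty file. The demonstration works on a temporary file only.

```php
<?php
// Hypothetical helper: remove a leading UTF-8 BOM from one file.
// Returns true if a BOM was found and removed, false otherwise.
function strip_bom_from_file(string $path): bool
{
    $contents = file_get_contents($path);
    if ($contents === false || strncmp($contents, "\xEF\xBB\xBF", 3) !== 0) {
        return false;   // unreadable, empty, or no BOM at the start
    }
    // Write back everything after the first three bytes.
    return file_put_contents($path, (string) substr($contents, 3)) !== false;
}

// Demonstration on a temporary file that starts with a BOM.
$sample = tempnam(sys_get_temp_dir(), 'bom');
file_put_contents($sample, "\xEF\xBB\xBF<p>Hello</p>");
print strip_bom_from_file($sample) ? "Y\n" : "N\n";   // prints Y
print strip_bom_from_file($sample) ? "Y\n" : "N\n";   // prints N, the BOM is already gone
unlink($sample);
```

To use it on real files, call it inside the foreach loop over glob and print Y or N from its return value, as the original script does.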

Prevent saving the BOM

In Microsoft Expression Web you can prevent saving the BOM by going to File, Properties, Language and unchecking the 'Include byte order mark (BOM)' checkbox.

—————————
[1] The glob function specifically searches file names for matches,