XML Sitemap Files in UTF-8 or ASCII Character Format
The sitemaps protocol requires that XML sitemap files are UTF-8 encoded and that the URLs in them contain no characters outside the ASCII range.
Some UTF-8 files start with a so-called BOM (byte order mark) that identifies the file as a Unicode UTF-8 document.
The BOM is not required for XML or UTF-8 documents. It helps many Unicode-aware tools detect the encoding and handle the text correctly, although parsers that only expect ASCII may choke on it.
The BOM for UTF-8 looks like this in hexadecimal: $EF $BB $BF.
To view the BOM in XML document files such as sitemaps, you will need to use a tool such as a hex editor.
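If no hex editor is at hand, the first three bytes of a file can also be checked with a few lines of code. The following is a minimal Python sketch, not part of the sitemap generator; the function names `has_utf8_bom` and `strip_utf8_bom` are made up for illustration.

```python
import codecs

def has_utf8_bom(data: bytes) -> bool:
    """Return True if the data starts with the UTF-8 BOM (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)

def strip_utf8_bom(data: bytes) -> bytes:
    """Return the data without a leading UTF-8 BOM, if one is present."""
    if has_utf8_bom(data):
        return data[len(codecs.BOM_UTF8):]
    return data

# Hypothetical sample: the start of a sitemap file that was saved with a BOM.
sample = codecs.BOM_UTF8 + b'<?xml version="1.0" encoding="UTF-8"?>'

# Show the first three bytes in hexadecimal, as a hex editor would.
print(sample[:3].hex(" ").upper())
print(has_utf8_bom(sample))
```

For a real file, the bytes would be read with `open(path, "rb").read()` so that no decoding happens before the check.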
You can configure how the sitemap generator software creates XML sitemaps.
In Create sitemap | Document options | Character set and type you will find these options:
- Always save sitemap files as UTF-8.
- Save UTF-8 sitemap files with BOM.
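What the BOM option changes in the saved file can be illustrated outside the program. This Python sketch assumes nothing about how the sitemap generator works internally; `encode_sitemap` and its `with_bom` parameter are hypothetical names that only mirror the option above. Python's `utf-8-sig` codec writes the BOM, while plain `utf-8` does not.

```python
def encode_sitemap(xml_text: str, with_bom: bool) -> bytes:
    """Encode sitemap text as UTF-8, optionally prefixed with the BOM."""
    # "utf-8-sig" prepends EF BB BF; "utf-8" writes the text as-is.
    encoding = "utf-8-sig" if with_bom else "utf-8"
    return xml_text.encode(encoding)

# Hypothetical minimal sitemap header used as sample text.
xml_text = '<?xml version="1.0" encoding="UTF-8"?>\n'

without_bom = encode_sitemap(xml_text, with_bom=False)
with_bom = encode_sitemap(xml_text, with_bom=True)

# The only difference between the two results is the 3-byte BOM prefix.
print(len(with_bom) - len(without_bom))
```

Apart from the three leading bytes, both variants contain exactly the same content.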
The sitemaps protocol requires all non-ASCII characters in URLs to be URL encoded, even though the XML sitemap file itself is defined as UTF-8.
That is not a problem, as ASCII is a subset of UTF-8: a file that contains only ASCII characters is also a valid UTF-8 file.
To read more, check our article about XML sitemaps URL encoding.
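As a rough illustration of URL encoding, the Python sketch below percent-encodes the non-ASCII characters in the path of a URL using the UTF-8 bytes of each character. It is a simplified example, not the sitemap generator's implementation: `escape_url_path` is a hypothetical helper, the example URL is made up, and host names, query strings and XML entity escaping (such as `&amp;`) are deliberately left out.

```python
from urllib.parse import quote, urlsplit, urlunsplit

def escape_url_path(url: str) -> str:
    """Percent-encode non-ASCII characters in the path part of a URL."""
    parts = urlsplit(url)
    # quote() encodes the text as UTF-8 and escapes each unsafe byte as %XX.
    # "/" stays unescaped so the path structure is preserved.
    path = quote(parts.path, safe="/")
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))

escaped = escape_url_path("http://www.example.com/ümlat.html")
print(escaped)            # http://www.example.com/%C3%BCmlat.html
print(escaped.isascii())  # True
```

The result contains only ASCII characters, so it is stored byte for byte the same whether the file is treated as ASCII or as UTF-8.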