Why do computers sometimes change special characters like “&” into “&amp;” or “&nbsp;”


“&amp;” is the HTML entity for an ampersand (“&”). Seeing it on screen is a mistake: the text was shown literally rather than decoded into the proper special character.

You will see this, for example, in URLs, where the & character has a special meaning for the computer. In that context, & is used as a separator between two character strings.

But then how do you have & as part of the string? In HTML, you replace it with &amp; (amp for ampersand, which is the name of the & character). Inside the URL itself, the equivalent replacement is %26.

Same for &nbsp;, which means non-breaking space.
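To make the URL case concrete, here is a minimal sketch using Python's standard library. The parameter names and the example.com address are made up for illustration; the point is only that a literal & inside a value gets escaped one way inside the URL (%26) and the separator & gets escaped another way once the URL is placed in HTML (&amp;).

```python
from html import escape
from urllib.parse import parse_qs, urlencode

# Hypothetical query parameters; the value of "q" contains a literal "&".
params = {"q": "salt & pepper", "page": "2"}

# Inside a URL, "&" separates parameters, so the literal "&" in the value
# is percent-encoded as %26 and cannot be confused with the separator.
query = urlencode(params)
print(query)  # q=salt+%26+pepper&page=2

# Decoding splits on the separator and restores the literal "&".
decoded = parse_qs(query)
print(decoded)  # {'q': ['salt & pepper'], 'page': ['2']}

# When the whole URL is written into an HTML page, the separator "&"
# is itself written as "&amp;".
url = "https://example.com/search?" + query  # made-up address
href = escape(url)
print(href)  # https://example.com/search?q=salt+%26+pepper&amp;page=2
```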

This phenomenon comes from the HTML format used on the World Wide Web (related markup formats such as XML behave the same way). HTML defines a set of reserved characters to express its syntax, to indicate the start of code words. An example of a code word is <em> for the start of *emphasized text*. Other unusual characters are unsafe to use or may hold a special meaning in file systems or other computer languages on the back end of the web server.

To transmit these characters, they are substituted by replacement strings called entities. Every web browser has a list of them and replaces them with the corresponding character before displaying the web page. The mathematical symbols < and > (“less than” and “greater than”) are encoded as &lt; and &gt;. &amp; stands for “ampersand”, and &nbsp; is a non-breaking space, which prevents the text from wrapping to a new line at that point.
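The browser's replacement step can be imitated with Python's `html.unescape`, which carries its own list of entities. This is only a sketch of the idea, not what any particular browser runs; the sample string is made up.

```python
from html import unescape

# Made-up page text that contains entities instead of the raw characters.
source = "1 &lt; 2 &amp;&amp; 3 &gt; 2"

# Each entity is replaced by the character it stands for.
text = unescape(source)
print(text)  # 1 < 2 && 3 > 2

# &nbsp; decodes to U+00A0 (no-break space), not to an ordinary space.
nbsp = unescape("&nbsp;")
print(nbsp == "\u00a0")  # True
print(nbsp == " ")       # False
```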

When you see these special strings, a mistake has happened in the code of the website. It may have converted a character twice: & -> &amp; -> &amp;amp;. When the browser receives the string “&amp;amp;”, it restores the complete entity in the first 5 characters, resulting in “&amp;” being output. Other possibilities are an ampersand encoded with a numerical character code, or an omitted semicolon.
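The double conversion is easy to reproduce. The sketch below uses Python's `html.escape` and `html.unescape` as stand-ins for the website's encoder and the browser's decoder; the sample text is invented. The last two lines show that this particular decoder also accepts a numeric code and a missing semicolon, which other software may not.

```python
from html import escape, unescape

original = "Tom & Jerry"  # made-up sample text

once = escape(original)  # correct: encoded exactly once
twice = escape(once)     # the bug: the "&" of "&amp;" is encoded again
print(once)   # Tom &amp; Jerry
print(twice)  # Tom &amp;amp; Jerry

# The decoder only undoes one layer, like a browser does.
shown_ok = unescape(once)
shown_bug = unescape(twice)
print(shown_ok)   # Tom & Jerry
print(shown_bug)  # Tom &amp; Jerry

# Other spellings of the ampersand that this decoder understands.
from_number = unescape("&#38;")      # numeric character code
from_no_semicolon = unescape("&amp") # semicolon omitted
print(from_number, from_no_semicolon)  # & &
```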

This happens because of a bug in the software being used. The content has been “double escaped”.

In the code that is used to generate webpages (HTML), the & character is used in a special way that tells the browser to render specific characters. `&nbsp;`, for example, is a code that tells the browser to render a space character that cannot be used as a break point for word-wrapping, and several of them in a row are not collapsed into one. Because the & is used for these special codes, if you want to write an actual & character, you need to use a special code for it, which is `&amp;`.

When you see these codes in the text rather than what they are supposed to represent, that is a result of the software processing the text twice. For example, the first time it processes the text it converts & into `&amp;`. The second time, it converts `&amp;` into `&amp;amp;`. Your browser then renders the `&amp;` as & followed by the literal text amp;.

In HTML (the language used to make a website), & has a special meaning. It’s used to signal that you’re about to encode an unusual character (called an HTML entity) such as `&permil;`, which turns into ‰. But what if you want to just show an & without it getting interpreted as having a special meaning? In that case, you need to encode it using & like this: `&amp;` (for ampersand). When you’re writing text which is going to be converted to HTML, all special characters need to be converted into HTML entities in order to be displayed correctly. Unfortunately, sometimes this process happens twice in a row by mistake, and the ampersand that is already part of a proper HTML entity gets converted again. So you end up with & turning into `&amp;`, which turns into `&amp;amp;`, resulting in what you see.

Various computer languages use characters like &, <, >, *, etc. for special purposes. One of those languages is HTML, which is used to make webpages. But what happens if you want to show one of those characters on the web page? The language is also designed with escape sequences, which are an alternate way to tell the web page to display the & (or other) character normally instead of using it as code.

However, if you display that document in something that isn’t designed to use those special codes, it will show you the original text, including escape sequences like &amp;.

It’s encoding. It represents characters that could cause trouble as benign strings.

A good example of this is the < and > characters. Left as they are, they could easily break an HTML web page or XML document. Encoded, they are harmless.

When *we* see them, it’s because decoding failed or the page didn’t load correctly.

You want to write the symbol <, but the browser thinks that you are starting something special, and everything after that character is broken. So you use the special code starting with & for the symbol < (`&lt;`): you are writing < and at the same time telling the browser that you just want the symbol shown to the user, not the special symbol that would break everything.

Now you are seeing <, but in the page source it is really `&lt;`. That’s also the reason why sometimes, when you copy the text, you get the special character code and not what you are reading.
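The difference between < as "something special" and `&lt;` as "just the symbol" can be seen with Python's built-in HTML parser. This is a rough stand-in for a browser, and the `Collector` class and sample text are made up for the illustration.

```python
from html.parser import HTMLParser


class Collector(HTMLParser):
    """Hypothetical helper: records the tags and the visible text it sees."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        # Called when a raw "<...>" is interpreted as markup.
        self.tags.append(tag)

    def handle_data(self, data):
        # Called for visible text; entities are already decoded here.
        self.chunks.append(data)


parser = Collector()
parser.feed("Use <em>this</em> or write &lt;em&gt; literally")
parser.close()

text = "".join(parser.chunks)
print(parser.tags)  # ['em']
print(text)         # Use this or write <em> literally
```

Only the raw `<em>` is treated as a tag; the encoded one ends up in the visible text as plain characters.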

There are certain characters such as / & $ # that are used as special characters in programming languages. Sometimes, when those characters are used in text entered by users, they can have unintended consequences and be interpreted as commands (malicious users can perform what is called an injection attack). Internally, programs mark or re-encode these special characters to distinguish text from actual commands. This is especially true for text-based languages such as HTML. Things do not always work as intended, and these re-encoded characters are sometimes shown to the final user.
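As a small illustration of that re-encoding, here is a sketch in Python. The user input is invented, and real websites usually rely on a template system rather than string concatenation, so treat this as the bare idea only.

```python
from html import escape

# Made-up text typed by a user; it looks like code to a browser.
user_input = '<script>alert("hi")</script>'

# Unsafe: the input is pasted into the page as-is, so the browser
# would treat it as a real <script> element.
unsafe = "<p>" + user_input + "</p>"

# Safer: the special characters are re-encoded first, so the browser
# displays them as text instead of running them.
safe = "<p>" + escape(user_input) + "</p>"
print(safe)  # <p>&lt;script&gt;alert(&quot;hi&quot;)&lt;/script&gt;</p>
```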