How to use Sed together with UTF-8
One of the oldest Unix utilities is the stream editor Sed. It was born in 1973 and is still going strong. If you haven't yet made its acquaintance, I highly recommend that you read Bruce Barnett's Sed -- An Introduction and Tutorial. I use the GNU version of Sed almost every day.
There is just one problem with Sed. It sometimes gets confused by character encodings, in particular UTF-8 and ISO-8859-1.
Problem
If Sed is operating on a character stream with an encoding other than the system's default, it can get confused. A typical situation is that you are using a modern Linux distribution, e.g. Ubuntu, which has UTF-8 as the default character set, and you feed Sed a file encoded in ISO-8859-1 (also known as Latin-1). Let us demonstrate this case.
Create a file example.txt with some Swedish first names encoded in ISO-8859-1:
$ echo -e "Arvid:a\nHåkan:b\nTore:c\nMärta:d\nNils:e\nSören:f" | \
iconv -f utf8 -t latin1 -o example.txt
The resulting file should look as follows on an operating system with UTF-8 as the default character set:
$ cat example.txt
Arvid:a
H�kan:b
Tore:c
M�rta:d
Nils:e
S�ren:f
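If you want to see what is actually stored in the file rather than how the terminal renders it, a hex dump helps. The following is a small sketch and not part of the example above: the file names latin1_sample.txt and utf8_sample.txt are made up for illustration, and the bytes are written with printf octal escapes so that the result does not depend on your terminal's encoding.

```shell
# Write "Håkan:b" in both encodings using explicit octal byte escapes.
# (Hypothetical file names, chosen only for this illustration.)
printf 'H\345kan:b\n' > latin1_sample.txt       # å as the single byte 0xE5
printf 'H\303\245kan:b\n' > utf8_sample.txt     # å as the two bytes 0xC3 0xA5

# Dump the raw bytes in hex: expect e5 in the first dump, c3 a5 in the second.
od -An -tx1 latin1_sample.txt
od -An -tx1 utf8_sample.txt

# Ask iconv to read the Latin-1 file as UTF-8. A lone 0xE5 followed by "k"
# is not a valid UTF-8 sequence, so iconv should exit with a non-zero status.
if iconv -f UTF-8 -t UTF-8 latin1_sample.txt > /dev/null 2>&1; then
    echo "latin1_sample.txt is valid UTF-8"
else
    echo "latin1_sample.txt is not valid UTF-8"
fi
```

The Latin-1 file is one byte shorter than the UTF-8 one, and that single byte is what the terminal shows as a replacement character.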
Now, let Sed replace any characters up to and including a colon (:) with a dash (-):
$ sed -r 's/.*:/-/' < example.txt
Since a dot (.) matches any character, we expect the following output:
-a
-b
-c
-d
-e
-f
But the actual output is:
-a
H�-b
-c
M�-d
-e
S�-f
This is because each non-ASCII character in the file is stored as a single byte that is not valid UTF-8 on its own, so Sed sees an invalid sequence instead of a character. In Sed, such a byte does not match a dot, so the match can only start after it. So what is the solution? There are actually two of them.
Solution 1: iconv
The obvious solution is to convert the input file to the system's default character set before feeding it to Sed, and then convert it back again.
$ iconv -f latin1 -t utf8 example.txt | sed -r 's/.*:/-/' | iconv -f utf8 -t latin1
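If you need this round trip often, it can be wrapped in a small shell function. This is only a sketch: the name sed_latin1 is my own invention, it assumes the input really is ISO-8859-1, and it uses sed -E, which GNU Sed treats the same as -r.

```shell
# Hypothetical helper: apply a Sed script to Latin-1 input by converting
# to UTF-8, running Sed, and converting the result back to Latin-1.
# Usage: sed_latin1 SCRIPT < input > output
sed_latin1() {
    iconv -f ISO-8859-1 -t UTF-8 |
        sed -E "$1" |
        iconv -f UTF-8 -t ISO-8859-1
}

# Example: a Latin-1 line written with an explicit octal byte for å (0xE5).
printf 'H\345kan:b\n' | sed_latin1 's/.*:/-/'    # prints: -b
```

With the file from the article, the call would be sed_latin1 's/.*:/-/' < example.txt.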
Solution 2: perl
Another solution is to use Perl, which handles this situation much better. The syntax is almost identical. Just replace sed -r with perl -pe.
$ perl -pe 's/.*:/-/' < example.txt
Personally, I prefer this solution.