How to use Sed together with UTF-8

on Tue, 2010-07-27 01:56

One of the oldest Unix utilities is the stream editor Sed. It was born in 1973 and is still going strong. If you haven't yet made its acquaintance, I highly recommend that you read Bruce Barnett's Sed -- An Introduction and Tutorial. I use the GNU version of Sed almost every day.

There is just one problem with Sed. It sometimes gets confused by character encodings, in particular UTF-8 and ISO-8859-1.

Problem

If Sed operates on a character stream whose encoding differs from the system's default, it can misinterpret the input. A typical situation is that you are using a modern Linux distribution, e.g. Ubuntu, which has UTF-8 as the default character set, and you feed Sed a file encoded in ISO-8859-1 (also known as Latin-1). Let us demonstrate this case.

Create a file example.txt with some Swedish first names encoded in ISO-8859-1:

$ echo -e "Arvid:a\nHåkan:b\nTore:c\nMärta:d\nNils:e\nSören:f" | \
iconv -f utf8 -t latin1 -o example.txt

The resulting file should look as follows on an operating system with UTF-8 as default character set:

$ cat example.txt
Arvid:a
H�kan:b
Tore:c
M�rta:d
Nils:e
S�ren:f

Now, let Sed replace any characters up to and including a colon (:) with a dash (-):

$ sed -r 's/.*:/-/' < example.txt

Since a dot (.) matches any character, we expect the following output:

-a
-b
-c
-d
-e
-f

But the actual output becomes:

-a
H�-b
-c
M�-d
-e
S�-f

This happens because each non-ASCII byte is read as part of a UTF-8 multibyte sequence, and here those sequences are invalid. In Sed, a dot doesn't match an invalid sequence, so .* can only begin matching after the offending byte. So what is the solution? There are actually two of them.
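Before picking a fix, it can help to confirm that the file really is invalid as UTF-8. The following is a minimal sketch, assuming an iconv that exits with a non-zero status on an illegal input sequence (GNU iconv does). The file name sample-latin1.txt and the variable verdict are made up for illustration.

```shell
# Write one Latin-1 line byte-for-byte; \345 is the octal code for å in ISO-8859-1.
printf 'H\345kan:b\n' > sample-latin1.txt

# Ask iconv to read the file as UTF-8. The conversion target is irrelevant here;
# only the exit status matters, since invalid input makes iconv fail.
if iconv -f UTF-8 -t UTF-8 sample-latin1.txt > /dev/null 2>&1; then
  verdict="valid UTF-8"
else
  verdict="not valid UTF-8"
fi

echo "$verdict"
```

If the check reports invalid UTF-8 on a system whose default is UTF-8, the file is most likely in a legacy encoding such as Latin-1, and Sed will trip over it as shown above.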

Solution 1: iconv

The obvious solution is to convert the input file to the system's default character set before feeding it to Sed, and then convert it back again.

$ iconv -f latin1 -t utf8 example.txt | sed -r 's/.*:/-/' | iconv -f utf8 -t latin1
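The pipeline above prints its result to the terminal. Here is a minimal sketch of the same round trip that saves the result to a file instead. The output name result.txt is hypothetical, and the test file is rebuilt with printf octal escapes so that the example doesn't depend on the terminal's encoding.

```shell
# Rebuild the Latin-1 test file byte-for-byte.
# In ISO-8859-1, \345 is å, \344 is ä and \366 is ö.
printf 'Arvid:a\nH\345kan:b\nTore:c\nM\344rta:d\nNils:e\nS\366ren:f\n' > example.txt

# Convert to UTF-8, let Sed do the edit, then convert back to Latin-1.
# result.txt is a hypothetical output name.
iconv -f ISO-8859-1 -t UTF-8 example.txt \
  | sed 's/.*:/-/' \
  | iconv -f UTF-8 -t ISO-8859-1 > result.txt

cat result.txt
# -a
# -b
# -c
# -d
# -e
# -f
```

Writing to a separate file rather than back to example.txt avoids truncating the input while the pipeline is still reading it.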

Solution 2: perl

Another solution is to use Perl, which handles this situation much better. The syntax is almost identical. Just replace sed -r with perl -pe.

$ perl -pe 's/.*:/-/' < example.txt

Personally, I prefer this solution.
