
Processing Unicode Data in Python - A Primer to Understand Non-English Data Processing

JamesWarner    2019-04-10

Introduction:

We live in a world where people of diverse cultures and backgrounds use electronic devices to express their ideas, do the daily work that earns them their bread, and entertain themselves with content created in their own languages. To make all of this happen, any computational device, be it a laptop, a desktop computer, a smartphone, or something else, must be able to serve these needs in a manner that is transparent to the end user. That means our programs must be capable of handling natural languages from around the entire world. Creating and maintaining a framework that performs this mammoth task is critical, so standards bodies have come up with sets of standards that bind these diverse languages (and their smallest units, the characters that constitute them) in a way that is easy to code with.

Some Concepts

What is Unicode?

Every computation we perform using a computer, be it programming, writing a blog or article, or manipulating data in some way, involves characters. So we may safely say that a character is the smallest component of any computation. However, as mentioned above, a character can belong to any natural language, and hence we need a way to represent it. This is where Unicode comes into the picture. What Unicode does is assign a "code point" to every known character in the world's prominent writing systems (for example, the English alphabet has 26 letters, and Unicode assigns each of them a distinct code point; the same goes for the characters of French, Chinese, Hindi, Sanskrit, Afrikaans, and so on). A "code point" is basically a number in the range 0 to 10ffff (hex, so strictly it should be written as 0x10ffff; since we will be using hex representations of characters by default, we will drop the '0x' part in most cases, and if we use any other base we will specifically mention it in context). Each number in this range maps to a specific character in some character set. Hence we sometimes loosely specify a character by its code point, but that isn't technically correct: the number represents the character, but it isn't the character itself. This is an important point to wrap your mind around, since things tend to get confusing if we forget the distinction. A character on a screen is actually drawn using graphical elements called 'glyphs'. So the character 'V' is drawn using two diagonal lines that meet at a point on the screen. That has nothing to do with Unicode code points, except that the character being displayed is represented internally as a numeric code point for ease of use.
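As a quick illustration at the Python 2 interactive prompt (the '\u' escape names a character by its hex code point):

>>> u'\u0056'   # code point 56 (hex) maps to the character 'V'
u'V'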

Encoding: The code points we mentioned in the previous section need to be represented as a series of bytes, because ultimately any computational device uses bytes to retain and exchange information. So how do we turn code points into bytes? Remember, a Unicode string is nothing but a series of code points. Translating that series into a sequence of bytes is a technique known as 'encoding'. Now, we could simply represent each code point as a 4-byte integer, but there are some inherent problems with that approach.

Firstly, representing every code point in four bytes is a sheer waste of space: most positions will contain zeros, and the strings will hog a whole lot of network bandwidth if we ever want to transfer them to another system. Secondly, it has portability issues, as some computers are "big-endian" while others are "little-endian" (if those terms are unfamiliar, you may simply consider them two conventional byte orderings used by different processors; plenty of references explain them in depth). Thirdly, many international standards for textual data cannot handle zero bytes embedded in text. There are a host of other issues with this approach, but the above are the major ones.
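To make the waste concrete, here is the four-bytes-per-code-point approach at the Python 2 prompt (the 'utf-32-be' codec is exactly this scheme, in big-endian byte order):

>>> u'Va'.encode('utf-32-be')   # four bytes per code point, mostly zeros
'\x00\x00\x00V\x00\x00\x00a'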

So, where do we go from here?

As it turns out, this is not the only encoding scheme we have at hand. (Thanks to all the people who have toiled hard to give us such goodies without their names being featured anywhere. That is the spirit of the internet and of open source technologies in general, and the current policies of titanic corporations like YouTube and Microsoft are eroding it by trying to duct-tape alternative ideas. This is quite beside the point of this article, but it is always good to know what large corporations can do to the freedom of expression of ordinary people. Please note that this is my personal take on the matter and the publisher is in no way involved in it.) Anyway, back to our discussion: several other encoding schemes are in wide use, such as UTF-8, UTF-16, Latin-1 (ISO-8859-1), and EBCDIC. Of these, UTF-8 is perhaps the most widely used. Python 2 natively uses the 'ascii' codec by default, and hence it only supports characters with code points from 0 to 127. Decoding any byte above 127 raises an exception with an error message along the lines of "'ascii' codec can't decode byte ...". You can try decoding such a byte from Python's interactive prompt, and that will show you the entire error statement.
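For instance:

>>> '\x85'.decode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0x85 in position 0: ordinal not in range(128)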

With UTF-8, python provides great support for most characters. The rules governing the usage of UTF-8 are as follows:

  • If the character's code point is less than 128, the corresponding byte value is used.
  • If the code point is between 128 (decimal) and 7ff (hex), the character is stored in 2 bytes, each with a value between 128 and 255.
  • If the code point is greater than 7ff (hex), the character is stored in 3 or 4 bytes, depending on the character, again with each byte between 128 and 255.
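These rules are easy to verify at the prompt by encoding one character from each range:

>>> u'V'.encode('utf-8')        # code point 56 (hex) is below 128: one byte
'V'
>>> u'\u00e9'.encode('utf-8')   # e9 lies between 128 and 7ff: two bytes
'\xc3\xa9'
>>> u'\u20ac'.encode('utf-8')   # 20ac is above 7ff: three bytes
'\xe2\x82\xac'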

Some of the advantages of using UTF-8 are as follows:

  • A Unicode string can be converted to a string of bytes with no embedded zero bytes and, for mostly-ASCII text, far fewer wasted bytes. This is a huge advantage, as the data can be sent over a network more efficiently than with the 4-byte scheme above.
  • If one or more bytes are corrupted, it is still possible to find the start of the next UTF-8 encoded code point. This helps in re-synchronizing the data.
  • It can handle all Unicode code points, so virtually every character is taken care of.
  • ASCII is a subset of UTF-8, so ASCII strings can be handled without any modification whatsoever.
  • There are a host of other advantages, along with a few disadvantages, but the advantages outweigh the demerits very comprehensively.

Latin-1 (ISO-8859-1) is also a reasonably good encoding if you only intend to use characters with code points between 0 and 255. Beyond that range, it fails.
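You can see both the success and the failure at the prompt:

>>> u'\u00e9'.encode('latin-1')   # code point e9 is within 0-255
'\xe9'
>>> u'\u20ac'.encode('latin-1')   # the euro sign lies beyond 255
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u20ac' in position 0: ordinal not in range(256)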

EBCDIC (IBM's encoding scheme) has its own share of advantages and disadvantages. We will not go into an in-depth study of it, but suffice it to say that its letters are not stored in one contiguous run; they are divided into separate chunks, so a mapping scheme must be in place to handle this encoding appropriately.

Python 2 uses a data type called 'unicode' for the sole purpose of handling Unicode strings. Like 'str' (the 8-bit string type), it derives from the common base type 'basestring', and hence most methods available on 'str' are also available on 'unicode'. In our code, we will use this type's constructor (named 'unicode'). Normally, Python represents 'unicode' strings as sequences of 16-bit integers, but this can be changed to 32-bit integers when compiling the Python interpreter.

The unicode constructor has the following signature:

unicode(string[, encoding[, errors]])

Here 'string' is the byte string to be converted to Unicode, 'encoding' is the encoding to be used (the default is 'ascii'), and 'errors' is the policy for dealing with errors if they arise for some reason. The values of 'errors' are 'strict', 'replace' and 'ignore'. 'strict' makes the code throw an exception as soon as a character in the string violates the encoding, 'replace' substitutes the official replacement character (U+FFFD) for each character that could not be decoded, and 'ignore' simply drops any character that could not be converted, leaving nothing in its place. 'ignore' should be used only in very rare cases where the conversion is otherwise next to impossible, since dropping characters means throwing away information, which is definitely not a good thing to do.
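Here is the constructor in action at the Python 2 prompt:

>>> unicode('hello')                    # plain ASCII works with the default codec
u'hello'
>>> unicode('caf\xe9', 'latin-1')       # byte e9 decoded as Latin-1
u'caf\xe9'
>>> unicode('caf\xe9', 'ascii', 'replace')
u'caf\ufffd'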

Well, enough talk on this topic; now let us see some more examples. Please note that I will be running code from my Python interactive prompt, but you may easily copy the lines into a file and execute it.

# Let's start with UTF-8

>>> import os, sys, re  # modules I import whether or not they end up being used; more often than not, they do
>>> c = b'\x85AWHO Colony\x80 Housing Society'
>>> c.decode("utf-8", "strict")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/supriyo/work/blogs/blogenv/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x85 in position 0: invalid start byte

So, using "strict", I immediately get an error specifying the cause of the failure: UTF-8 cannot decode the byte 0x85, so it raises an exception and stops right there. Note that it didn't even get to 0x80, since it failed on the very first byte.

Now, let us see what we can do with 'replace':

>>> import os, sys, re
>>> c = b'\x85AWHO Colony\x80 Housing Society'
>>> c.decode("utf-8", "replace")
u'\ufffdAWHO Colony\ufffd Housing Society'

Now you can easily see the difference. Since we specified 'replace', the bytes the codec couldn't handle were replaced with the replacement character u'\ufffd'. However, even this parameter should be used with caution: it also causes loss of information, so in that respect it is no better than the 'ignore' value of the parameter.

Let's see 'ignore' in action. This one is pretty blunt: never use it in situations such as financial dealings, and use your judgement to decide whether silently dropping characters is acceptable at all.

>>> import os, sys, re
>>> c = b'\x85AWHO Colony\x80 Housing Society'
>>> c.decode("utf-8", "ignore")
u'AWHO Colony Housing Society'

As you can see, it has simply dropped the \x85 and \x80 bytes. This won't be acceptable in a lot of situations.

So, the best option we have seen so far is the 'strict' value of the 'errors' parameter: it throws an error, and it is then up to the programmer/developer to handle the situation. So how do we handle it?

There is no straightforward way to handle this type of situation, and no rule of thumb that can hand you a solution. What you might do is find out the context of the document in which the string occurs and guess the language it is written in. Next, look up the list of standard encodings in Python's codecs documentation; it contains about 100 encodings based on various languages, and one of them should be the right fit. For example, you may find that 'cp1252' is the correct encoding for your case, so you might use the following line to re-encode the string as UTF-8:

str.decode('cp1252').encode('utf-8')

where 'str' is the given string that contains the problematic characters (note that in real code you would avoid naming a variable 'str', since it shadows the built-in type).
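One way to put this guesswork into practice is a small helper that tries a list of candidate encodings in order. This is only a sketch: the candidate list ('utf-8', 'cp1252', 'latin-1') and the name 'to_utf8' are my own illustrative choices, and you should fill the list with encodings that match the language and context of your own documents.

def to_utf8(raw, candidates=('utf-8', 'cp1252', 'latin-1')):
    # Decode 'raw' using the first candidate encoding that fits,
    # then re-encode it as UTF-8.
    for enc in candidates:
        try:
            return raw.decode(enc).encode('utf-8')
        except UnicodeDecodeError:
            continue
    # None of the candidates fit; fall back to 'replace' so the
    # readable parts of the data are at least preserved.
    return raw.decode('utf-8', 'replace').encode('utf-8')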

There are a few functions that you should know about when dealing with Unicode data: 'unichr' and 'ord'. 'unichr' takes an integer value and returns the Unicode string of length 1 that corresponds to that code point. 'ord' works the other way round: it takes a single Unicode character and gives back the code point pertaining to it. See them in action below:

>>> unichr(49000)
u'\ubf68'
>>> ord(u'\ubf68')
49000

These are pretty helpful functions, and you will need them if you start doing some serious Unicode manipulation in Python.

As I mentioned earlier, the 'unicode' datatype shares its base type with the 'str' datatype (the 8-bit string), and hence quite a few methods are common between the two. Examples are 'count', 'find', 'replace', 'upper', etc. I won't go into these functions here, as they are pretty trivial to use and require no explanation; you can check them out on your Python interactive console as an exercise, or glance at the sample below.
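For instance:

>>> u'caf\xe9'.upper()     # case mapping works beyond ASCII
u'CAF\xc9'
>>> u'\ubf68abc'.find(u'b')
2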

Unicode Characters in Python Source Code:

So far, we have seen how to handle Unicode data from Python. But what if the Python source code itself contains Unicode characters? The way to handle this is quite easy. But why would one need such a thing? Can such scenarios actually occur? They definitely can. For example, at one point in my career as a software developer, we were working on a dark-web project: we were supposed to collect data on darknet websites that sell drugs. It involved scraping certain websites whose content was in some non-English language (I don't remember which) and we were using Beautiful Soup to parse it. In order to parse each page, we had to rely on content that appeared repeatedly. This is how we handled the situation:

# -*- coding: utf-8 -*-
from __future__ import unicode_literals

from django.shortcuts import render

from django.views.decorators.csrf import csrf_exempt, csrf_protect
#from django.core.context_processors import csrf
from django.views.generic import View
from django.http import HttpResponseBadRequest, HttpResponse, HttpResponseRedirect, HttpRequest

...

Please note the first line: it declares that the source file may contain non-English characters and tells the interpreter how to decode them. We have not shown the rest of the code, as it is not in the public domain. You can use other encodings as well; for example, to use Latin-1, you would do the following:

# -*- coding: latin-1 -*-
#code comes here....
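Putting it together, here is a minimal, self-contained sketch (the Hindi greeting is just an illustrative example, not from the original project):

# -*- coding: utf-8 -*-
greeting = u'नमस्ते'             # a non-English literal in the source itself
print greeting.encode('utf-8')  # encode explicitly before writing to stdout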

Conclusion:

In the above discussion, we have just scratched the surface of Unicode usage in Python. There is quite a bit more that can be done: string manipulation, regex matching and other cool stuff all work with Unicode in Python, and the official documentation covers all such operations. I hope this post is a decent place to start understanding the concepts of Unicode usage in Python. I will definitely bring out a part 2 of this topic sometime soon. Thanks for your patience if you have read this far.

 
