Online Conversion from SQL_ASCII to UTF8 in PostgreSQL
Credits: George Hansper, Ricardo Vassellini, Evgeny Shebanin, Sam McLeod
Scripts and source available here: sql_ascii_to_utf8
The goal is to take a PostgreSQL database which is in SQL_ASCII encoding, and import it into a UTF8-encoded database.
PostgreSQL will generate errors like this if it encounters any non-UTF8 byte sequences during a database restore:
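A typical error of this kind is PostgreSQL's standard complaint about an invalid byte sequence, along these lines (the exact byte value and surrounding restore output will differ):

```
pg_restore: ERROR:  invalid byte sequence for encoding "UTF8": 0xa9
```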
The corresponding data will then be omitted from the database (in this case the whole table, even the rows which did not have a problem):
The task is to find and replace characters in an SQL_ASCII-encoded database which do not conform to the UTF8 encoding requirements. This should be implemented as an SQL script, so that the data can be updated on a live database, to minimise the downtime required.
There is an existing script which will find the offending rows in a table: http://sniptools.com/databases/finding-non-utf8-values-in-postgresql
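The idea behind that kind of search can be sketched as follows (table and column names are placeholders, and the linked script is more general). In an SQL_ASCII database, any byte in the range 0x80-0xFF is a potential problem, so a regex for high-bit bytes lists the candidate rows:

```sql
-- Sketch only: my_table / my_column are placeholder names.
-- The regex engine interprets \x80-\xff as raw byte values, so this
-- returns every row containing at least one high-bit byte.
SELECT ctid, my_column
FROM my_table
WHERE my_column ~ '[\x80-\xff]';
```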
While this will find the offending rows, it cannot, as it stands, do anything about them. A more sophisticated script is required to actually replace the non-UTF8 characters with something acceptable. Two SQL scripts are provided here; the first of these is:
This will replace all non-UTF8 characters with underscores, which (at least) allows the data to be imported successfully into a UTF8 database. The second of these SQL scripts is:
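The core idea of the underscore replacement can be sketched like this. Note this is a simplification, not the actual script: the real function preserves byte sequences that already form valid multi-byte UTF8, whereas this sketch replaces every high-bit byte unconditionally:

```sql
-- Simplified sketch with placeholder names, not the shipped script:
-- replace every byte in 0x80-0xFF with '_'. The real function is more
-- careful and keeps sequences that are already valid UTF8.
UPDATE my_table
SET my_column = regexp_replace(my_column, '[\x80-\xff]', '_', 'g')
WHERE my_column ~ '[\x80-\xff]';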
will replace selected bytes with a UTF8 sequence corresponding to what Firefox renders when the byte is encoded as
where NN is the hexadecimal value of the byte.
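For example, one such translation (an assumed mapping in the spirit of the script, not necessarily one it ships with) maps the Latin-1 copyright sign, byte 0xA9, to the two-byte UTF8 sequence 0xC2 0xA9:

```sql
-- Hypothetical single translation: 0xA9 (Latin-1 '©') -> C2 A9 (UTF8 '©').
-- A byte-wise replace works because SQL_ASCII treats strings as raw bytes.
UPDATE my_table
SET my_column = replace(my_column, e'\xa9', e'\xc2\xa9')
WHERE my_column LIKE e'%\xa9%';
```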
Not all values from 0x80-0xFF are covered by this script. Please add your own translations as required.
Any byte without a specific translation will be replaced with an underscore.
The Triggers Problem
process_non_utf8_at_schema() works just fine, BUT if there are any triggers on the rows being updated, those triggers are also invoked.
Such triggers may expect a specific set of fields to be updated together, or may increment sequence numbers.
Running these triggers would be an undesirable side-effect of what should be a simple text-update.
The Locking Problem
The original solution was designed to minimise downtime, and these scripts would be ineffective if they were to lock a table for anything more than a couple of seconds.
Unfortunately, this is exactly what happens if triggers are disabled per-table while updating the text like this:
Postgres wraps it all in a transaction, and locks the table until the update is complete (which can be minutes on a large table).
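In other words, the problematic per-table pattern looks something like this (sketched with placeholder names):

```sql
-- ALTER TABLE ... DISABLE TRIGGER takes an ACCESS EXCLUSIVE lock on the
-- table, and because everything runs inside one transaction, that lock
-- is held until COMMIT -- blocking all readers and writers meanwhile.
BEGIN;
ALTER TABLE some_table DISABLE TRIGGER ALL;
-- ... long-running UPDATE of the text columns here ...
ALTER TABLE some_table ENABLE TRIGGER ALL;
COMMIT;
```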
The Non-Locking, Non-Triggering Solution
a) Don’t use
ALTER TABLE _some_table_ DISABLE TRIGGER ALL;
Instead, use a session-based setting:
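PostgreSQL's standard session-level switch for this is session_replication_role (assuming that is the setting the scripts use): set to replica, it stops ordinary (non-ALWAYS) triggers from firing for the current session only, without touching the table definition:

```sql
-- Disable ordinary triggers for this session only; no table lock is taken,
-- and other sessions are unaffected.
SET session_replication_role = replica;

-- ... run the clean-up functions here ...

-- Restore normal trigger behaviour.
SET session_replication_role = DEFAULT;
```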
This has the effect of disabling triggers, but does not lock the whole table while the function is running.
b) Don’t try to run the whole DB in one go.
The following script searches for offending table/column combinations, and then invokes process_non_utf8_at_column() on each of these individually.
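The table/column search can be sketched from information_schema; the per-column call is shown with an assumed signature, since the real function's arguments live in the repository:

```sql
-- Sketch: list every text-like column in the public schema. The real
-- script loops over this result and fixes each column individually,
-- keeping each transaction small.
SELECT table_name, column_name
FROM information_schema.columns
WHERE table_schema = 'public'
  AND data_type IN ('text', 'character varying', 'character');

-- For each row returned, something like (hypothetical signature):
--   SELECT process_non_utf8_at_column('public', table_name, column_name);
```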
Sample DB and Outputs
A test database can be created using the script:
This should be run as follows:
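A typical invocation would look something like this (the database, user, and script filename below are placeholders; substitute the actual names from the repository):

```
# Hypothetical invocation -- adjust user, database, and filename to suit.
psql -U postgres -d postgres -f create_test_db.sql
```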
When we run the script, we should see the following output: