To be able to take a Postgres Database which is in SQL_ASCII encoding, and import it into a UTF8 encoded database.
Postresql will generate errors like this if it encounters any non-UTF8 byte-sequences during a database restore:
And the corresponding data will be omitted from the database (in this case, the whole table, even the rows which did not have a problem):
To find and replace characters in an SQL_ASCII encoded database which do not conform to the UTF8 encoding requirements.
This should be implemented as an SQL script, so that the data can be updated on a live database, to mininmise the downtime required.
There is an existing script which will find the offending rows in a table:
While this will find the offending rows, it cannot do anything about it, as it stands.
A more sophisticated script is required to actually replace the non-UTF8 characters with something acceptable.
Two SQL scripts are provided here, the first of these SQL scripts is:
will replace selected bytes with a UTF8-sequence which corresponds to what is rendered by Firefox when encoded as &#NN;
where NN is the hexadecimal value of the byte.
Not all values from 80-FF are covered by this script. Please add your own translations as required.
Any byte without a specific translation will be replaced with an underscore.
The Triggers Problem
The functions process_non_utf8_at_column() and process_non_utf8_at_schema() work just fine, BUT if there are any ‘triggers’ on the rows being updated, these triggers are also invoked.
Such triggers may expect a specific set of fields to be updated together, or increment sequence numbers.
Running these triggers would be an undesirable side-effect of what should be a simple text-update.
The Locking Problem
The original solution was designed to minimise downtime, and these scripts would be ineffective if they were to lock table for anything more than a couple of seconds.
Unfortunately, this is exactly what happens if triggers are diabled per-table while updating the text like this:
Postgres wraps it all in a transaction, and locks the table until the update is complete (which can be minutes on a large table).
The Non-Locking, Non-Triggering Solution
a) Don’t use ALTER TABLE _some_table_ DISABLE TRIGGER ALL;
Instead, use a session-based setting:
This has the effect of disabling triggers, but does not lock the whole table while the function is running.
b) Don’t try to run the whole DB in one go.
The following script does a search for offending table,column combinations, and then invokes process_non_utf8_at_column() on each of these individually.