Error in encoding mysql -> How can I reconvert it to something else?

I started a website some time ago using the wrong CHARSET in my DB and site. The HTML was set to ISO... and the DB to Latin... , the page was saved in Western latin... a big mess.

The site is in French, so I created a function that replaced all accents like "é" to "é". Which solved the issue temporarily.

I just learned a lot more about programming, and now my files are saved as Unicode UTF-8, the HTML is in UTF-8 and my MySQL table columns are set to ut8_encoding...

I tried to move back the accents to "é" instead of the "é", but I get the usual charset issues with the (?) or weird characters "â" both in MySQL and when the page is displayed.

I need to find a way to update my sql, through a function that cleans the strings so that it can finally go back to normal. At the moment my function looks like this but doesn't work:

function stripAcc3($value){

 $ent =   array(
          'à'=>'à', 
          'â'=>'â', 
            'ù'=>'ù', 
          'û'=>'û',
            'é'=>'é', 
          'è'=>'è', 
          'ê'=>'ê', 
            'ç'=>'ç', 
            'Ç'=>'Ç', 
            "î"=>'î', 
            "Ï"=>'ï', 
            "ö"=>'ö', 
            "ô"=>'ô', 
            "ë"=>'ë', 
            "ü"=>'ü', 
            "Ä"=>'ä',
            "€"=>'€',
          "′"=> "'",
          "é"=> "é"
        );

    return strtr($value, $ent);
}

Any help welcome. Thanks in advance. If you need code, please tell me which part.

UPDATE

If you want the bounty points, I need detailed instructions on how to do it. Thanks.


ANSWERS:


Try using the following function instead, it should handle all the issues you described:

function makeStringUTF8($data)
{
    if (is_string($data) === true)
    {
        // has html entities?
        if (strpos($data, '&') !== false)
        {
            // if so, revert back to normal
            $data = html_entity_decode($data, ENT_QUOTES, 'UTF-8');
        }

        // make sure it's UTF-8
        if (function_exists('iconv') === true)
        {
            return @iconv('UTF-8', 'UTF-8//IGNORE', $data);
        }

        else if (function_exists('mb_convert_encoding') === true)
        {
            return mb_convert_encoding($data, 'UTF-8', 'UTF-8');
        }

        return utf8_encode(utf8_decode($data));
    }

    else if (is_array($data) === true)
    {
        $result = array();

        foreach ($data as $key => $value)
        {
            $result[makeStringUTF8($key)] = makeStringUTF8($value);
        }

        return $result;
    }

    return $data;
}

Regarding the specific instructions of how to use this, I suggest the following:

  1. export your old latin database (I hope you still have it) contents as an SQL/CSV dump *
  2. use the above function on the file contents and save the result on another file
  3. import the file you generated in the previous step into the UTF-8 aware schema / database

* Example:

file_put_contents('utf8.sql', makeStringUTF8(file_get_contents('latin.sql')));

This should do it, if it doesn't let me know.


You might want to investigate what is used to fix WP database encoding issues:

To cut a long story short, most old WP sites were created with Swedish/Latin1 collated tables, which were used to store UTF8 strings. To collate the tables properly, the approach is to change the column to binary type, and then to change it to UTF8 text.

This avoids that the text gets wrangled when converting from Latin1 to UTF8 directly.


You will need to convert the offending rows using for example iconv. The challenge for you will be to know what rows are already UTF-8 and which are latin-1.


I'm not completely sure I understand your question, but if you have

  • a UTF-8 database

  • all special characters in there stored as HTML entities

then a

html_entity_decode($string, ENT_QUOTES, "UTF-8");

should do the trick and turn all entities back into their UTF-8 native characters.


Make sure, not just your tables use utf-8, your database connection should use utf-8 as well.

$this->db = mysql_connect(MYSQL_SERVER,DB_LOGIN,DB_PASS);
mysql_set_charset  ('utf8',$this->getConnection());

If you want to discuss with your database in UTF-8 you have to tell the Database that the connexion flow is a UTF-8 flow. You have to sent a request before each request you make to the database, this request in the following :

"SET NAMES utf8";

Personnaly I use that in the connect.inc.php files which create the connection to the database. Which this statement the database know that your sending UTF-8 encoded string and works perfectly !

mysql_set_charset function isn't working well, i tried this function in the past but the truth is that it don't do the trick.

For your complete issue, if you want to convert latin1 string to UTF-8, you have to convert first the latin1 string to a binary string format. Then convert the binary string into UTF-8 string, all can be done inside the database with database commands. See that artile (in french) :

I can tell you that this method works because i used it to transform data from a database I've created.



 MORE:


 ? Setting PHP default encoding to utf-8?
 ? How to change a string to utf8 encoded?
 ? How can I post data with overlong encoding to test for vulnerabilities?
 ? How to set text file encoding in PHP?
 ? Transferring extended ascii characters with unknown encoding to a Twisted XMLRPC from C#
 ? Transferring extended ascii characters with unknown encoding to a Twisted XMLRPC from C#
 ? Transferring extended ascii characters with unknown encoding to a Twisted XMLRPC from C#
 ? how to: Twisted privmsg to accept non-ascii strings
 ? How to send XML-RPC request from XML-RPC callback in twisted
 ? Allowing any ASCII character transmitted in string with xmlrpc