Why do some Unicode characters display in matrices, but not data frames in R?

For at least some cases, Asian characters are printable if they are contained in a matrix, or a vector, but not in a data.frame. Here is an example

q<-'天'

q # Works
# [1] "天" 

matrix(q) # Works
#      [,1]
# [1,] "天"

q2<-data.frame(q,stringsAsFactors=FALSE) 
q2 # Does not work
#          q
# 1 <U+5929>

q2[1,] # Works again.
# [1] "天"

Clearly, my device is capable of displaying the character, but when it is in a data.frame, it does not work.

Doing some digging, I found that the print.data.frame function runs format on each column. It turns out that if you run format.default directly, the same problem occurs:

format(q)
# "<U+5929>"

Digging into format.default, I find that it is calling the internal format, written in C.

Before I dig any further, I want to know if others can reproduce this behaviour. Is there some configuration of R that would allow me to display these characters within data.frames?

My sessionInfo(), if it helps:

R version 3.0.1 (2013-05-16)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_Canada.1252  LC_CTYPE=English_Canada.1252   
[3] LC_MONETARY=English_Canada.1252 LC_NUMERIC=C                   
[5] LC_TIME=English_Canada.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] tools_3.0.1


ANSWERS:


I hate to answer my own question, but although the comments and answers helped, they weren't quite right. In Windows, it doesn't seem like you can set a generic 'UTF-8' locale. You can, however, set country-specific locales, which will work in this case:

Sys.setlocale("LC_CTYPE", locale="Chinese")
q2 # Works fine
#  q
#1 天

But, it does make me wonder why exactly format seems to use the locale; I wonder if there is a way to have it ignore the locale in Windows. I also wonder if there is some generic UTF-8 locale that I don't know about on Windows.


I just blogged about Unicode and R several days ago. I think your R editor is UTF-8 and this gives your illusion that R in your Windows handles UTF-8 characters.

The short answer is when you want to process Unicode (Here, it is Chinese), don't use English Windows, use a Chinese version Windows or Linux which by default is UTF-8.

Session info in my Ubuntu:

> sessionInfo()
R version 2.14.1 (2011-12-22)
Platform: i686-pc-linux-gnu (32-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=C                 LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       


 MORE:


 ? how to read data in utf-8 format in R?
 ? Reading Chinese characters in R
 ? UTF-8 Width Display Issue of Chinese Characters
 ? Display chinese characters WITHOUT using utf8 encoding?
 ? UTF-8 and Chinese Characters
 ? How to show Chinese characters correctly in rCharts
 ? Chinese Character Encoding (UTF-8, GBK)
 ? Reading CSV files with Chinese Characters
 ? R: extracting "clean" UTF-8 text from a web page scraped with RCurl
 ? R: extracting "clean" UTF-8 text from a web page scraped with RCurl