Beware when using Proc Copy with Pandas

How to deal with file encodings

Alberto Negron
11-28-2018

Imagine the following scenario:

In your SAS infrastructure you have a base library (let’s call it out) that you use as a bridge between SAS and Python. Any SAS dataset that needs to be imported into Python is copied to this library.

There are various methods to copy SAS datasets from one library to another, but I will mention only two: Proc copy and the data step.

Proc copy is really handy when you need to copy a large number of datasets and it does it perfectly - all attributes of the original datasets are copied along with the data to the new datasets.

With the data step, the data is copied but some of the original attributes are overwritten based on various factors.

One of the key attributes for successfully reading SAS datasets in Pandas is the string encoding. When no encoding is given, Pandas returns character columns as raw bytes. SAS supports far more encodings than Pandas, and it is really easy to produce a dataset with an encoding that Pandas does not know how to read.

Here is an example:


libname out '/folders/myfolders/output';

proc copy in=sashelp out=out;
select heart ;
run;

data out.heart2;
set sashelp.heart;
run;

%LET DSID=%SYSFUNC(open(out.heart,i));
%PUT %SYSFUNC(ATTRC(&DSID,ENCODING));

%LET DSID=%SYSFUNC(open(out.heart2,i));
%PUT %SYSFUNC(ATTRC(&DSID,ENCODING));

Reading the log, we can see that the dataset created using Proc copy is encoded as us-ascii, which Pandas should theoretically support through Python's standard encodings, but it doesn’t!


 1          OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;
 72         
 73         
 74         %LET DSID=%SYSFUNC(open(out.heart,i));
 75         %PUT %SYSFUNC(ATTRC(&DSID,ENCODING));
 us-ascii  ASCII (ANSI)
 76         
 77         %LET DSID=%SYSFUNC(open(out.heart2,i));
 78         %PUT %SYSFUNC(ATTRC(&DSID,ENCODING));
 utf-8  Unicode (UTF-8)
 79         
 80         OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;
 93         
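As a quick sanity check on the "theoretically supported" part, the encoding names reported in the log can be looked up in Python's standard codec registry. This is only a sketch: the helper name resolve_codec is made up for illustration, and it only tells you whether Python recognises the name, not whether the bytes in the file actually conform to that encoding.

```python
import codecs


def resolve_codec(name):
    """Return Python's canonical codec name for `name`, or None if unknown.

    `resolve_codec` is a hypothetical helper, not part of Pandas or SAS.
    """
    try:
        return codecs.lookup(name).name
    except LookupError:
        return None


# The two encoding names reported by ATTRC(&DSID, ENCODING) in the log above.
print(resolve_codec("us-ascii"))             # a known alias of the ascii codec
print(resolve_codec("utf-8"))
print(resolve_codec("not-a-real-encoding"))  # unknown names give None
```

So the name itself is not the problem. Python accepts us-ascii, which suggests the trouble lies in the bytes stored in the file.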

Here is a short excerpt of the nasty error you get in Python:


sasds = pd.read_sas('~/projects/SASstudio/myfolders/output/heart.sas7bdat',
                    encoding='us-ascii')

---------------------------------------------------------------------------

UnicodeDecodeError                        Traceback (most recent call last)

<ipython-input-7-3a7293429424> in <module>()
----> 1 sasds = pd.read_sas('~/projects/SASstudio/myfolders/output/heart.sas7bdat'
    ,encoding='us-ascii')
    
... omitted ...


 UnicodeDecodeError: 'ascii' codec can't decode byte 0x90 in position 0: 
                      ordinal not in range(128)
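The same message can be reproduced without Pandas or the SAS file. The sketch below assumes a single byte 0x90 as a stand-in for whatever the reader hit at position 0; the helper first_non_ascii is hypothetical and simply locates the first byte that cannot be ASCII.

```python
def first_non_ascii(data):
    """Return (position, byte value) of the first byte >= 0x80, or None."""
    for position, value in enumerate(data):
        if value >= 0x80:
            return position, value
    return None


# Hypothetical stand-in for the bytes Pandas tried to decode.
raw = b"\x90"

message = None
try:
    raw.decode("us-ascii")
except UnicodeDecodeError as exc:
    # Same wording as the traceback above: 'ascii' codec can't decode byte 0x90 ...
    message = str(exc)
    print(message)

print(first_non_ascii(raw))
```

In other words, the file is labelled us-ascii but contains at least one byte outside the 0 to 127 range, and a strict ASCII decoder refuses it.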

You could say “I will deal with these bytes in Python later” (that is, read the file without an encoding) and you may get away with it as long as there are no missing values in your original dataset. With missing values you get NaN in Python, so your new column is a mix of bytes and float (NaN) values, which is really painful to deal with, so avoid it at all costs!
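To give an idea of the extra work involved, here is a minimal sketch of the kind of cleanup such a mixed column needs. It uses a plain Python list as a stand-in for the column, the values are invented, and decode_cell is a hypothetical helper rather than anything Pandas provides.

```python
import math


def decode_cell(value, encoding="utf-8", errors="strict"):
    """Decode bytes to str, turn float NaN into None, pass anything else through."""
    if isinstance(value, bytes):
        return value.decode(encoding, errors)
    if isinstance(value, float) and math.isnan(value):
        return None
    return value


# Invented stand-in for a character column read without an encoding:
# bytes where SAS had a value, float NaN where the value was missing.
column = [b"Female", float("nan"), b"Male"]

decoded = [decode_cell(value) for value in column]
print(decoded)
```

With a real DataFrame the same function could be passed to Series.map for each character column, but that is one more step to remember for every dataset, which is why fixing the encoding on the SAS side is the better option.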

One quick fix is simply to use out.heart2, created with a data step, as it comes utf-8 ready.

Another option, and my preferred one, is to make utf-8 the default encoding of the SAS library:


libname out '/folders/myfolders/output' outencoding='UTF-8';

Here is the link to the SAS Note so you can check which platforms are supported.

That’s all for now.

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Negron (2018, Nov. 28). Data Addict's Secret Diary: Beware when using Proc Copy with Pandas. Retrieved from http://www.dataaddict.me/posts/2018-11-28-beware-when-using-proc-copy-with-pandas/

BibTeX citation

@misc{negron2018beware,
  author = {Negron, Alberto},
  title = {Data Addict's Secret Diary: Beware when using Proc Copy with Pandas},
  url = {http://www.dataaddict.me/posts/2018-11-28-beware-when-using-proc-copy-with-pandas/},
  year = {2018}
}