Read SAS datasets in Python

A short intro on how to read SAS datasets using Pandas

Alberto Negron true
11-27-2018

This is the 1st post of a series of posts regarding to SAS & Python. My main goal is to show differences and similarities in the treatment of data between these tools. Notice I do not say SAS vs Python as this phrase has been used to fuel the nonsense war of which tool is better.

Every tool has its ups & downs and one should pick the tool that fits best.

Sometimes this tool is R, some others is a mix of tools such as SAS & Python.

I have diverted from the original post but I wanted to be clear with respect the SAS vs Python dilemma.

Anyhow, here is a artificial data set created in SAS which contains 2 columns: date a numeric column with format datetime and color a character column with no specific format.


libname out '/folders/myfolders/output';

data out.times(drop=i);
do i='27nov2018:15:00:00'dt to '27nov2018:20:00:00'dt by 3610;
 date = i;
 color = choosec(rand('integer',1,4),'red','blue','green','yellow');
 output;
format date datetime.;
end;
run;

We can read this dataset straight into python using the Pandas package. We’ll use the read_sas() method to create new dataframe in python. Note: dataset == dataframe.

read_sas() takes a few arguments (most of them comes with pre-defined values that you can altered if required). The one mandatory is file_path but I like to add format (although method can infer whether it is a xport or sas7bdat) to explicity define the file extension.


import pandas as pd
df = pd.read_sas('~/projects/SASstudio/myfolders/output/times.sas7bdat',format='sas7bdat')
print(df)

                 date     color
0 2018-11-27 15:00:00   b'blue'
1 2018-11-27 16:00:10  b'green'
2 2018-11-27 17:00:20   b'blue'
3 2018-11-27 18:00:30  b'green'
4 2018-11-27 19:00:40    b'red'

if no encoding is specified then character columns are read as raw bytes. Bytes literals are always prefixed with ‘b’ or ‘B’; they produce an instance of the bytes type instead of the str type. They may only contain ASCII characters; bytes with a numeric value of 128 or greater must be expressed with escapes.

Therefore we do want to use an encoding that translate a our SAS character column to a Python string type column. More often than not you will be fine using utf-8 but as a recommendation always check what is the default encoding of your OS.


df = pd.read_sas('~/projects/SASstudio/myfolders/output/times.sas7bdat',format='sas7bdat',encoding='utf-8')
print(df)

                 date  color
0 2018-11-27 15:00:00   blue
1 2018-11-27 16:00:10  green
2 2018-11-27 17:00:20   blue
3 2018-11-27 18:00:30  green
4 2018-11-27 19:00:40    red

Let’s do another example, lets read in sashelp.air into Python. This dataset contains a column date with date format MONYY.


df2 = pd.read_sas('~/projects/SASstudio/myfolders/output/air.sas7bdat',format='sas7bdat',encoding='utf-8')
print(df2[1:11])

         DATE    AIR
1  1949-02-01  118.0
2  1949-03-01  132.0
3  1949-04-01  129.0
4  1949-05-01  121.0
5  1949-06-01  135.0
6  1949-07-01  148.0
7  1949-08-01  148.0
8  1949-09-01  136.0
9  1949-10-01  119.0
10 1949-11-01  104.0

As you can see, Pandas displays the date column as yyyy-mm-dd rather than MONYY. If we need to replicate the SAS format in python we can use the dt.strftime function.


df2['date2'] = df2['DATE'].dt.strftime('%b%y')
print(df2[1:11])

         DATE    AIR  date2
1  1949-02-01  118.0  Feb49
2  1949-03-01  132.0  Mar49
3  1949-04-01  129.0  Apr49
4  1949-05-01  121.0  May49
5  1949-06-01  135.0  Jun49
6  1949-07-01  148.0  Jul49
7  1949-08-01  148.0  Aug49
8  1949-09-01  136.0  Sep49
9  1949-10-01  119.0  Oct49
10 1949-11-01  104.0  Nov49

and job done!!!

As you can see, it is not really difficult to import SAS datasets into python. Obviously there is always the edge case were there is no straight forward solution but hey life was not supposed to be easy ;-)

Keep in tune as I will be adding more tips & tricks for python & SAS.

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Negron (2018, Nov. 27). Data Addict's Secret Diary: Read SAS datasets in Python. Retrieved from http://www.dataaddict.me/posts/2018-11-27-read-sas-datasets-in-python/

BibTeX citation

@misc{negron2018read,
  author = {Negron, Alberto},
  title = {Data Addict's Secret Diary: Read SAS datasets in Python},
  url = {http://www.dataaddict.me/posts/2018-11-27-read-sas-datasets-in-python/},
  year = {2018}
}