While loading data using read_sas from sas dataset in Python getting b' along with every string. How can I fix this?
- 1,831
- 11
- 23
- 34
- 11
- 1
2 Answers
The other answer from Bala_Science provides a good suggestion to use encoding="utf-8" in the read_sas function, which addresses this issue of of extra characters like b' appearing in converted strings. However, it lacks an explanation of why this issue occurs, does not explore alternative solutions, and generalizes performance concerns with SAS datasets without providing solutions to improve performance..
I have found that getting b' appearing in strings often occurs when binary data from the SAS file is decoded improperly into text, often due to a mismatch in character encoding. By specifying the correct encoding, such as utf-8, it should be possible to get the data read in properly.
One way to do it in python is what @Bala_Science suggested in their answer:
import pandas as pd
mydf = pd.read_sas("mydataset.sas7bdat", format='sas7bdat', encoding="utf-8")
If that does not resolve the problem, then it probably requires a different encoding, so some guesswork might need to be done. Tools like https://freeonlineformatter.com/encoding-string and similar.
If they do have SAS, then perhaps the best solution is to use `SASPy:
import saspy
sas = saspy.SASsession()
df = sas.sasdata("mydataset", libref="work").to_df()
Note that this requires a SAS License and an active SAS backend. However, it does have another advantage - it is much faster than pandas - because, as also mentioned by @Bala_Science, loading SAS files into python with pandas can be slow. The suggestion in the other answer, to load from CSV files, is also a possibility, however as far as I know that would require using SAS to export the data to CSV, but if they have SAS then I think it makes more sense to use saspy. Perhaps there are some online conversion tools. A quick google search brought up this:
https://dumbmatter.com/sas7bdat/
which looks promising (I have not tried it).
Finally, if they don't have SAS available, and the online conversion tools do not do the job then another solution is to make use of the pyreadstat library:
import pyreadstat
df, meta = pyreadstat.read_sas7bdat("mydataset.sas7bdat")
- 103
- 1
- 4
Try using endcoding=utf-8
Code:
import pandas as pd
mydf=pd.read_sas("\home\mypath\mydataset.sas7bdat",format = 'sas7bdat', encoding="utf-8")
If the sas dataset is small, I would recommend loading from CSV file instead of sas dataset. Performance loading from SAS using Python is poor.
- 19
- 1