#############
##
## Commands for reading in leukemia.txt
##
#############

### Stata ###

Note that several of the variables are character variables, and others are dates in MM/DD/YY format. All of those variables should be read into Stata as string variables. The following Stata code can be copied into a .do file and then executed to read in the data.

    infile ptid str10 onstudy str1 tx str1 sex age fab karn wbc plt ///
        hgb str1 eval str1 cr crchemo str10 crdate str10 fudate ///
        str1 status str1 bmtx str10 bmtxdate str1 incl ///
        using http://www.emersonstatistics.com/datasets/leukemia.txt

Note also that the first line of the text file contains the variable names, and will thus be converted to missing values for the numeric variables. Similarly, there is some missing data recorded as "NA", and those, too, will be converted to missing values. If you do not want to see all the warning messages, you can use the "quietly" prefix. You may want to go ahead and drop the first case using "drop in 1", because it is just missing values.

### R ###

The data can be read in using

    leukemia <- read.table("http://www.emersonstatistics.com/datasets/leukemia.txt", header=TRUE, stringsAsFactors=FALSE)

#############
##
## Commands for handling dates in leukemia.txt
##
#############

### Stata ###

If you want to convert the dates to "Julian dates" (days since Jan 1, 1960 in Stata) you can use code like

    g onstJ= date(onstudy, "MD19Y")
    g fudtJ= date(fudate, "MD19Y")

After creating those variables, the observation time (in days from onstudy to the earlier of death or first analysis of the data) can then be created by code like

    g obstime= fudtJ - onstJ

### R ###

If you want to convert the dates to "Julian dates" (days since Jan 1, 1970 in R) you can use the following code to replace the strings with date objects:

    leukemia$onstudy <- as.Date(leukemia$onstudy,"%m/%d/%y")
    leukemia$fudate <- as.Date(leukemia$fudate,"%m/%d/%y")

Note that when R prints a date object, it formats it as a date rather than as the number. You can see what the number is if you type

    as.numeric(leukemia$onstudy)

You could add a variable obstime to the leukemia dataframe, defined as in the Stata code (follow-up date minus on-study date, so that the times are positive), by

    leukemia$obstime <- leukemia$fudate - leukemia$onstudy

#############
##
## Commands for encoding string variables to numbers in leukemia.txt
##
#############

### Stata ###

In order to use the binary variables in most analyses in Stata, we need them to be coded as numbers. Stata has a function "encode" that can be used to do this, but I recommend AGAINST using this. It will encode the strings as 1,2,... in alphabetical order. I would rather have the codes as 0,1 for binary variables. Thus I use:

    g male= .
    replace male=1 if sex=="M"
    replace male=0 if sex=="F"

### R ###

We could get away without encoding the string variables in R, but the commands we would use in our regression models might get a little more cumbersome. I recommend creating the binary indicator variables. There are LOTS of ways I could do this.
The one I would choose here is something like the following, which will set the variable to missing if sex is not "M" or "F":

    leukemia$male <- ifelse(leukemia$sex=="M",1,ifelse(leukemia$sex=="F",0,NA))
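
To check the R steps without downloading the file, here is a minimal sketch that applies the same date conversion and indicator coding to a small made-up data frame. The data frame name toy and all of its values are invented for illustration and are not taken from leukemia.txt; only the column names onstudy, fudate, and sex mirror the real file. The as.numeric() wrapper around the date difference is an optional addition that stores obstime as a plain number of days.

```r
# Toy data: hypothetical values, NOT from leukemia.txt.
# Only the column names mirror the real file.
toy <- data.frame(
  onstudy = c("01/15/85", "03/02/86", "11/30/86"),
  fudate  = c("02/14/85", "03/02/87", "12/30/86"),
  sex     = c("M", "F", NA),
  stringsAsFactors = FALSE
)

# Replace the MM/DD/YY strings with Date objects.
toy$onstudy <- as.Date(toy$onstudy, "%m/%d/%y")
toy$fudate  <- as.Date(toy$fudate,  "%m/%d/%y")

# Observation time in days: follow-up date minus on-study date.
# as.numeric() drops the "difftime" class and keeps the number of days.
toy$obstime <- as.numeric(toy$fudate - toy$onstudy)

# 0/1 indicator; anything other than "M" or "F" (including NA) becomes NA.
toy$male <- ifelse(toy$sex == "M", 1, ifelse(toy$sex == "F", 0, NA))

print(toy)
```

One thing to watch with "%y": R reads two-digit years 69 to 99 as 19xx but 00 to 68 as 20xx, whereas the Stata mask "MD19Y" always assumes 19xx. It may be worth checking range(leukemia$onstudy) and range(leukemia$fudate) after conversion to confirm the years came out as intended.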