Announcement

Creating a new variable based on the value of other variables

27 Feb 2020, 06:47

I have created a categorical variable (values 1 through 5) in excel based on certain characteristics of US states in a 20 year time period.

I have a large data set with over 2 million rows and I am trying to assign the value of the categorical variable to each row in my data set based on the state and the year.

What I have tried so far is:

gen state_year = 1
replace state_year = 5 if (year==1962 & state==1)


However, since there are a 1000 combinations of states and years (50*20), this is going to take a very long time to code in a do-file.

Is there a way to import my excel file so that it creates a variable and assigns the value of that variable depending on what is in the excel?
I have attached a screenshot of the first few rows of my excel document.

Any help would be much appreciated. Thanks in advance!

Tags: None Stephen Jenkins 27 Feb 2020, 07:03

It's not clear to me what the 'rule' is according to which you create the categorical variable "state-year". E.g. does "year" refer to a specific calendar year? What is the range of 'year' in your data set and does it differ across states? And we can't even read the full name of the column heading in your spreadsheet snapshot. (Remember that screenshots are deprecated here.)

I would recommend first that you back up a bit and think about what those 'rules' are; hence how one would code the creation of the variable you require. Second, I suggest that you get your spreadsheet data into Stata 'as is': see help import excel, for instance. And then do the variable creation from there.

Welcome to Statalist. Please take a few minutes to read the Forum FAQ (hit the black line at top of page), especially the sections about how to post effectively (including the sections about dataex).

Comment

Post Cancel Igor Paploski 27 Feb 2020, 07:26
egen category = group(State-se Year)
To understand what it does, type help egen and look for entry for group.

Comment

Post Cancel Harry Day 27 Feb 2020, 08:47 Originally posted by Stephen Jenkins View Post

It's not clear to me what the 'rule' is according to which you create the categorical variable "state-year". E.g. does "year" refer to a specific calendar year? What is the range of 'year' in your data set and does it differ across states? And we can't even read the full name of the column heading in your spreadsheet snapshot. (Remember that screenshots are deprecated here.)

I would recommend first that you back up a bit and think about what those 'rules' are; hence how one would code the creation of the variable you require. Second, I suggest that you get your spreadsheet data into Stata 'as is': see help import excel, for instance. And then do the variable creation from there.

Welcome to Statalist. Please take a few minutes to read the Forum FAQ (hit the black line at top of page), especially the sections about how to post effectively (including the sections about dataex).

The rule for the categorical variable is the measurement of the restrictiveness of abortion in each of the 50 US states (in the period 1962-1982). It's name should probably not be state_year as this is misleading, apologies for the confusion here. I wall call it abor_rest instead.

In my Stata data set, I have micro-data on 2 million subjects, from all 50 states across the period 1962-1982. The variable names include state (which is a number from 1 to 50) and year (which is the year), along with many other variables which are not relevant to my query.

Separate to this, I have constructed a categorical variable (abor_rest) in excel, with values 1 through 5, which varies depending on the state and the year in question. This variable has values for all 50 states for the aforementioned period. Below I have renamed the column so it is clearer as to what I am trying to do.

The column heading is state_serial, this just assigns a value between 1 and 50 to each state.

In my Stata data set, what I want to do is create a new variable (abor_rest), and assign the value of abor_rest from my excel file to my Stata data. For example, any row which has year of 1962 and state as 1 (Alabama) will be designated as 5 for my new variable.


This means that rows 741-746 would populate as 5 under the abor_rest column.

I know how to do this individually:

gen abor_rest = .
replace abor_rest = 5 if (year==1962 & state==1)

but do not know how to do this on an aggregated level. for 50 states over 20 years.

I hope what I am asking is clearer.