H1B Data Description

The USCIS received around 233,000 H1B petitions during the filling period in 2015, which is another new high after last year’s 172,000 applications. The USCIS uses a compter-generated random selection process to select enough petitions to meet the 6,500 general-category cap and 20,000 cap under the advanced degree exemption. To increase tranparency, the United States release a Excel file every year with data pertaining to H-1B Visas and the like. This year’s data can be downloaded here. The dataset contains certification determinations issued between October 1, 2014 and March 30, 2015. Thanks to my alumni John Zhou who introduced this dataset to me. I explored the dataset using R with several data analysis and visulization libraries.

Here are the columns in the dataset that I found interesting:

Data Analysis:

We start with loading the libraries we are going to use:

library(ggplot2)
library(readxl)
library(knitr)
library(dplyr)
library(maps)
library(stringr)
opts_chunk$set(fig.width=10, fig.path="../figure/")

The Excel file is over 80 MB which is a bit slow for R to swallow, so I decided to try out the newly published [readxl package] (http://cran.r-project.org/web/packages/readxl/index.html).

file.info("H-1B_FY15_Q2.xlsx")$size
## [1] 87517742
h1b<-read_excel("H-1B_FY15_Q2.xlsx")

It works just fine for me.

Data Cleaning

First, I selected only H1-B petitions and converted the wages to year bases using WAGE_RATE_OF_PAY and WAGE_UNIT_OF_PAY columns.

#select the column of interest, fix the data type, convert wage to year bases
h1b_tidy<-h1b%>%select(CASE_NO,
                       CASE_STATUS,
                       CASE_SUBMITTED,
                       DECISION_DATE,
                       VISA_CLASS,
                       EMPLOYER_NAME,
                       SOC_NAME,
                       WAGE_RATE_OF_PAY_FROM,
                       WAGE_RATE_OF_PAY_TO,
                       WAGE_UNIT_OF_PAY,
                       WORKSITE_COUNTY,
                       WORKSITE_STATE)%>%
                filter(VISA_CLASS=="H-1B")%>%
                mutate_each_(funs(as.factor), 
                             c("CASE_STATUS",
                               "VISA_CLASS",
                               "EMPLOYER_NAME",
                               "SOC_NAME", 
                               "WAGE_UNIT_OF_PAY", 
                               "WORKSITE_COUNTY", 
                               "WORKSITE_COUNTY"))%>%
                mutate_each_(funs(as.numeric),
                             c("WAGE_RATE_OF_PAY_FROM","WAGE_RATE_OF_PAY_TO"))%>%
                mutate(WAGE_RATE=apply(cbind(WAGE_RATE_OF_PAY_FROM, WAGE_RATE_OF_PAY_TO),1, FUN=function(x) mean(x,na.rm=TRUE)))%>%
                mutate(WAGE=WAGE_RATE*unlist(sapply(as.character(WAGE_UNIT_OF_PAY),  switch, "Week"=52,
                                              "Bi-Weekly"=52/2,
                                              "Hour"=8*240,
                                              "Month"=12,
                                              "Year"=1,0)))

There are appearently some incorrect records in the wages, e.g. wrong wage unit. Here I removed records with yearly wage more than half a million. Please let me know if you have applied H1B this year with more than half a million income, we should make friends:)

h1b_tidy<-filter(h1b_tidy, WAGE<500000 | WAGE==0, VISA_CLASS=="H-1B")
nrow(h1b_tidy)
## [1] 375974

I used SOC_NAME as the occupation information since it covers much wider range than JOB_TITLE column. For example, PROGRAMMER ANALYST and DEVELOPMENT SENIOR are both classified as COMPUTER PROGRAMMERS using SOC_NAME.

Questions of Interest

Let us check several questions of interest.

  • Top 5 majors of H1B applications
socTop<-row.names(as.data.frame(summary(h1b_tidy$SOC_NAME)))[1:5]
h1b_tidy<-mutate(h1b_tidy, soc5=ifelse(SOC_NAME %in% socTop, 
                                         as.character(SOC_NAME), 
                                         "Others"))
soc<-table(h1b_tidy$soc5)
percent<-paste0(100*round(soc/sum(soc),3),"%")
labels<-do.call(paste,c(as.data.frame((cbind(names(soc),percent))),list(sep=": ")))
pie(as.numeric(soc/sum(soc)),label=labels,radius=0.7,cex=0.7)

Computer science is definitely the way to go. For incoming international students who haven’t chosen a major, I hope this is informative enough.

  • Top 8 popular employer
companyTop<-row.names(as.data.frame(summary(h1b_tidy$EMPLOYER_NAME)))[1:8]
h1b_tidy<-mutate(h1b_tidy, company5=ifelse(EMPLOYER_NAME %in% companyTop, 
                                         as.character(EMPLOYER_NAME), 
                                         "Others"))

company<-table(h1b_tidy$company5)
percent<-paste0(100*round(company/sum(company),3),"%")
labels<-do.call(paste,c(as.data.frame((cbind(names(company),percent))),list(sep=": ")))
pie(as.numeric(company/sum(company)),label=labels,radius=0.7,cex=0.6)

Interestingly, most of the top popular employers are technology service companies which have headquarter or major business in India. They are INFOSYS, IGATE (recently accquired by Capgemini), IBM India, HCL, WIPRO, and TATA.

  • Top 5 jobs with the highest income.
H1B_income<-h1b_tidy%>%group_by(SOC_NAME)%>%summarise(AvgIncome=mean(WAGE))%>%arrange(desc(AvgIncome))
ggplot(H1B_income[1:5,], aes(x=SOC_NAME, y=AvgIncome))+
  geom_bar(stat="identity", fill = I("grey50"))+
  scale_x_discrete(labels = function(x) str_wrap(x, width = 5))+
  xlab("Job Title")+
  ylab("Average Income ($ per year)")

For international students, attending medical schools seems very promising, although the upfront investment could be huge.

  • Number of H1b Applicants of the lower 48 states:
library(maps)
H1B_State<-h1b_tidy%>%group_by(WORKSITE_STATE)%>%summarise(NoH1B=length(WORKSITE_STATE))%>%arrange(desc(NoH1B))
H1B_State$STATE<-tolower(state.name[match(H1B_State$WORKSITE_STATE,state.abb)])
states_map <- map_data("state")
ggplot(H1B_State, aes(map_id = STATE)) + 
    geom_map(aes(fill = NoH1B), map = states_map) +
    scale_fill_gradient(low="#dfd6e9", high="#0570b0", name="No. H1B Applicants")+ 
    expand_limits(x = states_map$long, y = states_map$lat) +
    theme(
    plot.background = element_blank(),
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    panel.border = element_blank(),
    axis.ticks = element_blank(),
    axis.text = element_blank(),
    axis.title = element_blank()
  )

I have found a lot of similarity between this map and the GDP by state map.