💻 Tutorial 01: Preparing your data for VIMuRe in R
VIMuRe v0.1.3 (latest)
If you use VIMuRe in your research, please cite (De Bacco et al. 2023).
TLDR: By the end of this tutorial, you should produce a data frame in the following format:
ego | alter | reporter | tie_type | layer | weight |
---|---|---|---|---|---|
103101 | 103201 | 103201 | borrowmoney | money | 1 |
101201 | 102901 | 102901 | borrowmoney | money | 1 |
111904 | 112602 | 111904 | lendmoney | money | 1 |
113201 | 104001 | 104001 | borrowmoney | money | 1 |
101303 | 115704 | 101303 | lendmoney | money | 1 |
112901 | 113601 | 113601 | borrowmoney | money | 1 |
104801 | 104901 | 104901 | borrowmoney | money | 1 |
108803 | 107503 | 108803 | lendmoney | money | 1 |
102901 | 103701 | 103701 | borrowmoney | money | 1 |
117202 | 115504 | 117202 | lendmoney | money | 1 |
Introduction
Before you start using VIMuRe, you need to prepare your data in a specific format. This tutorial will show you how to do that.
Here, we will illustrate the process of preparing data for the VIMuRe package using the “Data on Social Networks and Microfinance in Indian Villages” dataset (Banerjee et al. 2013b). This dataset contains network data on 75 villages in the Karnataka state of India.
⚙️ Setup
We will rely on the following packages in this tutorial:
library(tidyverse)
This gives us access to:
- the pipe %>%
- the haven package, which we will use to read the data that is stored in the Stata DTA format. haven is installed with the tidyverse but is not attached by library(tidyverse), so we call it with the haven:: prefix
- the dplyr package, which we will use to recode and select variables, and for functions like mutate() and if_else()
- the tidyr package, which we will use to reshape the data
Step 1: Download edgelist
Follow the steps below to download the data.
- Click on this link to download the dataset from Prof. Matthew O. Jackson’s website (see footnote 1). This will download a file called IndianVillagesDataFiles.zip in your working directory.
- Unzip the file. This will create a folder called 2010-0760_Data in your working directory. 💡 Tip: you can use the unzip() function from within R to unzip the file.
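The tip above can be sketched as follows. This is a minimal sketch that assumes the archive was saved under its default name in the current working directory; adjust zip_path if you saved it elsewhere.

```r
# Assumed location of the downloaded archive (default name, working directory)
zip_path <- "IndianVillagesDataFiles.zip"

if (file.exists(zip_path)) {
  # Extract the archive into the current working directory
  unzip(zip_path, exdir = ".")
  # Peek at the first few files of the raw CSV folder
  print(head(list.files(file.path("2010-0760_Data", "Data", "Raw_csv"))))
} else {
  message("Archive not found at: ", zip_path)
}
```

The file.exists() guard keeps the snippet from failing when the download has not happened yet.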
The data we need is in the 2010-0760_Data/Data/Raw_csv/ folder.
Step 2: Collect individual-level metadata
We also need individual-level metadata. This data is available from a separate source, the Harvard Dataverse (Banerjee et al. 2013a).
- Go to https://dataverse.harvard.edu/file.xhtml?fileId=2460959&version=9.4, read and accept the “License/Data Use Agreement” to gain access to the data. We are using version 9.4 of the dataset.
- Click on the “Access File” button, then “Download | ZIP Archive” to download the data.
- Unzip the file. This will create a folder called datav4.0 in your working directory.
The data we need is in the datav4.0/Data/2. Demographics and Outcomes/ folder.
- Read the data into R using the read_dta() function from the haven package:
# Read Stata DTA files
indivinfo <- haven::read_dta("datav4.0/Data/2. Demographics and Outcomes/individual_characteristics.dta")
## One individual (6109803) is repeated twice; keep only the first occurrence
indivinfo <- indivinfo[!duplicated(indivinfo$pid), ]
- Ensure that the pid is a character vector:
indivinfo$pid <- as.character(indivinfo$pid)
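To see what these two clean-up steps do without loading the real file, here is a small sketch on a made-up data frame. The toy_info object and its values are purely illustrative and are not part of the dataset.

```r
# Toy metadata with one repeated pid (illustrative values only)
toy_info <- data.frame(
  pid     = c(100101, 100102, 100102),
  village = c(1, 1, 1)
)

# Keep only the first occurrence of each pid
toy_info <- toy_info[!duplicated(toy_info$pid), ]

# Store pid as a character vector so it can be matched against pasted IDs later
toy_info$pid <- as.character(toy_info$pid)

print(toy_info)
```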
Step 3: Build an edge list per village
We will now build the edge list for each village. We will illustrate the process for village 1, but if you scroll down you will find the full script for all villages.
3.1. Read metadata
Let’s first subset the individual-level metadata to keep only the relevant village:
# Keep track of where the edgelist files are stored
RAW_CSV_FOLDER <- "2010-0760_Data/Data/Raw_csv"

# Let's focus on just one village for now
selected_village <- 1

# Filter the individual-level metadata to keep only the relevant village
resp <- subset(indivinfo, indivinfo$village == selected_village)
resp$didsurv <- 1
The didsurv column is a dummy variable that indicates whether the individual participated in the survey. We will need this information later to tell our Bayesian model who participated in the survey.
3.2. Read village data
Now, let’s read the village_1.csv file and merge it with the individual-level metadata:
village_file <- file.path(RAW_CSV_FOLDER, paste("village", selected_village, ".csv", sep = ""))
indiv <- read.csv(village_file, header = FALSE, as.is = TRUE)
colnames(indiv) <- c("hhid", "ppid", "gender", "age")

## gender (1-Male, 2-Female)
indiv$gender <- dplyr::recode(indiv$gender, "Male", "Female")

## pre-process pid to match the format in the individual-level metadata
indiv$pid <- ifelse(nchar(indiv$ppid) == 2,
                    paste(indiv$hhid, indiv$ppid, sep = ""),
                    paste(indiv$hhid, 0, indiv$ppid, sep = ""))

## Select only the relevant columns
selected_cols <- c("pid", "resp_status", "religion", "caste", "didsurv")
indiv <- merge(indiv, resp[, selected_cols], by = "pid", all.x = TRUE, all.y = TRUE)
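The pid construction above pads the within-household person number to two digits before appending it to the household ID. The sketch below reproduces that logic on made-up values; the sprintf() variant is an alternative shown for comparison, not the code used in this tutorial.

```r
# Toy household and person numbers (illustrative values only)
toy_indiv <- data.frame(hhid = c(1001, 1001, 1012), ppid = c(1, 12, 3))

# Same logic as above: insert a 0 when ppid has a single digit
toy_indiv$pid <- ifelse(nchar(toy_indiv$ppid) == 2,
                        paste(toy_indiv$hhid, toy_indiv$ppid, sep = ""),
                        paste(toy_indiv$hhid, 0, toy_indiv$ppid, sep = ""))

# Alternative: zero-pad ppid to width 2 with sprintf()
toy_indiv$pid_alt <- sprintf("%d%02d",
                             as.integer(toy_indiv$hhid),
                             as.integer(toy_indiv$ppid))

print(toy_indiv)
```

Both columns hold the same six-character identifiers, which is the format used by the pid column of the individual-level metadata.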
Which produces a dataframe that looks like this:
head(indiv)
pid | hhid | ppid | gender | age | resp_status | religion | caste | didsurv |
---|---|---|---|---|---|---|---|---|
100101 | 1001 | 1 | Male | 75 | NA | NA | NA | NA |
100102 | 1001 | 2 | Female | 55 | NA | NA | NA | NA |
100103 | 1001 | 3 | Male | 24 | NA | NA | NA | NA |
100104 | 1001 | 4 | Female | 19 | NA | NA | NA | NA |
100201 | 1002 | 1 | Male | 38 | 1 | 1 | 3 | 1 |
100202 | 1002 | 2 | Female | 27 | 2 | 1 | 3 | 1 |
3.3. Read reports per relationship type
The survey that produced this data collected information on a number of different types of relationships, four of which were “double sampled” (i.e., asked about in two ways: who people go to for that type of support, and who comes to them). Specifically, they asked about lending and borrowing money, giving and receiving advice, borrowing and lending household items like kerosene and rice, and visiting and receiving guests. These distinct questions are represented in the data files with the following names:
- lendmoney,
- borrowmoney,
- giveadvice,
- helpdecision,
- keroricecome,
- keroricego,
- visitcome
- visitgo
Each of these relationships is stored in a separate file. For example, the file lendmoney1.csv contains information on who reported lending money to whom in village 1.
We can read each of these files using the read.csv() function.
First, we inspect the data and define an ALL_NA_CODES variable. This is a vector of all the codes that, after inspection, we identified as representing missing values in the data:
ALL_NA_CODES <- c("9999999", "5555555", "7777777", "0")
We can then read in the data:
filepath_lendmoney <- file.path(RAW_CSV_FOLDER, paste("lendmoney", selected_village, ".csv", sep = ""))
lendmoney <- read.csv(filepath_lendmoney, header = FALSE, as.is = TRUE, na.strings = ALL_NA_CODES)
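To illustrate how the missing-value codes are handled, the following sketch reads a small in-memory CSV instead of the real file. The toy_csv string is made up for the example; only the NA codes come from the tutorial.

```r
# Codes that stand for missing values, as identified above
ALL_NA_CODES <- c("9999999", "5555555", "7777777", "0")

# A made-up three-row node list with missing-value codes in the last column
toy_csv <- "100201,107603,9999999\n100202,102902,0\n100601,101901,102601"

# text= lets read.csv() parse a string instead of a file on disk
toy <- read.csv(text = toy_csv, header = FALSE, as.is = TRUE,
                na.strings = ALL_NA_CODES)

print(toy)
```

The codes 9999999 and 0 in column V3 are read as NA, while the genuine pid 102601 is kept.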
What the data look like
The data is stored here as a node list, but it will need to be further pre-processed as an edge list:
V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 |
---|---|---|---|---|---|---|---|---|
100201 | 107603 | NA | NA | NA | NA | NA | NA | NA |
100202 | 102902 | NA | NA | NA | NA | NA | NA | NA |
100601 | 101901 | 102601 | NA | NA | NA | NA | NA | NA |
100602 | 100501 | 101902 | NA | NA | NA | NA | NA | NA |
100701 | 100801 | 102101 | NA | NA | NA | NA | NA | NA |
100702 | 100801 | 104001 | NA | NA | NA | NA | NA | NA |
Each row represents the reports made by a single individual. The number in the first column is the pid (the “person identifier”) of the individual who reported the relationship. The remaining numbers in the same row, however many there are, are the pids of the individuals who were reported to be involved in the relationship.
3.4. Pre-process the data to build the edge list
We want the network data to be in the following format, plus a few additional columns:
ego | alter |
---|---|
100201 | 107603 |
100202 | 100201 |
100601 | 101901 |
100601 | 102601 |
100601 | 115501 |
100602 | 100501 |
100602 | 101902 |
100701 | 100801 |
100701 | 102101 |
100702 | 100801 |
To achieve this, we will need to pivot the data.
tie_type <- "lendmoney"

# Example with the lendmoney data
edgelist_lendmoney <- tidyr::pivot_longer(lendmoney, cols = !V1, values_drop_na = TRUE)

# View(edgelist_lendmoney) to see what the data look like
This produces a bogus name column, which we can drop. We should also rename the columns to something more meaningful. It is important that we add a reporter column. This will be the pid of the individual who reported the relationship.
edgelist_lendmoney <- edgelist_lendmoney %>%
  dplyr::select(-name) %>%
  rename(ego = V1, alter = value) %>%
  mutate(reporter = ego)

# Let's also add a column for the tie type
edgelist_lendmoney$tie_type <- tie_type

# Let's add a weight column too
edgelist_lendmoney$weight <- 1
Calling head(edgelist_lendmoney) now produces:
ego | alter | reporter | tie_type | weight |
---|---|---|---|---|
100201 | 107603 | 100201 | lendmoney | 1 |
100202 | 102902 | 100202 | lendmoney | 1 |
100601 | 101901 | 100601 | lendmoney | 1 |
100601 | 102601 | 100601 | lendmoney | 1 |
100602 | 100501 | 100602 | lendmoney | 1 |
100602 | 101902 | 100602 | lendmoney | 1 |
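The pivot-and-rename steps above can be tried on a toy node list without the survey files. The toy_nodelist values below are illustrative; the calls mirror the ones used in this section.

```r
library(dplyr)
library(tidyr)

# Toy node list: one row per reporter, nominated alters in V2, V3
toy_nodelist <- data.frame(
  V1 = c(100201, 100601),
  V2 = c(107603, 101901),
  V3 = c(NA, 102601)
)

toy_edgelist <- toy_nodelist %>%
  # One row per (reporter, nominated alter); empty slots are dropped
  pivot_longer(cols = !V1, values_drop_na = TRUE) %>%
  # The name column only records which slot (V2, V3) the alter came from
  select(-name) %>%
  rename(ego = V1, alter = value) %>%
  # The person who answered the question is the reporter
  mutate(reporter = ego)

print(toy_edgelist)
```

The reporter with two nominations contributes two rows, so the two-row node list becomes a three-row edge list.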
So far, we have only added tie_type = "lendmoney" to the data frame, but to make full use of VIMuRe, we also need to add the “flipped question”, which in this case is tie_type = "borrowmoney". This is because the survey asked two different questions about the same relationship: one about lending money and one about borrowing it. The process is the same as before, except that we need to flip the ego and alter columns at the end.
There are also some other data cleaning steps that we need to perform: remove self-loops, remove duplicates and keep only reports made by registered reporters. We will do all of that inside a function in the next section, to make it easier to re-use.
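Before wrapping everything in a function, here is a sketch of the flip and the three cleaning steps on made-up "borrowmoney" reports. The toy values and the toy_reporters vector are assumptions for illustration, and the flip is written with base R rather than the rename() call used in the function that follows.

```r
library(dplyr)

# Toy "borrowmoney" reports: the reporter names who they borrowed from
toy_borrow <- data.frame(
  ego   = c("100201", "100201", "100202", "100301"),
  alter = c("100202", "100202", "100202", "100201"),
  stringsAsFactors = FALSE
)
toy_borrow$reporter <- toy_borrow$ego

# Flip ego and alter so that ego is always the lender; reporter is unchanged
flipped <- data.frame(
  ego      = toy_borrow$alter,
  alter    = toy_borrow$ego,
  reporter = toy_borrow$reporter,
  stringsAsFactors = FALSE
)

# Hypothetical set of people who took part in the survey
toy_reporters <- c("100201", "100202")

cleaned <- flipped %>%
  filter(reporter %in% toy_reporters) %>%  # keep reports by registered reporters
  filter(ego != alter) %>%                 # remove self-loops
  distinct()                               # remove duplicate reports

print(cleaned)
```

Of the four toy reports, one comes from a non-reporter, one is a self-loop and one is a duplicate, so a single edge remains.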
4. Automating the process
4.1. Create a function to get the data for a given village and tie type
This function will also take care of the data cleaning steps that we described in the previous section. Importantly, it will also map the double-sampled tie types to the layer names we will use in VIMuRe.
Click here to expand the code for the get_karnataka_survey_data() function
get_karnataka_survey_data <- function(
    village_id,
    tie_type,
    indivinfo,
    ties_layer_mapping = list(
      borrowmoney = "money",
      lendmoney = "money",
      giveadvice = "advice",
      helpdecision = "advice",
      keroricego = "kerorice",
      keroricecome = "kerorice",
      visitgo = "visit",
      visitcome = "visit"
    ),
    all_na_codes = c("9999999", "5555555", "7777777", "0"),
    raw_csv_folder = RAW_CSV_FOLDER) {

  # Filter the individual-level metadata to keep only the relevant village
  resp <- subset(indivinfo, indivinfo$village == village_id)
  resp$didsurv <- 1

  village_file <- file.path(raw_csv_folder, paste("village", village_id, ".csv", sep = ""))
  metadata <- read.csv(village_file, header = FALSE, as.is = TRUE)
  colnames(metadata) <- c("hhid", "ppid", "gender", "age")

  ## gender (1-Male, 2-Female)
  metadata$gender <- dplyr::recode(metadata$gender, "Male", "Female")

  ## pre-process pid to match the format in the individual-level metadata
  metadata$pid <- ifelse(nchar(metadata$ppid) == 2,
                         paste(metadata$hhid, metadata$ppid, sep = ""),
                         paste(metadata$hhid, 0, metadata$ppid, sep = ""))

  ## Select only the relevant columns
  selected_cols <- c("pid", "resp_status", "religion", "caste", "didsurv")
  metadata <- merge(metadata, resp[, selected_cols], by = "pid", all.x = TRUE, all.y = TRUE)

  # Read the raw node list for this tie type and village
  filepath <- file.path(raw_csv_folder, paste(tie_type, village_id, ".csv", sep = ""))
  df_raw <- read.csv(filepath, header = FALSE, as.is = TRUE, na.strings = all_na_codes)

  # Pivot the node list into an edge list
  edgelist <- tidyr::pivot_longer(df_raw, cols = !V1, values_drop_na = TRUE)

  edgelist <- edgelist %>%
    dplyr::select(-name) %>%
    rename(ego = V1, alter = value) %>%
    mutate(reporter = ego)

  # Let's also add a column for the tie type
  edgelist$tie_type <- tie_type

  # Let's add a weight column too
  edgelist$weight <- 1

  # If the question was "Did you borrow money from anyone?", then we need to flip the ego and alter columns
  if (tie_type %in% c("borrowmoney", "helpdecision", "keroricego", "visitgo")) {
    edgelist <- edgelist %>% rename(ego = alter, alter = ego)
  }

  # Create a layer column and reorder the columns to make it easier to work with VIMuRe later
  edgelist <- edgelist %>%
    mutate(layer = unlist(ties_layer_mapping[tie_type])) %>%
    select(ego, alter, reporter, tie_type, layer, weight)

  #### Further pre-processing steps ####

  # Who could actually report on the ties?
  reporters <- metadata %>%
    filter(didsurv == 1) %>%
    pull(pid) %>%
    as.vector()

  # All nodes that appear either as reporters or in any reported tie
  nodes <- reporters %>% union(edgelist$ego) %>% union(edgelist$alter)

  # Only keep reports made by those who were MARKED as reporters in metadata CSV
  edgelist <- edgelist %>% filter(reporter %in% reporters)

  # Remove self-loops
  edgelist <- edgelist %>% filter(ego != alter)

  # Remove duplicates
  edgelist <- edgelist %>% distinct()

  return(list(edgelist = edgelist, reporters = reporters))
}

# Default location of the individual-level metadata (the path used in Step 2)
DEFAULT_METADATA_FILEPATH <- "datav4.0/Data/2. Demographics and Outcomes/individual_characteristics.dta"

get_indivinfo <- function(metadata_file = DEFAULT_METADATA_FILEPATH) {
  indivinfo <- haven::read_dta(metadata_file)
  ## One individual (6109803) is repeated twice. Remove the duplicate
  indivinfo <- indivinfo[!duplicated(indivinfo$pid), ]
  return(indivinfo)
}
4.2 Getting an edgelist per layer
Each double-sampled tie type is mapped to a layer in VIMuRe. The mapping can be seen in the function we created above and is also shown below.
ties_layer_mapping <- list(borrowmoney = "money",
                           lendmoney = "money",
                           giveadvice = "advice",
                           helpdecision = "advice",
                           keroricego = "kerorice",
                           keroricecome = "kerorice",
                           visitgo = "visit",
                           visitcome = "visit")
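As a quick illustration of how this mapping can be used, the sketch below looks up the layer for a few tie types and then inverts the list to see which tie types feed each layer. The object names tie_types, layers and tie_types_by_layer are introduced here for illustration only.

```r
ties_layer_mapping <- list(borrowmoney = "money",
                           lendmoney = "money",
                           giveadvice = "advice",
                           helpdecision = "advice",
                           keroricego = "kerorice",
                           keroricecome = "kerorice",
                           visitgo = "visit",
                           visitcome = "visit")

# Look up the layer of each tie type in a vector
tie_types <- c("lendmoney", "borrowmoney", "visitgo")
layers <- unlist(ties_layer_mapping[tie_types], use.names = FALSE)
print(layers)

# Invert the mapping: which tie types belong to each layer?
tie_types_by_layer <- split(names(ties_layer_mapping), unlist(ties_layer_mapping))
print(tie_types_by_layer$money)
```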
Therefore, to get the edgelist for, say, the money layer, we need to combine the borrowmoney and lendmoney tie types. We can do this by using the get_karnataka_survey_data function we created above.
# Get the edgelist for the money layer
output <- get_karnataka_survey_data(village_id = 1, tie_type = "lendmoney", indivinfo = indivinfo)
edgelist_lendmoney <- output$edgelist
reporters <- output$reporters

edgelist_borrowmoney <- get_karnataka_survey_data(village_id = 1, tie_type = "borrowmoney", indivinfo = indivinfo)$edgelist

edgelist_money <- rbind(edgelist_lendmoney, edgelist_borrowmoney)
which now gives us all the edges for the money layer:
set.seed(100) # set the random seed for reproducibility
edgelist_money %>% sample_n(size = 10, replace = FALSE)
ego | alter | reporter | tie_type | layer | weight |
---|---|---|---|---|---|
111903 | 111902 | 111902 | borrowmoney | money | 1 |
104901 | 104101 | 104101 | borrowmoney | money | 1 |
111401 | 109701 | 109701 | borrowmoney | money | 1 |
118001 | 112605 | 112605 | borrowmoney | money | 1 |
106205 | 106302 | 106205 | lendmoney | money | 1 |
100701 | 100801 | 100701 | lendmoney | money | 1 |
111502 | 108902 | 111502 | lendmoney | money | 1 |
100501 | 100801 | 100801 | borrowmoney | money | 1 |
112601 | 111903 | 111903 | borrowmoney | money | 1 |
117301 | 109505 | 109505 | borrowmoney | money | 1 |
The above is the format we want the data to be in! This format will make it easier to work with VIMuRe. Only the ego, alter and reporter columns are required; the tie_type, layer and weight columns are optional, but useful to have.
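A small sanity check for the required columns might look like the sketch below. check_edgelist() is a hypothetical helper written for this tutorial, not a function provided by VIMuRe.

```r
# Hypothetical helper: stop with an error if a required column is missing
check_edgelist <- function(edgelist, required = c("ego", "alter", "reporter")) {
  missing_cols <- setdiff(required, colnames(edgelist))
  if (length(missing_cols) > 0) {
    stop("Missing required column(s): ", paste(missing_cols, collapse = ", "))
  }
  invisible(TRUE)
}

# Toy edge list with only the required columns (illustrative values)
toy_edgelist <- data.frame(ego = "100201", alter = "107603", reporter = "100201")
ok <- check_edgelist(toy_edgelist)
```

Running the same check on a data frame without a reporter column raises an error that names the missing column.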
Use the full pre-processing script below to pre-process the data for all villages and all layers. For each village and layer, it saves the edge list to a file named like village1_money.csv, and the list of reporters, as a data frame, to a file named like village1_money_reporters.csv.
Click to see full pre-processing script
# Load the required packages
library(tidyverse)

# Set the working directory accordingly
# setwd("C:/Users/.../karnataka_survey")

VALID_VILLAGE_IDS <- c(1:12, 14:21, 23:77) # village IDs 13 and 22 are missing
RAW_CSV_FOLDER <- "2010-0760_Data/Data/Raw_csv"

ties_layer_mapping <- list(borrowmoney = "money",
                           lendmoney = "money",
                           giveadvice = "advice",
                           helpdecision = "advice",
                           keroricego = "kerorice",
                           keroricecome = "kerorice",
                           visitgo = "visit",
                           visitcome = "visit")

get_karnataka_survey_data <- function(
    village_id,
    tie_type,
    indivinfo,
    ties_layer_mapping = list(
      borrowmoney = "money",
      lendmoney = "money",
      giveadvice = "advice",
      helpdecision = "advice",
      keroricego = "kerorice",
      keroricecome = "kerorice",
      visitgo = "visit",
      visitcome = "visit"
    ),
    all_na_codes = c("9999999", "5555555", "7777777", "0"),
    raw_csv_folder = RAW_CSV_FOLDER) {

  # Filter the individual-level metadata to keep only the relevant village
  resp <- subset(indivinfo, indivinfo$village == village_id)
  resp$didsurv <- 1

  village_file <- file.path(raw_csv_folder, paste("village", village_id, ".csv", sep = ""))
  metadata <- read.csv(village_file, header = FALSE, as.is = TRUE)
  colnames(metadata) <- c("hhid", "ppid", "gender", "age")

  ## gender (1-Male, 2-Female)
  metadata$gender <- dplyr::recode(metadata$gender, "Male", "Female")

  ## pre-process pid to match the format in the individual-level metadata
  metadata$pid <- ifelse(nchar(metadata$ppid) == 2,
                         paste(metadata$hhid, metadata$ppid, sep = ""),
                         paste(metadata$hhid, 0, metadata$ppid, sep = ""))

  ## Select only the relevant columns
  selected_cols <- c("pid", "resp_status", "religion", "caste", "didsurv")
  metadata <- merge(metadata, resp[, selected_cols], by = "pid", all.x = TRUE, all.y = TRUE)

  # Read the raw node list for this tie type and village
  filepath <- file.path(raw_csv_folder, paste(tie_type, village_id, ".csv", sep = ""))
  df_raw <- read.csv(filepath, header = FALSE, as.is = TRUE, na.strings = all_na_codes)

  # Pivot the node list into an edge list
  edgelist <- tidyr::pivot_longer(df_raw, cols = !V1, values_drop_na = TRUE)

  edgelist <- edgelist %>%
    dplyr::select(-name) %>%
    rename(ego = V1, alter = value) %>%
    mutate(reporter = ego)

  # Let's also add a column for the tie type
  edgelist$tie_type <- tie_type

  # Let's add a weight column too
  edgelist$weight <- 1

  # If the question was "Did you borrow money from anyone?", then we need to flip the ego and alter columns
  if (tie_type %in% c("borrowmoney", "helpdecision", "keroricego", "visitgo")) {
    edgelist <- edgelist %>% rename(ego = alter, alter = ego)
  }

  # Create a layer column and reorder the columns to make it easier to work with VIMuRe later
  edgelist <- edgelist %>%
    mutate(layer = unlist(ties_layer_mapping[tie_type])) %>%
    select(ego, alter, reporter, tie_type, layer, weight)

  #### Further pre-processing steps ####

  # Who could actually report on the ties?
  reporters <- metadata %>%
    filter(didsurv == 1) %>%
    pull(pid) %>%
    as.vector()

  # All nodes that appear either as reporters or in any reported tie
  nodes <- reporters %>% union(edgelist$ego) %>% union(edgelist$alter)

  # Only keep reports made by those who were MARKED as reporters in metadata CSV
  edgelist <- edgelist %>% filter(reporter %in% reporters)

  # Remove self-loops
  edgelist <- edgelist %>% filter(ego != alter)

  # Remove duplicates
  edgelist <- edgelist %>% distinct()

  return(list(edgelist = edgelist, reporters = reporters))
}

get_layer <- function(village_id, layer_name, indivinfo,
                      raw_csv_folder = RAW_CSV_FOLDER) {

  tie_types <- list(
    money = c("borrowmoney", "lendmoney"),
    advice = c("giveadvice", "helpdecision"),
    kerorice = c("keroricego", "keroricecome"),
    visit = c("visitgo", "visitcome")
  )

  selected_tie_types <- tie_types[[layer_name]]

  edgelist <- data.frame()
  reporters <- c()

  # Stack the edge lists of both tie types and pool their reporters
  for (tie_type in selected_tie_types) {
    data <- get_karnataka_survey_data(village_id, tie_type, indivinfo, raw_csv_folder = raw_csv_folder)
    edgelist <- rbind(edgelist, data$edgelist)
    reporters <- union(reporters, data$reporters)
  }

  return(list(edgelist = edgelist, reporters = reporters))
}

# Read Stata DTA files
indivinfo <- haven::read_dta("datav4.0/Data/2. Demographics and Outcomes/individual_characteristics.dta")
## One individual (6109803) is repeated twice; keep only the first occurrence
indivinfo <- indivinfo[!duplicated(indivinfo$pid), ]

for (i in VALID_VILLAGE_IDS) {
  for (layer_name in c("money", "advice", "kerorice", "visit")) {
    data <- get_layer(i, layer_name, indivinfo, raw_csv_folder = RAW_CSV_FOLDER)
    edgelist <- data$edgelist
    reporters <- data$reporters

    # Save the edgelist
    edgelist_file <- file.path(paste("village", i, "_", layer_name, ".csv", sep = ""))
    write.csv(edgelist, edgelist_file, row.names = FALSE)

    # Save the reporters
    reporters_file <- file.path(paste("village", i, "_", layer_name, "_reporters.csv", sep = ""))
    write.csv(data.frame(reporter = reporters), reporters_file, row.names = FALSE)
  }
}
References
Footnotes
1. Note that the authors provide a different version of the network data on Harvard Dataverse (Banerjee et al. 2013a). However, we will use the raw version provided by Prof. Jackson in this tutorial, as the Dataverse version has already had some pre-processing (importantly, the adjacency matrices have been made symmetric), while the version provided by Prof. Jackson gives the original node list. We will use the Harvard Dataverse files just for the metadata.