Few More Thoughts

Data Extraction from Outlook Attachments using R

Geethika Wijewardene — Thu, 11 Nov 2021 10:37:23 GMT

When Excel files arrive as e-mail attachments, how can we extract the data and consolidate it into a single table?

Here I present an automated process to extract the attachments from Outlook emails and consolidate them using R. I use the RDCOMClient package (https://github.com/omegahat/RDCOMClient or https://www.stat.berkeley.edu/~nolan/stat133/Fall05/lectures/DCOM.html). Thus, this solution will work only on Windows.

For instance, my research team is stationed in remote areas where they have no internet access. They take measurements hourly and record them in an Excel template. At the end of the day, they email me the Excel file as an attachment with a common subject (REA0001 - Measurements). I need to extract these hourly measurements into one table for analysis. If I receive 50 emails in a day, am I going to open each attachment manually and copy the data into a file? That approach is time-consuming, tedious and error-prone, so I use the following code to automate the job.

Step 1: Extract emails with the same subject from the Outlook mailbox

In this example, every email is sent with the same subject. Therefore, I use the subject to search for the emails in the mailbox. Note that the Outlook application needs to be open while the code runs.

library(RDCOMClient)
library(dplyr)
library(stringr)

working_dir<-"C:/Users/geethika.wijewardena/Workspace/R-extract-email-attachments/"

#--------------------------------------------
# Extract emails from outlook
#--------------------------------------------
# Create a new instance of Outlook COM server class
outlook_app <- COMCreate("Outlook.Application")
# Create a search object to search the mail box by given criteria (e.g. subject)
search <- outlook_app$AdvancedSearch(
  "Inbox",
  "urn:schemas:httpmail:subject = 'REA0001 - Measurements'"
)
# Allow some time for the search to complete
Sys.sleep(5)
results <- search$Results()

Step 2: Filter emails by date and extract data in attachment

The results object above contains all emails with the given subject. However, since I need only the ones I received today, I filter the emails by date. Next, for each email, I save the attachment. I present two approaches, saving the attachment by:

a) the filename of the attachment, and
b) the name of the sender (in case the filenames are inconsistent).

In approach (a), each saved attachment is read within the loop and loaded into a dynamically named variable that takes the attachment's filename. The measurement field is renamed to the filename. Later, all these tibbles are joined into a single table.

#------------------------------------------------------------------------------
# Approach (a)
# Extract emails and save the attachment by the name of the attachment
#------------------------------------------------------------------------------

# Filter search results by receive date
# (seq_len() avoids looping over 1:0 when the search returns no results)
for (i in seq_len(results$Count())){
  receive_date <- as.Date("1899-12-30") + floor(results$Item(i)$ReceivedTime())
  if(receive_date >= as.Date("2019-10-09")) {
    # Get the attachment of each email and save it by the name of the attachment
    #   in a given file path
    email <- results$Item(i)
    attachment_file <- paste0(working_dir,email$Attachments(1)[['DisplayName']])
    email$Attachments(1)$SaveAsFile(attachment_file)

    # Read each attachment and assign data into a variable (which is the filename)
    #   generated dynamically, 
    df_name <- str_sub(email$Attachments(1)[['DisplayName']],1,-6)
    data <- readxl::read_excel(attachment_file, col_types =c("date", "numeric"),
                               col_names = T) %>% 
      rename(!!df_name := "Case")%>% 
      mutate(Hour = str_sub(as.character(Hour),11,nchar(as.character(Hour))))
    assign(df_name, data)
  }
}

# Consolidate all dataframes into one
dat <- lapply(ls(pattern="REA"), function(x) get(x)) %>% 
  purrr::reduce(full_join, by = "Hour")
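
The date filter in the loop above adds the number returned by ReceivedTime() to 1899-12-30. That date is day zero of COM (OLE Automation) dates: the value counts days since then, with the time of day as the fractional part. As a language-neutral illustration, here is a minimal Python sketch of the same conversion. The serial value 43747.44 is a made-up example and the function names are hypothetical.

```python
from datetime import date, datetime, timedelta

# Day zero of COM (OLE Automation) dates
OLE_EPOCH = datetime(1899, 12, 30)

def ole_to_datetime(serial):
    """Convert a COM date serial (days since 1899-12-30) to a datetime."""
    return OLE_EPOCH + timedelta(days=serial)

def received_on_or_after(serial, cutoff):
    """Mirror the R filter: compare only the date part with the cut-off."""
    return ole_to_datetime(serial).date() >= cutoff

# Hypothetical ReceivedTime value: the morning of 9 October 2019
serial = 43747.44
print(ole_to_datetime(serial).date())
print(received_on_or_after(serial, date(2019, 10, 9)))      # kept by the filter
print(received_on_or_after(serial - 1, date(2019, 10, 9)))  # one day earlier, dropped
```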

In approach (b), the getDataFromEmailAtt() function filters each email by date, saves the attachment under the name of the sender and returns a tibble with the Measurement field renamed to the sender's name. This function is called within a loop that joins each data set into one table.

#------------------------------------------------------------------------------
# Approach (b)
# Extract emails and save the attachment by the name of the sender
#------------------------------------------------------------------------------
getDataFromEmailAtt <- function(results, i){
  # Function to extract data from an email attachment, save it in a specified
  # directory by the name of the sender, read the saved Excel file and return
  # a dataframe with a given column named by the sender's name.
  # Args: results - object returned by search$Results() of RDCOMClient for
  #                 Outlook applications.
  #       i - order number of the extracted emails in the results object
  # Returns: Dataset of the email attachment with the given column renamed by
  #          the sender's name, or NULL if the email is older than the cut-off
  #
  receive_date <- as.Date("1899-12-30") + floor(results$Item(i)$ReceivedTime())
  if(receive_date >= as.Date("2019-10-09")) {
    # Get the attachment of each email and save it by the name of the sender
    #   in a given file path
    email <- results$Item(i)
    sender <- email[['SenderName']]
    attachment_file <- paste0(working_dir, sender, '.xlsx')
    email$Attachments(1)$SaveAsFile(attachment_file)

    data <- readxl::read_excel(attachment_file, col_names = T) %>%
      rename(!!sender := "Measurement") %>%
      mutate(Hour = str_sub(as.character(Hour),11,nchar(as.character(Hour))))
    return(data)
  }
  # Emails received before the cut-off date are skipped
  return(NULL)
}

# Consolidate the datasets of all emails, skipping those filtered out by date
dat <- NULL
for (i in seq_len(results$Count())){
  data <- getDataFromEmailAtt(results, i)
  if (is.null(data)) next
  dat <- if (is.null(dat)) data else inner_join(dat, data, by = c('Hour'))
}

The material for this example is in my GitHub repo https://github.com/geethika01/R-extract-email-attachments.

Can we ever accept the null hypothesis?

Geethika Wijewardene — Thu, 28 Oct 2021 11:56:49 GMT

Not having enough evidence to reject the null hypothesis doesn't mean the null hypothesis is necessarily true. Here I explain why, using an example.

Suppose we suspect that students in a certain college are more inclined to use drugs than U.S. college students in general. The proportion of drug users among college students in general is 0.157. We take two random samples, of 100 and 400 students, from the college. The proportion of drug users in both samples is 0.19 (19/100 and 76/400). Since this proportion is higher than the population proportion (0.157), can we declare that students in this college are more inclined to use drugs?

Hypothesis testing

Step 1: State the null hypothesis (H0) and the alternative hypothesis (Ha).

Step 2: Collect relevant data from a random sample and summarize them (using a test statistic).

2.1 - Check that the conditions under which the test can be reliably used (n*p >= 10 and n*(1-p) >= 10) are met.
2.2 - Calculate the test statistic.

The test statistic describes how far the observed sample proportion (p^) is from the population proportion (p), measured in standard deviations. It is calculated using the following formula.

z = (p^ - p) / sqrt(p * (1 - p) / n)

Note: When we draw random samples of size n from a population with proportion p, the possible values of the sample proportion (p^) form the sampling distribution of the proportion, which has mean p and the standard deviation given by the following formula.

standard deviation of p^ = sqrt(p * (1 - p) / n)

Step 3: Find the p-value, the probability of observing data like those observed assuming that H0 is true.

Step 4: Based on the p-value, decide whether we have enough evidence to reject H0 (and accept Ha), and draw our conclusions in context.
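
As a rough check on the formulas in Step 2, the following Python sketch computes the standard deviation of the sampling distribution, the test statistic and a one-sided p-value from the normal approximation, using only the standard library. The function name is hypothetical. It does not apply the continuity correction that R's prop.test() uses, so its p-values come out somewhat smaller than the ones R reports for the same data.

```python
import math

def one_sample_prop_z(x, n, p):
    """z test of H0: proportion = p against Ha: proportion > p
    (normal approximation, no continuity correction)."""
    p_hat = x / n
    # Standard deviation of the sampling distribution of p_hat under H0
    sd = math.sqrt(p * (1 - p) / n)
    # Distance of p_hat from p, in standard deviations
    z = (p_hat - p) / sd
    # Upper-tail area of the standard normal distribution
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return sd, z, p_value

# The two samples from the example: 19/100 and 76/400, with p = 0.157
for x, n in [(19, 100), (76, 400)]:
    sd, z, p_value = one_sample_prop_z(x, n, p=0.157)
    print(f"n = {n}: sd = {sd:.3f}, z = {z:.2f}, p-value = {p_value:.3f}")
```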

Test hypothesis as below

Step 1:

H0 - Proportion of drug users in the college is the same as the population proportion (p = p0)

Ha - Proportion of drug users in the college is higher than the population proportion (p > p0)

Steps 2 and 3:

Sample 1 : n = 100; mean proportion of the population (p) = 0.157; standard deviation = 0.036; observed proportion (p^) = 0.19

n*p = 100 * 0.157 = 15.7

n*(1-p) = 100 * (1 - 0.157) = 84.3

Sample 2 : n = 400; mean proportion of the population (p) = 0.157; standard deviation = 0.018; observed proportion (p^) = 0.19

n*p = 400 * 0.157 = 62.8

n*(1-p) = 400 * (1 - 0.157) = 337.2

Calculate the test statistic and the p-value in R as below.

> # Sample 1
> p_1 <- prop.test(x=19, n=100, p=0.157, alternative = "greater", conf.level = 0.95, correct = T)
> p_1

    1-sample proportions test with continuity correction

data:  19 out of 100, null probability 0.157
X-squared = 0.59236, df = 1, p-value = 0.2208
alternative hypothesis: true p is greater than 0.157
95 percent confidence interval:
 0.1297316 1.0000000
sample estimates:
   p 
0.19 
> # Sample 2
> p_2 <- prop.test(x=76, n=400, p=0.157, alternative = "greater", conf.level = 0.95, correct = T)
> p_2

    1-sample proportions test with continuity correction

data:  76 out of 400, null probability 0.157
X-squared = 3.0466, df = 1, p-value = 0.04045
alternative hypothesis: true p is greater than 0.157
95 percent confidence interval:
 0.1586989 1.0000000
sample estimates:
   p 
0.19

According to sample 1 (n = 100 and p-value = 0.22 > 0.05), if the true proportion in the college were 0.157, we would still see a sample proportion of 0.19 or higher in about 22% of samples of 100 students, which is not unusual. Thus, we do not have enough evidence to reject H0, or to state that the 'proportion of drug users in the college is higher than the population proportion'. Therefore, can we accept the null hypothesis?

With a sample of 400 students, the p-value (0.04 < 0.05) says that a sample proportion of 0.19 or higher would occur in only about 4% of samples if the true proportion were 0.157. Now we have enough evidence to reject H0 and state that the 'proportion of drug users in the college is higher than the population proportion'.

Therefore, when the p-value of a sample is higher than 0.05, we can never accept H0; we can only state that we do not have enough evidence to reject it. It might be that the sample size was simply too small to detect a statistically significant difference. In other words, a larger sample with the same proportion can provide enough evidence to reject H0. For the same observed difference, a larger sample gives a smaller standard deviation of the sampling distribution and therefore a smaller p-value.
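
To see the effect of sample size, the sketch below holds the observed proportion at 0.19 and recomputes the one-sided p-value for increasing n. It is a Python approximation of what prop.test() does with correct = T in this one-sample case (a normal approximation with Yates' continuity correction), written with the standard library only. The function name is hypothetical.

```python
import math

def prop_test_greater(x, n, p):
    """One-sided test of H0: proportion = p against Ha: proportion > p,
    using the normal approximation with Yates' continuity correction."""
    expected = n * p
    yates = min(0.5, abs(x - expected))
    x_squared = (abs(x - expected) - yates) ** 2 / (n * p * (1 - p))
    # Signed square root of the chi-squared statistic
    z = math.copysign(math.sqrt(x_squared), x / n - p)
    # Upper-tail area of the standard normal distribution
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return x_squared, p_value

# Same observed proportion (0.19), increasing sample size
for n in (100, 200, 400, 800):
    x = round(0.19 * n)
    x_squared, p_value = prop_test_greater(x, n, p=0.157)
    print(f"n = {n:3d}, x = {x:3d}: X-squared = {x_squared:.4f}, p-value = {p_value:.4f}")
```

For n = 100 and n = 400 this should agree with the X-squared and p-values in the R output above, and the p-value keeps falling as n grows even though the observed proportion never changes.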

Bayes Rule - Notes and Examples

Geethika Wijewardene — Fri, 20 Aug 2021 12:43:14 GMT

What are my chances of being pregnant if the over-the-counter pregnancy test turns out to be positive? What are my chances of getting cancer if I smoke? Or what are my chances of having cancer if my mammogram is negative? Bayes rule can be used to answer....

Brush up on Conditional Probability

Conditional probability is the probability of an event occurring, given that one or more other events have already occurred. If two events are independent of each other, then P(B|A) = P(B). In general, including when event B depends on event A, P(B|A) is as below.

P(B|A) = P(A intersect B)/P(A)

NOTE: P(A and B) is the same as P(A intersect B).

Example: Out of 1000 people, Democratic Male = 200; Democratic Female = 300; Republican Male = 300 and Republican Female = 200.

A = being a Democrat; B = being a woman

P(A and B) = 300/1000 = 0.3 = 30%

P(B|A) = P(A and B)/ P(A) = 0.3/0.5 = 0.6 = 60%
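
The same numbers can be derived directly from the raw counts. Below is a minimal Python sketch that uses exact fractions so the results match the hand calculation. The dictionary layout is just one convenient way to hold the table.

```python
from fractions import Fraction

# Counts from the example: (party, sex) -> number of people
counts = {
    ("Democrat", "Male"): 200,
    ("Democrat", "Female"): 300,
    ("Republican", "Male"): 300,
    ("Republican", "Female"): 200,
}
total = sum(counts.values())

# A = being a Democrat, B = being a woman
p_a = Fraction(sum(n for (party, _), n in counts.items() if party == "Democrat"), total)
p_a_and_b = Fraction(counts[("Democrat", "Female")], total)

# Conditional probability: P(B|A) = P(A and B) / P(A)
p_b_given_a = p_a_and_b / p_a

print(float(p_a_and_b))    # 0.3
print(float(p_b_given_a))  # 0.6
```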

Bayes Rule

Bayes rule updates the probability of an event given a new piece of evidence.

For example, in 2011, there were 98 pregnancies for every 1,000 women (9.8%) aged 15–44 in the United States. Over-the-counter pregnancy tests return a positive result for 88% of pregnant women, and a negative result for 95% of women who are not pregnant. Given that a test is positive, what are my chances of being pregnant?

Terminology

Prior probability/ Base Rate: P(Preg=T) - Pregnant women= 9.8%

Posterior probability: P(Preg=T|Test = Pos) - Given a pregnancy test is positive, what is the probability of being pregnant?

Likelihood/ Sensitivity: P(Test = Pos|Preg=T) - Given a woman is pregnant, what is the probability of the test being positive?

Evidence/Marginal Likelihood: P(Test=Pos) - total probability of observing the evidence (i.e. probability of a test being positive)

Specificity: P(Test = Neg|Preg = F)- given a woman is not pregnant, what is the probability of the test being negative?

'Pr' = Pregnant 'not Pr' = not pregnant 'Pos' = Test is positive 'Neg' = Test is negative

P(Pos | Pr) = P(Pos and Pr)/ P(Pr)

P(Pr| Pos) = P(Pr and Pos)/ P(Pos)

But, P(Pos and Pr) = P(Pr and Pos)

Therefore, P(Pr|Pos) = P(Pos | Pr) * P(Pr) / P(Pos)

When the denominator (P(Pos)) is not available, we can calculate it by

P(Pos) = P( Pos |Pr) P(Pr) + P( Pos | not Pr) P(not Pr)

where P( Pos |not Pr) = 1 - P(Neg|not Pr)

# Function to calculate Bayes Rule in Python
def calcProbBayesRule(prob_prior, prob_sensitivity, prob_evidence=None, prob_specificity=None):
    # If P(evidence) is not supplied, derive it with the law of total probability
    if prob_evidence is None:
        if prob_specificity is None:
            raise ValueError('prob_specificity cannot be None when prob_evidence is None')
        prob_not_prior = 1 - prob_prior
        prob_evidence = ((prob_sensitivity * prob_prior) +
                         ((1 - prob_specificity) * prob_not_prior))
    # Bayes rule: posterior = likelihood * prior / evidence
    prob_posterior = prob_sensitivity * prob_prior / prob_evidence
    # Return the posterior as a percentage string rounded to 2 decimals
    return str(round(prob_posterior * 100, 2)) + '%'

Example 1: When the denominator is known

Cancer and Smoking : 5% of the population has cancer and 10% of the population are smokers. Also, 20% of the people with cancer are smokers. Given that a person is a smoker, what is the probability that he/she has cancer?

P(C) = 0.05 P(S) = 0.1

P(S|C) = 0.2

P(C|S) = 0.2 * 0.05/0.1 = 0.1 (10%)

# Calculation in Python
prob_cancer_given_smoking = calcProbBayesRule(0.05, 0.2, prob_evidence= 0.1)
print(prob_cancer_given_smoking)
10.0%

Example 2: When the denominator is unknown

Breast cancer and mammograms: 1% of women have breast cancer. 80% of mammograms detect breast cancer when it is there (and therefore 20% miss it). 9.6% of mammograms detect breast cancer when it’s not there (and therefore 90.4% correctly return a negative result).

P(C) = 0.01 ; P(not C) =0.99

P( Test= T|C ) = 0.8 ; P(Test=F|C) = 1 - P( Test= T|C ) = 0.2

P(Test = T| not C) = 0.096 ; P(Test=F | not C) = 0.904

a) For a woman whose mammogram returns positive, what is the probability of having breast cancer?

P(C|Test = T) = P(Test = T|C) * P(C)/ P(Test=T)

Since P(Test=T) is not given, it is derived by,

P(Test = T) = P(Test=T|C) P(C) + P(Test=T|not C) P(not C)

P(Test = T) = (0.8 * 0.01) + (0.096 * 0.99) = 0.103

P(C|Test = T) = 0.8 * 0.01/0.103 = 0.0776 (7.76%)

# Calculation in Python
prob_cancer_given_test_positive = calcProbBayesRule(0.01, 0.8, prob_specificity= 0.904)
print(prob_cancer_given_test_positive)
7.76%

Therefore, for a woman whose mammogram returns positive, there is only about an 8% chance of having cancer.
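
One way to make this result more intuitive is to restate it as counts. The Python sketch below applies the rates above to a hypothetical group of 100,000 women (the group size is an arbitrary choice for illustration) and counts the true and false positives.

```python
from fractions import Fraction

population = 100_000                        # hypothetical group size
prevalence = Fraction(1, 100)               # P(C) = 0.01
sensitivity = Fraction(8, 10)               # P(Test=T | C) = 0.8
false_positive_rate = Fraction(96, 1000)    # P(Test=T | not C) = 0.096

with_cancer = population * prevalence
without_cancer = population - with_cancer

true_positives = with_cancer * sensitivity               # cancer, positive mammogram
false_positives = without_cancer * false_positive_rate   # no cancer, positive mammogram

# Share of positive mammograms that belong to women who have cancer
p_cancer_given_positive = true_positives / (true_positives + false_positives)

print(int(true_positives), int(false_positives))
print(f"{float(p_cancer_given_positive):.2%}")
```

Of the 10,304 positive mammograms in this hypothetical group, only 800 come from women with cancer, because the much larger group without cancer produces far more false positives.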

b) For a woman whose mammogram returns negative, what is the probability of having cancer?

P(C| Test=F) = P(Test = F|C) * P(C)/ P(Test = F)

P(Test=F) = 1 - P(Test=T) = 1 - 0.103 = 0.897, or about 0.9

P(C| Test=F) = 0.2 * 0.01/0.9 = 0.0022 (0.22%)

# Calculation in Python
prob_cancer_given_test_negative = calcProbBayesRule(0.01, 0.2, prob_evidence= 0.9)
print(prob_cancer_given_test_negative)
0.22%

Therefore, for a woman whose mammogram returns negative, there is only a 0.22% probability of having cancer.

Pregnancy and over-the-counter tests: Referring to the example mentioned above,

P(Pr) = 0.098 P(not Pr) = 0.902

P(Pos | Pr) = 0.88 (Sensitivity = 88%)

P(Neg | not Pr) = 0.95 (Specificity = 95%)

P(Pos | not Pr) = 1 - P(Neg | not Pr) = 0.05

P(Pos) = (0.098 * 0.88) + (0.05 * 0.902) = 0.13134

P(Pr | Pos) = P(Pos | Pr) * P(Pr) / P(Pos) = 0.88 * 0.098 / 0.13134 = 0.6566 = 66%

# Calculation in Python
prob_preg_given_test_pos = calcProbBayesRule(0.098, 0.88,prob_specificity= 0.95 )
print(prob_preg_given_test_pos)
65.66%

Therefore, given the pregnancy rate in the USA in 2011, if an over-the-counter test turns out to be positive, there is still only a 66% chance of being pregnant, whether you like it or not!

How to learn about an unknown data set quickly? - R and Python

Geethika Wijewardene — Thu, 18 Feb 2021 11:32:55 GMT

When you come across an unknown data set, it is important to get to know it before diving into analysis. For instance, knowing the available fields, their data types, the counts of missing, unique and complete values, their distributions and the presence or absence of outliers helps to assess whether the data set suits the targeted analysis or where it needs cleaning.

R and Python have these functionalities readily available at various levels of detail.

R is my main EDA tool as of now and I am a big fan of the tidyverse. When I first come across a data set in R, I usually use the skim() function of the skimr package to get to know it. Trying to find a similar function in Pandas was a frustrating experience until I came across Google Facets. In this post, I will first introduce you to skim() and then show how to use Google Facets to get a similar outcome in Python.

Why skim() in R?

skim() is great to learn about the variables, their data type, missing values, unique values and some statistics on the distribution of variables of different types. Let me show you in an example below.

I use the Baby Names from Social Security Card Applications - National Data data set downloaded from https://catalog.data.gov/dataset/baby-names-from-social-security-card-applications-national-level-data.

Figure 1. Summary of the data set from skim() in R

As shown in Figure 1, skim() lists the dimensions of the data set first, then groups the variables by data type and shows the counts of missing, complete, total and unique values. Depending on the data type, it also shows some statistics on the distribution of the data.

Thus, using just one function call I was able to learn about the data set as below.

The data set consists of name, sex and the count of occurrences by year. All fields are complete, so there are no issues with missing values.
Data is available for 139 years, from 1880 to 2018.
Out of the 200K baby names over 139 years in the USA, about 98.4K are unique.
A name has been used about 176 times on average over 139 years. However, the distribution of name counts is right-skewed with a long tail, ranging from 5 to about 95K. That means a few names are much more popular than the rest.
The shortest name has 2 characters, while the longest has 11.

While summaries can be generated by summary() or str() functions, the information they provide to get a thorough understanding of the data set is limited (Figures 2 and 3).

Figure 3. Console output of str() in R

Exploring similar avenues in Python

Pandas has the info() and describe() functions, which give more or less the same details as str() and summary() in R (Figures 4 and 5).
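
To make concrete what such summaries report, here is a dependency-free Python sketch that computes skim()-style counts (missing, complete, unique) plus simple distribution statistics per column. It is not how pandas or skimr are implemented. The function name is hypothetical, and the rows are made up to mimic the shape of the baby names data.

```python
from statistics import mean

# Made-up rows in the shape of the baby names data; None marks a missing value
rows = [
    {"name": "Ada",     "sex": "F", "count": 120,  "year": 1880},
    {"name": "Bo",      "sex": "M", "count": 45,   "year": 1880},
    {"name": "Charlie", "sex": "M", "count": 300,  "year": 1881},
    {"name": "Ada",     "sex": "F", "count": None, "year": 1881},
]

def summarise_columns(rows):
    """Return per-column counts of missing, complete and unique values,
    plus min/mean/max for numeric columns or string lengths otherwise."""
    summary = {}
    for column in rows[0]:
        values = [row[column] for row in rows]
        present = [v for v in values if v is not None]
        stats = {
            "missing": len(values) - len(present),
            "complete": len(present),
            "unique": len(set(present)),
        }
        if present and all(isinstance(v, (int, float)) for v in present):
            stats.update(min=min(present), mean=mean(present), max=max(present))
        elif present:
            lengths = [len(str(v)) for v in present]
            stats.update(min_chars=min(lengths), max_chars=max(lengths))
        summary[column] = stats
    return summary

for column, stats in summarise_columns(rows).items():
    print(column, stats)
```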

Being spoiled by skim() in R, I looked for an alternative in Python and came across Google Facets https://pair-code.github.io/facets/. It is an open-source tool: you can either upload your data file to generate the summary, or embed it in Jupyter notebooks in Python. Summaries are generated as an 'Overview', similar to skim(), or in more depth as a 'Dive'.

Here's how to generate an overview of the data using Google Facets and Jupyter. Make sure the facets-overview package is installed in the python environment. The below code snippet is from https://github.com/PAIR-code/facets/tree/master/facets_overview. Make sure that the current data set is passed into ProtoFromDataFrames().

#@title Install the facets_overview pip package.
#!pip install facets-overview

# Create the feature stats for the datasets and stringify it.
import base64
from facets_overview.generic_feature_statistics_generator import GenericFeatureStatisticsGenerator

gfsg = GenericFeatureStatisticsGenerator()
proto = gfsg.ProtoFromDataFrames([{'name': 'babynames', 'table': dat}])
protostr = base64.b64encode(proto.SerializeToString()).decode("utf-8")

# Display the facets overview visualization for this data
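from IPython.display import display, HTML

HTML_TEMPLATE = """
        <script src="https://cdnjs.cloudflare.com/ajax/libs/webcomponentsjs/1.3.3/webcomponents-lite.js"></script>
        <link rel="import" href="https://raw.githubusercontent.com/PAIR-code/facets/1.0.0/facets-dist/facets-jupyter.html" >
        <facets-overview id="elem"></facets-overview>
        <script>
          document.querySelector("#elem").protoInput = "{protostr}";
        </script>"""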
html = HTML_TEMPLATE.format(protostr=protostr)
display(HTML(html))

Figure 4. Summary of the data set generated by Google Facets

As shown in Figure 4, Google Facets provides similar information to skim(): the variable types, their total counts, the counts of missing and unique values, and some statistics. In addition, I found the information on the top category very useful. For instance, we now know that the most popular name is William, although females outnumber males in the data set.

For more information on Google Facets: https://pair-code.github.io/facets/

Automate validation of tabular data sets and reports using R

Geethika Wijewardene — Tue, 10 Mar 2020 11:10:10 GMT

Data validation is a critical step in maintaining the accuracy of an analysis or report. For instance, there could be erroneous or missing values in the input data due to poor-quality data sources, or errors could be introduced during the analysis when data sources are merged, joined or manipulated incorrectly. Thus, data validation can, and should, be performed during the data cleansing stage prior to analysis and/or at the reporting stage. Manual validation of tabular reports with even several hundred records is time-consuming and error-prone, while errors in high-stakes reports are unacceptable and embarrassing.

Data validation can cover various aspects, such as checking data types, formats, uniqueness, missing values where they are not accepted, cardinality, data integrity and business rules/logic. Data validation is usually automated in database systems, but the extent of validation varies from one system to another. Moreover, databases are not the only data sources for analytical tasks. Thus, the quality of a data set is not always guaranteed, and validation is crucial in analytical work. Automating data validation contributes greatly to the efficient generation of high-quality reports.

Example

In this simple example I present an automated process to validate a data set containing personal identification information (POI) using the validate package in R.

Data Preparation

I created a data set of fictitious POI of 1000 people using the generator package. The data set contains the fields in Table 1 below.

NOTE: The over18 column is a logical column derived from the dateofbirth column. Data created by the generator package do not contain any errors. Thus, I infused the data set with some possible errors, such as missing values, duplicates, typos and inconsistent formats, so that they would be picked up during data validation. The complete code for data generation can be found at https://github.com/geethika01/Data-Validation/blob/master/Data%20Validation.R.

Table 1: Summary description of the POI data set

The first few rows of the final data set are shown below.

Table 2. First few rows of the POI data generated and infused with erroneous values

Summary of data issues is listed below.

Table 3. Summary of data issues infused into the data set

Data Validation

The validate package checks the data against a given set of rules. Thus, I first define the validation rules, which include checks for data formats, missing values, uniqueness, and some logic listed in Table 2 above.

These rules are then summarized as labels in a vector of strings.

labels_lst <-c(
    "id - consists only 9 digits"
  , "id - unique"
  , "firstname - contains no digits"
  , "firstname - Uppercase"
  , "lastname - contains no digits"
  , "lastname - Uppercase"
  , "dateofbirth - is a valid date in YYYY-mm-dd format and less than current date"
  , "email - in valid format"
  , "email - is unique"
  , "phone - in correct format (XXX XXX XXXX)"
  , "phone - is unique"
  , "gender - either M or F"
  , "over18 - valid values 1,0, NA and calculation is correct"
)

Second, I write each rule as an R expression; the rules are also listed in a vector. NOTE: These rules and their corresponding labels in the previous vector must follow the same order. The functions used in the rules are in the main script at https://github.com/geethika01/Data-validation/blob/master/Data%20Validation.R.

rules_lst <- c(
  # id
  "ifelse(!is.na(dat$id),(nchar(dat$id)== 9 &
                  grepl('[0-9]{9}', dat$id)),NA)== T"
  , "isDuplicated(dat$id)==T"
  # firstname
  , "ifelse(!is.na(dat$firstname), grepl('\\\\d', dat$firstname)==F, NA) == T"
  , "isUpperCase(dat$firstname)==T"
  # lastname
  , "ifelse(!is.na(dat$lastname), grepl('\\\\d', dat$lastname)==F, NA) == T"
  , "isUpperCase(dat$lastname)==T"
  # dateofbirth
  , "isValidDOBList(dat$dateofbirth)==T"
  # email
  , "isValidEmailList(dat$email)==T"
  , "isDuplicated(dat$email)==T"
  # Phone Number
  , "ifelse(!is.na(dat$phone), 
            (grepl('[0-9]{3}[ ][0-9]{3}[ ][0-9]{4}',dat$phone) &
                                          nchar(dat$phone) == 12), NA)==T"
  , "isDuplicated(dat$phone)==T"
  # gender
  , "ifelse(!is.na(dat$gender), dat$gender %in% c('M', 'F'), NA)==T"
  # over18
  , "isValidover18List(dat$dateofbirth,dat$over18)==T"
)

Now I create a data frame of the labels and rules, and validate the data set against the rules using the functions in the validate package. The result_validation object provides an elegant summary of the validation in terms of the number of passes, fails, missing values, errors in the rules and warnings, as in Table 4 below.

library(validate)

df <- data.frame(label = labels_lst, rule = rules_lst)
v <- validator(.data = df)
cf <- confront(dat,v)
quality <- as.data.frame(summary(cf))
measure <- as.data.frame(v)
result_validation <- (merge(quality,measure)) %>% 
  select(label, items, passes, fails, nNA, error, warning)
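
The idea behind this summary (one label per rule, and a count of passes, fails and missing values per rule) can be sketched independently of R. The following Python example uses only the standard library and is not the validate package's API. The records, the helper names and the subset of rules are hypothetical. Each label is stored together with its rule, so the two cannot drift out of order.

```python
import re

# Hypothetical records; None marks a missing value
records = [
    {"id": "123456789", "gender": "M", "phone": "021 555 0101"},
    {"id": "12345",     "gender": "X", "phone": "021 555 0101"},
    {"id": None,        "gender": "F", "phone": "0215550102"},
]

def check(predicate):
    """Wrap a predicate so a missing value yields None (NA) instead of a fail."""
    return lambda value: None if value is None else bool(predicate(value))

# (label, field, rule) triples
rules = [
    ("id - consists only 9 digits", "id",
     check(lambda v: re.fullmatch(r"[0-9]{9}", v))),
    ("gender - either M or F", "gender",
     check(lambda v: v in {"M", "F"})),
    ("phone - in correct format (XXX XXX XXXX)", "phone",
     check(lambda v: re.fullmatch(r"[0-9]{3} [0-9]{3} [0-9]{4}", v))),
]

def summarise(records, rules):
    """Count passes, fails and missing values for every rule."""
    summary = []
    for label, field, rule in rules:
        results = [rule(record[field]) for record in records]
        summary.append({
            "label": label,
            "items": len(results),
            "passes": sum(r is True for r in results),
            "fails": sum(r is False for r in results),
            "nNA": sum(r is None for r in results),
        })
    return summary

for row in summarise(records, rules):
    print(row)
```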

The summary table (Table 4) can be used to readily identify the data issues in the tabular data. However, in order to identify the actual records with issues, it is useful to generate a more detailed outcome, as shown in Table 5.

Table 4. Summary of data validation

# Locate every failed check (FALSE) as a (row, rule) pair
fail_vals <- data.frame(values(cf))
fail_vals <- as.matrix(fail_vals)
fail_vals <- as.data.frame(which(fail_vals==0, arr.ind=TRUE))
# Attach the rule label and the id of the offending record
fail_vals <- mutate(fail_vals, label = labels_lst[fail_vals$col]) %>%
  select(-col) %>% mutate(id = dat[fail_vals$row, 1])
# Look up the offending value; the column name is the label text before " - "
vals <- c()
for (i in 1:nrow(fail_vals)){
  vals[i] <- dat[fail_vals$row[i],
                 str_split(fail_vals$label[i]," - ")[[1]][1]]
}
fail_vals <- cbind(fail_vals,vals)

Table 5. First few rows of the detail outcome of data with issues

The validate package can be used to identify data issues both as a high-level summary and at the level of individual records, so that they can be traced back and fixed if needed. There are more elegant ways, such as graphical representations, to summarize the validation results, as presented in the references. Comparison of Tables 3 and 4 shows that the infused data issues have all been captured by the validation rules.

In this example, the validation rules I have implemented evaluate data formats, data types and some business rules. However, I have not covered validation of data integration or the merging of data sources. A simpler approach than using the validate package for this kind of validation is to write two independent scripts that generate the same tabular report from the same inputs, and compare the outputs using the compareDF package.

References

Validate package - https://cran.r-project.org/web/packages/validate/vignettes/introduction.html

Data manipulation in an Excel File with Hyperlinks using R

Geethika Wijewardene — Sat, 12 Oct 2019 10:09:16 GMT

If data manipulation is carried out in R, why not create the hyperlinks in R as well? Excel files use hyperlinks to navigate to external content, such as URLs or paths to other files. Excel uses the HYPERLINK() function for this purpose. Below I present

how to create hyperlinks
how to update an Excel file with hyperlinks in R

Part 1: Create excel reports with hyperlinks

Problem

How to create hyperlinks to external files in an excel workbook using R?

Solution

Here I present a simple scenario, where the hyperlinks are created next to the filename column of a worksheet using the writeFormula() function in the openxlsx package. For details and other scenarios of creating hyperlinks, visit https://rdrr.io/cran/openxlsx/man/makeHyperlinkString.html.

Example

I have generated a set of PDF files containing data on each country using the gapminder dataset. In the following code snippet, I first create a master table of country name and its PDF file name.

# Create Master table
library(gapminder)

country_lst <- unique(gapminder$country)
filename_lst <- paste0(country_lst, ".pdf")

df_master <- data.frame(country_lst, stringsAsFactors = F)
df_master <- cbind(df_master, filename_lst)
names(df_master) <- c("Country", "File_Name")
head(df_master)

##       Country       File_Name
## 1 Afghanistan Afghanistan.pdf
## 2     Albania     Albania.pdf
## 3     Algeria     Algeria.pdf
## 4      Angola      Angola.pdf
## 5   Argentina   Argentina.pdf
## 6   Australia   Australia.pdf

Now I create a workbook, write the master table and add hyperlinks using writeFormula(). This function takes the Excel formula HYPERLINK([link location], [friendly name]) as a string in the x argument. Thus, I generate this string dynamically for each row.

# Create an excel workbook and write data
library(openxlsx)

wb <- createWorkbook()
addWorksheet(wb, "Countries")
writeData(wb,sheet = "Countries", x = df_master)

# Add hyperlinks to filenames
# Row 1 holds the header, so the data occupy rows 2 to length(country_lst) + 1
for(i in 2:(length(country_lst) + 1)) {
  formula <- paste0('HYPERLINK(B',i, ', "Link to File")')
  writeFormula(wb, sheet ="Countries", startRow = i, startCol = 3
 , x = formula)
}

# Save the workbook
saveWorkbook(wb, "Gapminder_Countries.xlsx", overwrite = T)

Part 2: Update excel file with hyperlinks without touching the existing data

Problem

Forget about the above section, where I created hyperlinks. Now I have an Excel file with hyperlinks to external files. I need to do some data manipulation and add a new column to this file. If I do the data manipulation in R and write the entire dataframe to a new file without configuring the hyperlinks as mentioned above, I will lose the hyperlinks. So how can I write only the new columns to the existing file, such that the existing data are not touched?

Solution

I can do the data manipulation in R and write only the new columns to the existing file by specifying the range.

Example

Add the average change in life expectancy and population over 50 years (1957 - 2007) to the masterfile Gapminder_Countries.xlsx that I created in Part 1 above.

library(dplyr)
dat <- gapminder %>%
  group_by(country) %>%
  summarise(avg_change_LE = round((max(lifeExp) - min(lifeExp))/50, 1),
            avg_change_Pop = (max(pop) - min(pop))/50)

head(dat)

## # A tibble: 6 x 3
##   country     avg_change_LE avg_change_Pop
##   <fct>               <dbl>          <dbl>
## 1 Afghanistan           0.3        469292.
## 2 Albania               0.4         46357.
## 3 Algeria               0.6        481074.
## 4 Angola                0.3        163768.
## 5 Argentina             0.3        448499.
## 6 Australia             0.2        234859.

Now I write only the avg_change_LE and avg_change_Pop columns to the existing Countries worksheet of the workbook. First I create a new dataframe selecting only the new columns. The data are NOT joined/merged on a common field when written to the worksheet. Therefore, the rows in the dataframe and in the worksheet need to follow the same order without gaps. Also make sure to specify the correct start column and row where the new columns should be written.

All the material for this example is at https://github.com/geethika01/data-manipulation-with-R .