<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Few More Thoughts]]></title><description><![CDATA[Few More Thoughts about data, automation and related bits from a curious mind]]></description><link>https://fewmorethoughts.com</link><generator>RSS for Node</generator><lastBuildDate>Thu, 09 Apr 2026 14:12:45 GMT</lastBuildDate><atom:link href="https://fewmorethoughts.com/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Data Extraction from Outlook Attachments using R]]></title><description><![CDATA[When excel files get delivered through e-mail attachments, how can we extract the data and consolidate into a single table? 
Here I present an automated process to extract the attachments from Outlook emails and consolidate them using R. I use the RD...]]></description><link>https://fewmorethoughts.com/data-extraction-from-outlook-attachments-using-r</link><guid isPermaLink="true">https://fewmorethoughts.com/data-extraction-from-outlook-attachments-using-r</guid><category><![CDATA[Data Science]]></category><category><![CDATA[data analysis]]></category><category><![CDATA[automation]]></category><category><![CDATA[R Language]]></category><dc:creator><![CDATA[Geethika Wijewardene]]></dc:creator><pubDate>Thu, 11 Nov 2021 10:37:23 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1648376827522/Zmcjqhsj0.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>When excel files get delivered through e-mail attachments, how can we extract the data and consolidate into a single table? </p>
<p>Here I present an automated process to extract the attachments from Outlook emails and consolidate them using <strong>R</strong>. I use the <code>RDCOMClient</code> package (https://github.com/omegahat/RDCOMClient or https://www.stat.berkeley.edu/~nolan/stat133/Fall05/lectures/DCOM.html). Thus, this solution will work only on <em>Windows</em>.</p>
<p>For instance, my research team is stationed in remote areas, where they have no internet access. They take measurements hourly and record them in an Excel template. At the end of the day, they email me the Excel file as an attachment with a common subject (<code>REA0001 - Measurements</code>). I need to extract these hourly measurements into one table for analysis. If I receive 50 emails in a day, am I going to manually open each attachment and copy the data into a file? That approach is time-consuming, tedious and error-prone. Instead, I use the following code to automate the job.</p>
<h2 id="heading-step-1-extract-emails-with-the-same-subject-from-outlook-mailbox">Step 1 : Extract emails with the same subject from outlook mailbox</h2>
<p>In this example, every email is sent with the same subject. Therefore, I use the subject to search for the emails in the mailbox. Note that the Outlook application needs to be open while the code is running.</p>
<pre><code>library(RDCOMClient)
library(dplyr)
library(stringr)

working_dir&lt;-"C:/Users/geethika.wijewardena/Workspace/R-extract-email-attachments/"

#<span class="hljs-comment">--------------------------------------------</span>
# Extract emails <span class="hljs-keyword">from</span> outlook
#<span class="hljs-comment">--------------------------------------------</span>
# <span class="hljs-keyword">Create</span> a <span class="hljs-built_in">new</span> instance <span class="hljs-keyword">of</span> Outlook COM <span class="hljs-keyword">server</span> <span class="hljs-keyword">class</span>
outlook_app &lt;- COMCreate("Outlook.Application")
# <span class="hljs-keyword">Create</span> a <span class="hljs-keyword">search</span> <span class="hljs-keyword">object</span> <span class="hljs-keyword">to</span> <span class="hljs-keyword">search</span> the mail <span class="hljs-type">box</span> <span class="hljs-keyword">by</span> given criteria (e.g. subject)
<span class="hljs-keyword">search</span> &lt;- outlook_app$AdvancedSearch(
  "Inbox",
  "urn:schemas:httpmail:subject = 'REA0001 - Measurements'"
)
# Allow <span class="hljs-keyword">some</span> <span class="hljs-type">time</span> <span class="hljs-keyword">for</span> the <span class="hljs-keyword">search</span> <span class="hljs-keyword">to</span> complete
Sys.sleep(<span class="hljs-number">5</span>)
results &lt;- <span class="hljs-keyword">search</span>$Results()
</code></pre><h2 id="heading-step-2-filter-emails-by-date-and-extract-data-in-attachment">Step 2: Filter emails by date and extract data in attachment</h2>
<p>The <code>results</code> object above contains all emails with the given subject. However, since I need only the ones I received today, I filter the emails by date. Next, for each email, I save the attachment. I present two approaches to save the attachment by:  </p>
<ul>
<li>a) filename of the attachment and </li>
<li>b) name of the sender (in case the filename is inconsistent).</li>
</ul>
<p><strong>In approach (a)</strong>, each saved attachment is read within the loop and loaded into a dynamically created variable named after its filename. The <code>Measurement</code> field is renamed to the filename. Later, all these tibbles are joined into a single table.</p>
<pre><code>#<span class="hljs-comment">------------------------------------------------------------------------------</span>
# Approach (a)
# Extract emails <span class="hljs-keyword">and</span> save the attachment <span class="hljs-keyword">by</span> the <span class="hljs-type">name</span> <span class="hljs-keyword">of</span> the attachment
#<span class="hljs-comment">------------------------------------------------------------------------------</span>

# <span class="hljs-keyword">Filter</span> <span class="hljs-keyword">search</span> results <span class="hljs-keyword">by</span> receive <span class="hljs-type">date</span>
<span class="hljs-keyword">for</span> (i <span class="hljs-keyword">in</span> <span class="hljs-number">1</span>:results$Count()){
  receive_date &lt;- <span class="hljs-keyword">as</span>.Date("1899-12-30") + floor(results$Item(i)$ReceivedTime())
  <span class="hljs-keyword">if</span>(receive_date &gt;= <span class="hljs-keyword">as</span>.Date("2019-10-09")) {
    # <span class="hljs-keyword">Get</span> the attachment <span class="hljs-keyword">of</span> <span class="hljs-keyword">each</span> email <span class="hljs-keyword">and</span> save it <span class="hljs-keyword">by</span> the <span class="hljs-type">name</span> <span class="hljs-keyword">of</span> the attachment
    #   <span class="hljs-keyword">in</span> a given file <span class="hljs-type">path</span>
    email &lt;- results$Item(i)
    attachment_file &lt;- paste0(working_dir,email$Attachments(<span class="hljs-number">1</span>)[[<span class="hljs-string">'DisplayName'</span>]])
    email$Attachments(<span class="hljs-number">1</span>)$SaveAsFile(attachment_file)

    # <span class="hljs-keyword">Read</span> <span class="hljs-keyword">each</span> attachment <span class="hljs-keyword">and</span> assign data <span class="hljs-keyword">into</span> a variable (which <span class="hljs-keyword">is</span> the filename)
    #   <span class="hljs-keyword">generated</span> dynamically, 
    df_name &lt;- str_sub(email$Attachments(<span class="hljs-number">1</span>)[[<span class="hljs-string">'DisplayName'</span>]],<span class="hljs-number">1</span>,<span class="hljs-number">-6</span>)
    data &lt;- readxl::read_excel(attachment_file, col_types =c("date", "numeric"),
                               col_names = T) %&gt;% 
      <span class="hljs-keyword">rename</span>(!!df_name := "Measurement")%&gt;% 
      mutate(Hour = str_sub(<span class="hljs-keyword">as</span>.character(Hour),<span class="hljs-number">11</span>,<span class="hljs-type">nchar</span>(<span class="hljs-keyword">as</span>.character(Hour))))
    assign(df_name, data)
  }
}

# Consolidate <span class="hljs-keyword">all</span> dataframes <span class="hljs-keyword">into</span> one
dat &lt;- lapply(ls(pattern="REA"), <span class="hljs-keyword">function</span>(x) <span class="hljs-keyword">get</span>(x)) %&gt;% 
  purrr::reduce(full_join, <span class="hljs-keyword">by</span> = "Hour")
</code></pre><p><strong>In approach (b)</strong>, the <code>getDataFromEmailAtt()</code> function filters each email by date, saves its attachment under the sender's name and returns a tibble with the <code>Measurement</code> field renamed to the sender's name. This function is called within a loop, which joins each data set into one table.</p>
<pre><code><span class="hljs-comment">#------------------------------------------------------------------------------</span>
<span class="hljs-comment"># Approach (b)</span>
<span class="hljs-comment"># Extract emails and save the attachment by the name of the sender</span>
<span class="hljs-comment">#------------------------------------------------------------------------------</span>
getDataFromEmailAtt&lt;- <span class="hljs-function"><span class="hljs-keyword">function</span>(<span class="hljs-params">results, i</span>)</span>{
  <span class="hljs-comment"># Function to extract data from an email attachment, save it in a specified</span>
  <span class="hljs-comment"># directory by the name of the sender, read the saved excel file and return </span>
  <span class="hljs-comment"># a dataframe with a given column named by the sender's name.</span>
  <span class="hljs-comment"># Args: results - object returned by search$Results() of RDCOMClient for </span>
  <span class="hljs-comment">#                 outlook applications.</span>
  <span class="hljs-comment">#       i - order number of the extracted emails in the results object </span>
  <span class="hljs-comment"># Returns: Dataset of the email attachment with given column renamed by the </span>
  <span class="hljs-comment">#          sender's name</span>
  <span class="hljs-comment"># </span>
  receive_date &lt;- <span class="hljs-keyword">as</span>.Date(<span class="hljs-string">"1899-12-30"</span>) + floor(results$Item(i)$ReceivedTime())
  <span class="hljs-keyword">if</span>(receive_date &gt;= <span class="hljs-keyword">as</span>.Date(<span class="hljs-string">"2019-10-09"</span>)) {
    <span class="hljs-comment"># Get the attachment of each email and save it by the name of the sender</span>
    <span class="hljs-comment">#   in a given file path</span>
    email &lt;- results$Item(i)
    <span class="hljs-comment"># The sender's name is used for both the saved file and the renamed column</span>
    df_name &lt;- email[[<span class="hljs-string">'SenderName'</span>]]
    attachment_file &lt;- paste0(working_dir,df_name,<span class="hljs-string">'.xlsx'</span>)
    email$Attachments(<span class="hljs-number">1</span>)$SaveAsFile(attachment_file)

    data &lt;- readxl::read_excel(attachment_file, col_names = T) %&gt;% 
      rename(!!df_name := <span class="hljs-string">"Measurement"</span>)%&gt;% 
      mutate(Hour = str_sub(<span class="hljs-keyword">as</span>.character(Hour),<span class="hljs-number">11</span>,nchar(<span class="hljs-keyword">as</span>.character(Hour))))
    <span class="hljs-keyword">return</span>(data)
  }
}

<span class="hljs-comment"># Get the first dataset</span>
dat &lt;- getDataFromEmailAtt(results, i=<span class="hljs-number">1</span>)

<span class="hljs-comment"># Append datasets of the other emails</span>
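<span class="hljs-keyword">for</span> (i in <span class="hljs-number">2</span>:results$Count()){
  data &lt;- getDataFromEmailAtt(results, i)
  <span class="hljs-comment"># Skip emails excluded by the date filter (the function returns NULL for them)</span>
  <span class="hljs-keyword">if</span> (!is.null(data)) {
    dat &lt;- dat %&gt;% inner_join(data, by=c(<span class="hljs-string">'Hour'</span>))
  }
}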
</code></pre><p>The material for this example is in my GitHub repo https://github.com/geethika01/R-extract-email-attachments.</p>
]]></content:encoded></item><item><title><![CDATA[Can we ever accept the null hypothesis?]]></title><description><![CDATA[Not having enough evidence to reject the null hypothesis doesn't mean the null hypothesis is necessarily true. Here I explain why, using an example.
Students in a certain college are more inclined to use drugs than U.S. college students in general. T...]]></description><link>https://fewmorethoughts.com/can-we-ever-accept-the-null-hypothesis</link><guid isPermaLink="true">https://fewmorethoughts.com/can-we-ever-accept-the-null-hypothesis</guid><category><![CDATA[Data Science]]></category><category><![CDATA[data analysis]]></category><category><![CDATA[statistics]]></category><category><![CDATA[R Language]]></category><dc:creator><![CDATA[Geethika Wijewardene]]></dc:creator><pubDate>Thu, 28 Oct 2021 11:56:49 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1648381580753/0qGUQpuwq.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Not having enough evidence to reject the <strong>null hypothesis</strong> doesn't mean the null hypothesis is necessarily true. Here I explain why, using an example.</p>
<p>Suppose it is claimed that students in a certain college are more inclined to use drugs than U.S. college students in general. The proportion of drug users among college students in general is 0.157. We take two random samples of 100 and 400 students from the college. The proportion of drug users in both samples is 0.19 (19/100 and 76/400). Since this proportion is higher than the population proportion (0.157), can we declare that students in this college are more inclined to use drugs?</p>
<h3 id="heading-hypothesis-testing">Hypothesis testing</h3>
<p><strong>Step 1: </strong>State the null hypothesis (H0) and the alternative hypothesis (Ha).</p>
<p><strong>Step 2: </strong> Collect relevant data from a random sample and summarize them (using a test statistic)</p>
<ul>
<li><p><strong>2.1</strong> - Check that the conditions under which the test can be reliably used (n*p &gt;= 10 and n(1-p) &gt;= 10) are met.</p>
</li>
<li><p><strong>2.2</strong> - Calculate the test statistic</p>
</li>
</ul>
<p>The test statistic describes how far the observed sample proportion (p^) is from the population proportion (p), measured in standard deviations of the sampling distribution: z = (p^ - p) / standard deviation. It is calculated using the following formula.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1648381901932/QBDG-H83G.png" alt="image.png" /></p>
<p>Note: When we draw random samples of size n from a population with population proportion p, the sample proportion (p^) varies from sample to sample. Its sampling distribution has mean p and a standard deviation of sqrt(p(1-p)/n), as given by the following formula.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1648381928507/eFpFDtfAx.png" alt="image.png" /></p>
<p><strong>Step 3: </strong>Find the p-value, the probability of observing data at least as extreme as those observed, assuming that H0 is true.</p>
<p><strong>Step 4: </strong>Based on the p-value, decide whether we have enough evidence to reject H0 (and accept Ha), and draw our conclusions in context.</p>
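<p>The steps above can be sketched in a few lines of Python using only the standard library. This is a minimal sketch, not the method used for the results in this post: the function name <code>one_sample_prop_test</code> and its <code>correct</code> flag are made up for illustration, and the continuity correction is assumed to work like the <code>correct = T</code> option of R's <code>prop.test()</code>.</p>

```python
import math

def one_sample_prop_test(x, n, p0, correct=True):
    """One-sided (greater) z-test for a single proportion.

    Returns (z, p_value). With correct=True a 0.5/n continuity
    correction is applied (an assumption mirroring a Yates-style correction).
    """
    p_hat = x / n
    # Standard deviation of the sampling distribution under H0
    sd = math.sqrt(p0 * (1 - p0) / n)
    # Continuity correction, capped so it never exceeds the observed gap
    cc = min(0.5 / n, abs(p_hat - p0)) if correct else 0.0
    diff = p_hat - p0
    z = math.copysign(abs(diff) - cc, diff) / sd
    # Upper-tail probability of the standard normal distribution
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_value

# The two samples from the example: 19 of 100 and 76 of 400
for x, n in [(19, 100), (76, 400)]:
    z, p = one_sample_prop_test(x, n, p0=0.157)
    print(f"n={n}: z={z:.3f}, p-value={p:.4f}")
```

<p>Squaring the returned z gives a chi-squared statistic with one degree of freedom, which is the form in which R reports the result of this test.</p>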
<p><strong>Testing the hypothesis for our example</strong></p>
<p><strong>Step 1:</strong></p>
<p>H0 - Proportion of drug users in the college is the same as the population proportion (p = p0)</p>
<p>Ha - Proportion of drug users in the college is higher than the population proportion (p &gt; p0)</p>
<p><strong>Steps 2 and 3:</strong></p>
<p>Sample 1 : n = 100; mean proportion of the population (p) = 0.157; standard deviation = sqrt(0.157 * 0.843 / 100) = 0.036; observed proportion (p^) = 0.19</p>
<p>n*p = 100 * 0.157 = 15.7</p>
<p>n(1-p) = 100 * (1-0.157) = 84.3</p>
<p>Sample 2 : n = 400; mean proportion of the population (p) = 0.157; standard deviation = sqrt(0.157 * 0.843 / 400) = 0.018; observed proportion (p^) = 0.19</p>
<p>n*p = 400 * 0.157 = 62.8</p>
<p>n(1-p) = 400 * (1-0.157) = 337.2</p>
<p>Calculate the test statistic and the p-value in R as below.</p>
<pre><code><span class="hljs-string">&gt;</span> <span class="hljs-comment"># Sample 1</span>
<span class="hljs-string">&gt;</span> <span class="hljs-string">p_1</span> <span class="hljs-string">&lt;-</span> <span class="hljs-string">prop.test(x=19,</span> <span class="hljs-string">n=100,</span> <span class="hljs-string">p=0.157,</span> <span class="hljs-string">alternative</span> <span class="hljs-string">=</span> <span class="hljs-string">"greater"</span><span class="hljs-string">,</span> <span class="hljs-string">conf.level</span> <span class="hljs-string">=</span> <span class="hljs-number">0.95</span><span class="hljs-string">,</span> <span class="hljs-string">correct</span> <span class="hljs-string">=</span> <span class="hljs-string">T)</span>
<span class="hljs-string">&gt;</span> <span class="hljs-string">p_1</span>

    <span class="hljs-number">1</span><span class="hljs-string">-sample</span> <span class="hljs-string">proportions</span> <span class="hljs-string">test</span> <span class="hljs-string">with</span> <span class="hljs-string">continuity</span> <span class="hljs-string">correction</span>

<span class="hljs-attr">data:</span>  <span class="hljs-number">19</span> <span class="hljs-string">out</span> <span class="hljs-string">of</span> <span class="hljs-number">100</span><span class="hljs-string">,</span> <span class="hljs-literal">null</span> <span class="hljs-string">probability</span> <span class="hljs-number">0.157</span>
<span class="hljs-string">X-squared</span> <span class="hljs-string">=</span> <span class="hljs-number">0.59236</span><span class="hljs-string">,</span> <span class="hljs-string">df</span> <span class="hljs-string">=</span> <span class="hljs-number">1</span><span class="hljs-string">,</span> <span class="hljs-string">p-value</span> <span class="hljs-string">=</span> <span class="hljs-number">0.2208</span>
<span class="hljs-attr">alternative hypothesis:</span> <span class="hljs-literal">true</span> <span class="hljs-string">p</span> <span class="hljs-string">is</span> <span class="hljs-string">greater</span> <span class="hljs-string">than</span> <span class="hljs-number">0.157</span>
<span class="hljs-attr">95 percent confidence interval:</span>
 <span class="hljs-number">0.1297316</span> <span class="hljs-number">1.0000000</span>
<span class="hljs-attr">sample estimates:</span>
   <span class="hljs-string">p</span> 
<span class="hljs-number">0.19</span> 
<span class="hljs-string">&gt;</span> <span class="hljs-comment"># Sample 2</span>
<span class="hljs-string">&gt;</span> <span class="hljs-string">p_2</span> <span class="hljs-string">&lt;-</span> <span class="hljs-string">prop.test(x=76,</span> <span class="hljs-string">n=400,</span> <span class="hljs-string">p=0.157,</span> <span class="hljs-string">alternative</span> <span class="hljs-string">=</span> <span class="hljs-string">"greater"</span><span class="hljs-string">,</span> <span class="hljs-string">conf.level</span> <span class="hljs-string">=</span> <span class="hljs-number">0.95</span><span class="hljs-string">,</span> <span class="hljs-string">correct</span> <span class="hljs-string">=</span> <span class="hljs-string">T)</span>
<span class="hljs-string">&gt;</span> <span class="hljs-string">p_2</span>

    <span class="hljs-number">1</span><span class="hljs-string">-sample</span> <span class="hljs-string">proportions</span> <span class="hljs-string">test</span> <span class="hljs-string">with</span> <span class="hljs-string">continuity</span> <span class="hljs-string">correction</span>

<span class="hljs-attr">data:</span>  <span class="hljs-number">76</span> <span class="hljs-string">out</span> <span class="hljs-string">of</span> <span class="hljs-number">400</span><span class="hljs-string">,</span> <span class="hljs-literal">null</span> <span class="hljs-string">probability</span> <span class="hljs-number">0.157</span>
<span class="hljs-string">X-squared</span> <span class="hljs-string">=</span> <span class="hljs-number">3.0466</span><span class="hljs-string">,</span> <span class="hljs-string">df</span> <span class="hljs-string">=</span> <span class="hljs-number">1</span><span class="hljs-string">,</span> <span class="hljs-string">p-value</span> <span class="hljs-string">=</span> <span class="hljs-number">0.04045</span>
<span class="hljs-attr">alternative hypothesis:</span> <span class="hljs-literal">true</span> <span class="hljs-string">p</span> <span class="hljs-string">is</span> <span class="hljs-string">greater</span> <span class="hljs-string">than</span> <span class="hljs-number">0.157</span>
<span class="hljs-attr">95 percent confidence interval:</span>
 <span class="hljs-number">0.1586989</span> <span class="hljs-number">1.0000000</span>
<span class="hljs-attr">sample estimates:</span>
   <span class="hljs-string">p</span> 
<span class="hljs-number">0.19</span>
</code></pre><p>According to sample 1 (n = 100, p-value = 0.22 &gt; 0.05), if the true proportion in the college were 0.157, a sample of 100 students with 19% or more drug users would not be unusual: it would occur about 22% of the time. Thus, <strong>we do not have enough evidence to reject H0</strong>, or to state that the 'proportion of drug users in the college is <strong>higher</strong> than the population proportion'. Does that mean we can accept the null hypothesis?</p>
<p>With a sample of 400 students, the p-value (0.04 &lt; 0.05) shows that a sample proportion of 0.19 or more would be unlikely if the true proportion were 0.157. Now we have enough evidence to reject H0 and state that the 'proportion of drug users in the college is <strong>higher</strong> than the population proportion'.</p>
<p>Therefore, <strong>when the p-value of a sample is higher than 0.05, we can never accept H0, but only state that we do not have enough evidence to reject H0</strong>. It might be that the sample size was simply too small to detect a statistically significant difference. In other words, a larger sample with the same proportion can provide enough evidence to reject H0. <strong>For the same observed proportion, a larger sample gives a smaller p-value.</strong></p>
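<p>To see the effect of sample size in isolation, here is a small Python sketch that keeps the observed proportion fixed at 0.19 and varies only n. The function name <code>upper_tail_p_value</code> and the extra sample sizes 200 and 800 are assumptions added for illustration. It uses no continuity correction, so the values differ slightly from the <code>prop.test()</code> output above.</p>

```python
import math

def upper_tail_p_value(p_hat, n, p0):
    """One-sided p-value for H0: p = p0 against Ha: p greater than p0."""
    # z-score of the observed proportion under H0 (no continuity correction)
    z = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)
    # Upper-tail probability of the standard normal distribution
    return 0.5 * math.erfc(z / math.sqrt(2))

# Same observed proportion (0.19) and null value (0.157) as in the example;
# only the sample size changes.
p_values = {n: upper_tail_p_value(0.19, n, 0.157) for n in (100, 200, 400, 800)}
for n, p in p_values.items():
    print(f"n={n:4d}  p-value={p:.4f}")
```

<p>The p-value shrinks as n grows because the standard deviation of the sampling distribution shrinks, so the same gap of 0.033 corresponds to more standard deviations.</p>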
]]></content:encoded></item><item><title><![CDATA[Bayes Rule - Notes and Examples]]></title><description><![CDATA[What are my chances of being pregnant if the over-the-counter pregnancy test turns out to be positive? What are my chances of getting cancer if I smoke? Or what are my chances of having cancer if my mammogram is negative? Bayes rule can be used to an...]]></description><link>https://fewmorethoughts.com/bayes-rule-notes-and-examples</link><guid isPermaLink="true">https://fewmorethoughts.com/bayes-rule-notes-and-examples</guid><category><![CDATA[Data Science]]></category><category><![CDATA[data analysis]]></category><category><![CDATA[statistics]]></category><dc:creator><![CDATA[Geethika Wijewardene]]></dc:creator><pubDate>Fri, 20 Aug 2021 12:43:14 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1648380941449/mZFJYYdf_.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>What are my chances of being pregnant if the over-the-counter pregnancy test turns out to be positive? What are my chances of getting cancer if I smoke? Or what are my chances of having cancer if my mammogram is negative? <strong>Bayes rule</strong> can be used to answer....</p>
<h3 id="heading-brush-up-on-conditional-probability">Brush up on Conditional Probability</h3>
<p>Conditional probability is the probability of an event occurring, given that one or more other events have already occurred. If two events are independent of each other, then P(B|A) = P(B). In general, whether or not B depends on A, P(B|A) is given by the formula below.</p>
<p>P(B|A) =  P(A intersect B)/P(A)</p>
<p>NOTE: P(A and B) is the same as P(A intersect B).</p>
<p>Example: Out of 1000 people, Democratic Male = 200; Democratic Female = 300; Republican Male = 300 and Republican Female = 200.</p>
<p>A = being a Democrat and B = being a woman</p>
<p>P(A and B) = 300/1000 = 0.3 = 30%</p>
<p>P(B|A) = P(A and B)/ P(A) = 0.3/0.5 = 0.6 = 60%</p>
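<p>The same calculation can be written as a short Python sketch that works directly from the counts in the example. The variable names are made up for illustration.</p>

```python
# Counts from the example, keyed by (party, sex)
counts = {
    ("Democrat", "Male"): 200,
    ("Democrat", "Female"): 300,
    ("Republican", "Male"): 300,
    ("Republican", "Female"): 200,
}
total = sum(counts.values())

# A = being a Democrat, B = being a woman
p_a = sum(v for (party, _), v in counts.items() if party == "Democrat") / total
p_a_and_b = counts[("Democrat", "Female")] / total

# Conditional probability: P(B|A) = P(A and B) / P(A)
p_b_given_a = p_a_and_b / p_a

print(p_a, p_a_and_b, p_b_given_a)
```

<p>Note that P(B|A) can also be read straight from the counts: 300 of the 500 Democrats are women.</p>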
<h3 id="heading-bayes-rule">Bayes Rule</h3>
<p>Bayes' rule updates the probability of an event given a new piece of evidence.</p>
<p>For example, in 2011, there were 98 pregnancies for every 1,000 women (9.8%) aged 15–44 in the United States. Over-the-counter pregnancy tests return a positive result for 88% of pregnant women, and a negative result for 95% of women who are not pregnant. Given that a test is positive, what are my chances of being pregnant?</p>
<h3 id="heading-terminology">Terminology</h3>
<p><strong>Prior probability/ Base Rate:</strong> P(Preg=T) - Pregnant women= 9.8%</p>
<p><strong>Posterior probability:</strong> P(Preg=T|Test = Pos) - Given a pregnancy test is positive, what is the probability of being pregnant?</p>
<p><strong>Likelihood/ Sensitivity:</strong> P(Test = Pos|Preg=T) - Given a woman is pregnant, what is the probability of the test being positive?</p>
<p><strong>Evidence/Marginal Likelihood:</strong> P(Test=Pos) - total probability of observing the evidence (i.e. the probability of a test being positive)</p>
<p><strong>Specificity:</strong> P(Test = Neg|Preg = F)- given a woman is not pregnant, what is the probability of the test being negative?</p>
<p>'Pr' = Pregnant 'not Pr' = not pregnant 'Pos' = Test is positive 'Neg' = Test is negative</p>
<p>P(Pos | Pr) = P(Pos and Pr)/ P(Pr)</p>
<p>P(Pr| Pos) = P(Pr and Pos)/ P(Pos)</p>
<p>But, P(Pos and Pr) = P(Pr  and Pos)</p>
<p><strong>Therefore,  P(Pr|Pos) = P(Pos | Pr) * P(Pr) / P(Pos)</strong></p>
<p>When the denominator (P(Pos)) is not available, we can calculate it by</p>
<p>P(Pos) = P(Pos | Pr) * P(Pr) + P(Pos | not Pr) * P(not Pr)</p>
<p>where P( Pos |not Pr) = 1 - P(Neg|not Pr)</p>
<pre><code><span class="hljs-comment"># Function to calculate Bayes Rule in Python </span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">calcProbBayesRule</span>(<span class="hljs-params">prob_prior, prob_sensitivity, prob_evidence = None, prob_specificity = None</span>):</span>    
    <span class="hljs-keyword">if</span> prob_evidence <span class="hljs-keyword">is</span> <span class="hljs-literal">None</span>:
        <span class="hljs-keyword">if</span> prob_specificity <span class="hljs-keyword">is</span> <span class="hljs-literal">None</span>:
            <span class="hljs-keyword">raise</span> ValueError (<span class="hljs-string">'prob_specificity cannot be None when prob_evidence is None'</span>)
        <span class="hljs-keyword">else</span>:
            prob_not_prior = <span class="hljs-number">1</span> - prob_prior                        
            prob_evidence = ((prob_sensitivity * prob_prior) +
                            ((<span class="hljs-number">1</span>-prob_specificity) * prob_not_prior))            
            prob_posterior = prob_sensitivity * prob_prior/prob_evidence      
    <span class="hljs-keyword">else</span>:
         prob_posterior = prob_sensitivity * prob_prior/prob_evidence          
    <span class="hljs-keyword">return</span> (str(round(prob_posterior * <span class="hljs-number">100</span>,<span class="hljs-number">2</span>)) + <span class="hljs-string">'%'</span>)
</code></pre><h3 id="heading-example-1-when-the-denominator-is-known">Example 1: When the denominator is known</h3>
<p><strong>Cancer and Smoking :</strong> 5% of the population has cancer and 10% of the population are smokers. Also 20% of the people with cancer are smokers. Given that a person is a smoker, what is the probability that he/she will get cancer?</p>
<p>P(C) = 0.05 P(S) = 0.1</p>
<p>P(S|C) = 0.2</p>
<p>P(C|S) = 0.2 * 0.05/0.1 = 0.1 (10%)</p>
<pre><code><span class="hljs-comment"># Calculation in Python</span>
prob_cancer_given_smoking = calcProbBayesRule(<span class="hljs-number">0.05</span>, <span class="hljs-number">0.2</span>, prob_evidence= <span class="hljs-number">0.1</span>)
print(prob_cancer_given_smoking)
<span class="hljs-string">'10.0%'</span>
</code></pre><h3 id="heading-example-2-when-the-denominator-is-unknown">Example 2: When the denominator is unknown</h3>
<p><strong>Breast cancer and mammograms:</strong> 1% of women have breast cancer. 80% of mammograms detect breast cancer when it is there (and therefore 20% miss it). 9.6% of mammograms detect breast cancer when it’s not there (and therefore 90.4% correctly return a negative result).</p>
<p>P(C) = 0.01  ;  P(not C) =0.99</p>
<p>P( Test= T|C ) = 0.8  ;  P(Test=F|C) = 1 - P( Test= T|C ) = 0.2</p>
<p>P(Test = T| not C) = 0.096   ; P(Test=F | not C) = 0.904</p>
<p>a) For a woman whose mammogram returns positive, what is the probability of having breast cancer?</p>
<p>P(C|Test = T) = P(Test = T|C) * P(C)/ P(Test=T)</p>
<p>Since P(Test=T) is not given, it is derived by,</p>
<p>P(Test = T) = P(Test=T|C) * P(C) + P(Test=T|not C) * P(not C)</p>
<p>P(Test = T) = (0.8 * 0.01) + (0.096 * 0.99) = 0.103</p>
<p>P(C|Test = T) = 0.8 * 0.01/0.103 = 0.0776 (7.76%)</p>
<pre><code><span class="hljs-comment"># Calculation in Python</span>
prob_cancer_given_test_positive = calcProbBayesRule(<span class="hljs-number">0.01</span>, <span class="hljs-number">0.8</span>, prob_specificity= <span class="hljs-number">0.904</span>)
print(prob_cancer_given_test_positive)
<span class="hljs-string">'7.76%'</span>
</code></pre><p>Therefore, for a woman whose mammogram returns positive, there is only about an 8% chance of having cancer.</p>
<p>b) For a woman whose mammogram returns negative, what is the probability of having cancer?</p>
<p>P(C| Test=F) = P(Test = F|C) * P(C)/ P(Test = F)</p>
<p>P(Test=F) = 1 - P(Test=T) = 1 - 0.103 = 0.897, or about 0.9</p>
<p>P(C| Test=F) = 0.2 * 0.01/0.9 = 0.0022 (0.22%)</p>
<pre><code><span class="hljs-comment"># Calculation in Python</span>
prob_cancer_given_test_negative = calcProbBayesRule(<span class="hljs-number">0.01</span>, <span class="hljs-number">0.2</span>, prob_evidence= <span class="hljs-number">0.9</span>)
print(prob_cancer_given_test_negative)
<span class="hljs-string">'0.22%'</span>
</code></pre><p>Therefore, for a woman whose mammogram returns negative, there is only a 0.22% probability of having cancer.</p>
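<p>Parts (a) and (b) can also be computed together without rounding P(Test=F) by hand. The following is a minimal sketch; the helper name <code>posteriors</code> is made up for illustration and is separate from the <code>calcProbBayesRule</code> function used in this post.</p>

```python
def posteriors(prior, sensitivity, specificity):
    """Return (P(condition | positive test), P(condition | negative test))."""
    # Total probability of each test outcome (law of total probability)
    p_pos = sensitivity * prior + (1 - specificity) * (1 - prior)
    p_neg = 1 - p_pos
    # Bayes' rule applied once per test outcome
    given_pos = sensitivity * prior / p_pos
    given_neg = (1 - sensitivity) * prior / p_neg
    return given_pos, given_neg

# Mammogram example: prior 1%, sensitivity 80%, specificity 90.4%
given_pos, given_neg = posteriors(0.01, 0.8, 0.904)
print(f"{given_pos:.2%}  {given_neg:.2%}")
```

<p>Returning plain numbers rather than formatted strings keeps the results usable in further calculations.</p>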
<p><strong>Pregnancy and over-the-counter tests:</strong> Referring to the example mentioned above,</p>
<p>P(Pr) = 0.098 P(not Pr) = 0.902</p>
<p>P(Pos | Pr) = 0.88 (Sensitivity = 88%)</p>
<p>P(Neg | not Pr) = 0.95 (Specificity = 95%)</p>
<p>P(Pos | not Pr) = 1 - P(Neg | not Pr) = 0.05</p>
<p>P(Pos) = (0.098 * 0.88) + (0.05 * 0.902) = 0.13134</p>
<p>P(Pr | Pos) = P(Pos | Pr) * P(Pr)/ P(Pos) = 0.88 * 0.098/0.13134 = 0.6566 = 66%</p>
<pre><code><span class="hljs-comment"># Calculation in Python</span>
<span class="hljs-attribute">prob_preg_given_test_pos</span> = calcProbBayesRule(<span class="hljs-number">0</span>.<span class="hljs-number">098</span>, <span class="hljs-number">0</span>.<span class="hljs-number">88</span>,prob_specificity= <span class="hljs-number">0</span>.<span class="hljs-number">95</span> )
<span class="hljs-attribute">print</span>(prob_preg_given_test_pos)
<span class="hljs-attribute">65</span>.<span class="hljs-number">66</span>%
</code></pre><p>Therefore, given the pregnancy rate in the USA in 2011, if the over-the-counter test turns out to be positive, there is still only a 66% chance of being pregnant, whether you like it or not!</p>
]]></content:encoded></item><item><title><![CDATA[How to learn about an unknown data set quickly? - R and Python]]></title><description><![CDATA[When you come across an unknown data set, it is important to get to know about it before running into analysis. For instance, knowing the available fields, their data types, count of missing, unique or completed values and their distributions and pre...]]></description><link>https://fewmorethoughts.com/how-to-learn-about-an-unknown-data-set-quickly-r-and-python</link><guid isPermaLink="true">https://fewmorethoughts.com/how-to-learn-about-an-unknown-data-set-quickly-r-and-python</guid><category><![CDATA[Data Science]]></category><category><![CDATA[data analysis]]></category><category><![CDATA[R Language]]></category><category><![CDATA[automation]]></category><dc:creator><![CDATA[Geethika Wijewardene]]></dc:creator><pubDate>Thu, 18 Feb 2021 11:32:55 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1648379732499/Zs7zANb4J.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>When you come across an unknown data set, it is important to get to know about it before running into analysis. For instance, knowing the available fields, their data types, count of missing, unique or completed values and their distributions and presence/absence of outliers help to assess the suitability of the data set for the targeted analysis or where it needs cleaning. </p>
<p>R and Python have these functionalities readily available at various levels of detail.</p>
<p>R is my main EDA tool as of now and I am a big fan of <code>tidyverse</code>. When I first come across a data set in R, I usually use <code>skim()</code> function of the <code>skimr</code> package to get to know about the data set. Trying to find a similar function in <code>Pandas</code> was a frustrating experience until I came across <code>Google Facets</code>. In this post, I will first introduce you to <code>skim()</code> and show how to use <code>Google Facets</code> to get a similar outcome in Python.</p>
<h2 id="heading-why-skim-in-r">Why skim() in R?</h2>
<p><code>skim()</code> is great to learn about the variables, their data type, missing values, unique values and some statistics on the distribution of variables of different types. Let me show you in an example below.</p>
<p>I use the <code>Baby Names from Social Security Card Applications - National Data</code> data set downloaded from https://catalog.data.gov/dataset/baby-names-from-social-security-card-applications-national-level-data.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1648379942765/_fDXkpx5B.png" alt="image.png" /><em>Figure 1. Summary of the data set from skim() in R</em></p>
<p>As shown in Figure 1, <code>skim()</code> lists the dimensions of the data set first, then groups the variables by data type and shows the counts of missing, complete, total and unique values. Depending on the data type, it also shows some statistics on the distribution of the data.</p>
<p>Thus, using just one function call I was able to learn about the data set as below.</p>
<ol>
<li>The data set consists of name, sex and the number of occurrences by year. All fields are complete, so there are no issues with missing values.</li>
<li>Data is available for 139 years, from 1880 to 2018.</li>
<li>Out of the 200K baby names over 139 years in the USA, there are about 98.4K unique names.</li>
<li>A name has been re-used about 176 times on average over 139 years. However, the distribution of the names' counts is skewed, with a long tail on the right and a range from 5 to about 95K. That means several names are much more popular than others.</li>
<li>The shortest name has 2 characters, while the longest has 11.</li>
</ol>
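<p>The call behind a summary like this is a one-liner. The sketch below uses a tiny made-up table rather than the real file: <code>babynames_small</code> and all of its values are illustrative assumptions. The base R lines after the <code>skim()</code> call compute a few of the same headline figures by hand, for comparison.</p>
<pre><code>library(skimr)

# Hypothetical miniature of the baby names table (values are made up)
babynames_small &lt;- data.frame(
  name  = c("Mary", "Anna", "John", "William", NA),
  sex   = c("F", "F", "M", "M", "M"),
  count = c(50L, 20L, 60L, 55L, 5L),
  year  = c(1880L, 1880L, 1880L, 1880L, 1881L),
  stringsAsFactors = FALSE
)

# One call prints the dimensions and a per-type summary of every column
skim(babynames_small)

# A few of the same figures computed by hand
n_missing &lt;- colSums(is.na(babynames_small))                  # missing values per column
n_unique  &lt;- sapply(babynames_small,
                    function(x) length(unique(x[!is.na(x)])))   # distinct non-missing values
name_len  &lt;- range(nchar(babynames_small$name), na.rm = TRUE) # shortest and longest name
</code></pre>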
<p>While summaries can be generated by <code>summary()</code> or <code>str()</code> functions, the information they provide to get a thorough understanding of the data set is limited (Figures 2 and 3).</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1648380340489/Q4HDIgZ1R.png" alt="image.png" /><em>Figure 3. Console output of str() in R</em></p>
<h2 id="heading-exploring-similar-avenues-in-python">Exploring similar avenues in Python</h2>
<p>In Python, <code>Pandas</code> has the <code>info()</code> and <code>describe()</code> methods, which give more or less the same details as <code>str()</code> and <code>summary()</code> in R (Figures 4 and 5).</p>
<p>Being spoiled by <code>skim()</code> in R, I looked for an alternative in Python and came across <code>Google Facets</code> https://pair-code.github.io/facets/. It is an open-source tool: you can either upload your data file to generate the summary, or embed it into Jupyter notebooks in Python. Summaries are generated as an '<em>Overview</em>', similar to <code>skim()</code>, or in more depth as <code>Dive</code>. </p>
<p>Here's how to generate an overview of the data using Google Facets and Jupyter. Make sure the <code>facets-overview</code> package is installed in the Python environment. The code snippet below is from https://github.com/PAIR-code/facets/tree/master/facets_overview. Make sure that the current data set is passed into <code>ProtoFromDataFrames()</code>.</p>
<pre><code><span class="hljs-comment">#@title Install the facets_overview pip package.</span>
<span class="hljs-comment">#!pip install facets-overview</span>

<span class="hljs-comment"># Create the feature stats for the datasets and stringify it.</span>
<span class="hljs-keyword">import</span> base64
<span class="hljs-keyword">from</span> facets_overview.generic_feature_statistics_generator <span class="hljs-keyword">import</span> GenericFeatureStatisticsGenerator

gfsg = GenericFeatureStatisticsGenerator()
proto = gfsg.ProtoFromDataFrames([{<span class="hljs-string">'name'</span>: <span class="hljs-string">'babynames'</span>, <span class="hljs-string">'table'</span>: dat}])
protostr = base64.b64encode(proto.SerializeToString()).decode(<span class="hljs-string">"utf-8"</span>)

<span class="hljs-comment"># Display the facets overview visualization for this data</span>
<span class="hljs-keyword">from</span> IPython.display <span class="hljs-keyword">import</span> display, HTML

HTML_TEMPLATE = <span class="hljs-string">"""
        &lt;script src="https://cdnjs.cloudflare.com/ajax/libs/webcomponentsjs/1.3.3/webcomponents-lite.js"&gt;&lt;/script&gt;
        &lt;link rel="import" href="https://raw.githubusercontent.com/PAIR-code/facets/1.0.0/facets-dist/facets-jupyter.html" &gt;
        &lt;facets-overview id="elem"&gt;&lt;/facets-overview&gt;
        &lt;script&gt;
          document.querySelector("#elem").protoInput = "{protostr}";
        &lt;/script&gt;"""</span>
html = HTML_TEMPLATE.format(protostr=protostr)
display(HTML(html))
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1648380605890/BcIXyziMY.png" alt="image.png" /><em>Figure 4. Summary of the data set generated by Google Facets</em></p>
<p>As shown in Figure 4, <code>Google Facets</code> provides similar information to <code>skim()</code>: the variable types, their total counts, the counts of missing and unique values, and some summary statistics. In addition, I found the information provided on the top category very useful. For instance, we now know that the most popular name is <strong>William</strong>, although females outnumber males in the data set.  </p>
<p><strong>For more information on Google Facets: </strong>https://pair-code.github.io/facets/</p>
]]></content:encoded></item><item><title><![CDATA[Automate validation of tabular data sets and reports using R]]></title><description><![CDATA[Data validation is a critical step to maintain the accuracy of an analysis or reporting. For instance, there could be erroneous or missing values in the input data due to poor quality of the data sources or errors could occur during the stage of the ...]]></description><link>https://fewmorethoughts.com/automate-validation-of-tabular-data-sets-and-reports-using-r</link><guid isPermaLink="true">https://fewmorethoughts.com/automate-validation-of-tabular-data-sets-and-reports-using-r</guid><category><![CDATA[R Language]]></category><category><![CDATA[automation]]></category><category><![CDATA[data analysis]]></category><category><![CDATA[Data Science]]></category><dc:creator><![CDATA[Geethika Wijewardene]]></dc:creator><pubDate>Tue, 10 Mar 2020 11:10:10 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1648378219776/lE8IYCydY.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Data validation is a critical step to maintain the accuracy of an analysis or reporting. For instance, there could be erroneous or missing values in the input data due to poor quality of the data sources or errors could occur during the stage of the analysis where data sources are merged/joined or manipulated incorrectly.  Thus, data validation can be or should be performed during data cleansing stage prior to analysis and/or at the reporting stage. Manual validation of tabular reports with even several hundred records is a time consuming and an error prone approach, while presence of errors in high stake reports is unacceptable and embarrassing.  </p>
<p>Data validations can be carried out in various aspects, such as checking for data types, formats, uniqueness, presence of missing values where they are not accepted, cardinality checks, validation for data integrity and business rules/logic etc. Data validation is usually an automated process in data base systems, but the extent of validations may vary from one system to another. On the other hand, data bases are not the only data sources for analytical tasks. Thus, quality of a data set is always not guaranteed and validation is crucial in analytical work space. Automation of data validation largely contributes to efficient generation of high quality reports.</p>
<h2 id="heading-example">Example</h2>
<p>In this simple example I present an automated process to validate a data set containing personal identification information (POI) using the <code>validate</code> package in R.</p>
<h3 id="heading-data-preparation">Data Preparation</h3>
<p>I created a data set of fictitious POI of 1000 people using the Generator package. The data set contains fields in Table 1 below. </p>
<p><strong>NOTE: </strong><code>Over 18</code> column is a derived logical column from the  <code>dateofbirth</code> column. Data created by the Generator package do not contain any erroneous data. Thus, I infused the data set with some possible errors, such as missing values, duplicates, typos, inconsistent formats etc., so that they will be picked up during data validation. The complete code for data generation can be found at  https://github.com/geethika01/Data-Validation/blob/master/Data%20Validation.R.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1648378447027/S_amSHhpj.png" alt="image.png" /><em>Table 1: Summary description of the POI data set</em></p>
<p>The first few rows of the final data set are shown below.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1648378921949/VrDTIAzum.png" alt="image.png" /><em>Table 2. First few rows of the POI data generated and infused with erroneous values</em></p>
<p>A summary of the data issues is listed below.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1648378998039/cYq7KGmuU.png" alt="image.png" /><em>Table 3. Summary of data issues infused into the data set</em></p>
<h3 id="heading-data-validation">Data Validation</h3>
<p>The <code>validate</code> package checks the data against a given set of rules. Thus, I first define rules for the data validations, which include checks for data formats, missing values, uniqueness, and some logic listed in Table 2 above.</p>
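<p>Before going through the full rule set, here is a minimal sketch of how the package works on a tiny table. The data frame <code>toy</code> and its values are made up for illustration, and the two rules are simplified versions of the kind of checks used in this post, not the exact rules.</p>
<pre><code>library(validate)

# Hypothetical three-row table, not the POI data set
toy &lt;- data.frame(
  id     = c("123456789", "12345", "987654321"),
  gender = c("M", "X", "F"),
  stringsAsFactors = FALSE
)

# Each expression is one rule and is evaluated for every row
rules &lt;- validator(
  grepl("^[0-9]{9}$", id),
  gender %in% c("M", "F")
)

# confront() applies the rules to the data;
# summary() reports items, passes, fails and NAs per rule
toy_summary &lt;- summary(confront(toy, rules))
print(toy_summary)
</code></pre><p>In this toy table the second row breaks both rules, so each rule should report two passes and one fail.</p>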
<p>These rules are then summarized as labels in a vector of strings.</p>
<pre><code>labels_lst <span class="hljs-operator">&lt;</span><span class="hljs-operator">-</span>c(
    <span class="hljs-string">"id - consists only 9 digits"</span>
  , <span class="hljs-string">"id - unique"</span>
  , <span class="hljs-string">"firstname - contains no digits"</span>
  , <span class="hljs-string">"firstname - Uppercase"</span>
  , <span class="hljs-string">"lastname - contains no digits"</span>
  , <span class="hljs-string">"lastname - Uppercase"</span>
  , <span class="hljs-string">"dateofbirth - is a valid date in YYYY-mm-dd format and less than current date"</span>
  , <span class="hljs-string">"email - in valid format"</span>
  , <span class="hljs-string">"email - is unique"</span>
  , <span class="hljs-string">"phone - in correct format (XXX XXX XXXX)"</span>
  , <span class="hljs-string">"phone - is unique"</span>
  , <span class="hljs-string">"gender - either M or F"</span>
  , <span class="hljs-string">"over18 - valid values 1,0, NA and calculation is correct"</span>
)
</code></pre><p>Secondly, I write each rule as an R expression, and list the rules in a vector as well. NOTE: These rules and their corresponding labels in the previous vector must follow the same order. Also, the functions used in the rules are defined in the main script at https://github.com/geethika01/Data-validation/blob/master/Data%20Validation.R.</p>
<pre><code>rules_lst &lt;- c(
  <span class="hljs-comment"># id</span>
  <span class="hljs-string">"ifelse(!is.na(dat$id),(nchar(dat$id)== 9 &amp;
                  grepl('[0-9]{9}', dat$id)),NA)== T"</span>
  , <span class="hljs-string">"isDuplicated(dat$id)==T"</span>
  <span class="hljs-comment"># firstname</span>
  , <span class="hljs-string">"ifelse(!is.na(dat$firstname), grepl('\\\\d', dat$firstname)==F, NA) == T"</span>
  , <span class="hljs-string">"isUpperCase(dat$firstname)==T"</span>
  <span class="hljs-comment"># lastname</span>
  , <span class="hljs-string">"ifelse(!is.na(dat$lastname), grepl('\\\\d', dat$lastname)==F, NA) == T"</span>
  , <span class="hljs-string">"isUpperCase(dat$lastname)==T"</span>
  <span class="hljs-comment"># dateofbirth</span>
  , <span class="hljs-string">"isValidDOBList(dat$dateofbirth)==T"</span>
  <span class="hljs-comment"># email</span>
  , <span class="hljs-string">"isValidEmailList(dat$email)==T"</span>
  , <span class="hljs-string">"isDuplicated(dat$email)==T"</span>
  <span class="hljs-comment"># Phone Number</span>
  , <span class="hljs-string">"ifelse(!is.na(dat$phone), 
            (grepl('[0-9]{3}[ ][0-9]{3}[ ][0-9]{4}',dat$phone) &amp;
                                          nchar(dat$phone) == 12), NA)==T"</span>
  , <span class="hljs-string">"isDuplicated(dat$phone)==T"</span>
  <span class="hljs-comment"># gender</span>
  , <span class="hljs-string">"ifelse(!is.na(dat$gender), dat$gender %in% c('M', 'F'), NA)==T"</span>
  <span class="hljs-comment"># over18</span>
  , <span class="hljs-string">"isValidover18List(dat$dateofbirth,dat$over18)==T"</span>
)
</code></pre><p>Now I create a data frame of the labels and rules and validate the data set against the rules using the functions in the <code>validate</code> package (the pipe and <code>select()</code> come from <code>dplyr</code>). The <code>result_validation</code> object provides an elegant summary of the validation: the number of passes, fails, missing values, errors in the rules and warnings, as in Table 4 below.</p>
<pre><code>df <span class="hljs-operator">&lt;</span><span class="hljs-operator">-</span> data.frame(label <span class="hljs-operator">=</span> labels_lst, rule <span class="hljs-operator">=</span> rules_lst)
v <span class="hljs-operator">&lt;</span><span class="hljs-operator">-</span> validator(.data <span class="hljs-operator">=</span> df)
cf <span class="hljs-operator">&lt;</span><span class="hljs-operator">-</span> confront(dat,v)
quality <span class="hljs-operator">&lt;</span><span class="hljs-operator">-</span> <span class="hljs-keyword">as</span>.data.frame(summary(cf))
measure <span class="hljs-operator">&lt;</span><span class="hljs-operator">-</span> <span class="hljs-keyword">as</span>.data.frame(v)
result_validation <span class="hljs-operator">&lt;</span><span class="hljs-operator">-</span> (merge(quality,measure)) <span class="hljs-operator">%</span><span class="hljs-operator">&gt;</span><span class="hljs-operator">%</span> 
  select(label, items, passes, fails, nNA, <span class="hljs-function"><span class="hljs-keyword">error</span>, <span class="hljs-title">warning</span>)</span>
</code></pre><p>The summary table (Table 4) can be used to readily identify the data issues in the tabular data. However, in order to identify the actual records with issues, it is useful to generate a more detailed output, as shown in Table 5.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1648379176420/qCdPPoudN.png" alt="image.png" /><em>Table 4. Summary of data validation</em></p>
<pre><code> fail_vals <span class="hljs-operator">&lt;</span><span class="hljs-operator">-</span> data.frame(values(cf))
  fail_vals <span class="hljs-operator">&lt;</span><span class="hljs-operator">-</span> <span class="hljs-keyword">as</span>.matrix(fail_vals)
  fail_vals<span class="hljs-operator">&lt;</span><span class="hljs-operator">-</span> <span class="hljs-keyword">as</span>.data.frame(which(fail_vals<span class="hljs-operator">=</span><span class="hljs-operator">=</span><span class="hljs-number">0</span>, arr.ind=TRUE))
  fail_vals <span class="hljs-operator">&lt;</span><span class="hljs-operator">-</span> mutate(fail_vals, label <span class="hljs-operator">=</span> labels_lst[fail_vals$col])<span class="hljs-operator">%</span><span class="hljs-operator">&gt;</span><span class="hljs-operator">%</span> 
    select(<span class="hljs-operator">-</span>col) <span class="hljs-operator">%</span><span class="hljs-operator">&gt;</span><span class="hljs-operator">%</span> mutate(id <span class="hljs-operator">=</span> dat[fail_vals$row, <span class="hljs-number">1</span>])
  vals <span class="hljs-operator">&lt;</span><span class="hljs-operator">-</span> c()  
  <span class="hljs-keyword">for</span> (i in <span class="hljs-number">1</span>:nrow(fail_vals)){
    vals[i] <span class="hljs-operator">&lt;</span><span class="hljs-operator">-</span> dat[fail_vals$row[i], 
                   str_split(fail_vals$label[i],<span class="hljs-string">" - "</span>)[[<span class="hljs-number">1</span>]][<span class="hljs-number">1</span>]]            
  }
  fail_vals <span class="hljs-operator">&lt;</span><span class="hljs-operator">-</span> cbind(fail_vals,vals)
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1648379244697/LbxACYlE-.png" alt="image.png" /><em>Table 5. First few rows of the detailed outcome of data with issues</em></p>
<p>The <code>validate</code> package can be used to identify data issues both as a high-level summary and at the level of individual records, so that they can be traced back and fixed if needed. There are more elegant ways, such as graphical representations, to summarize the validation results, as presented in the references. A comparison of Tables 3 and 4 shows that the infused data issues have all been captured by the validation rules.</p>
<p>In this example, the data validation rules I have implemented evaluate the data formats, data types and some business rules. However, I have not covered validation of data integration or merging of data sources. For this kind of validation, a simpler alternative to the <code>validate</code> package is to write two independent scripts that generate the same tabular report from the same inputs, and compare the outputs using the <code>compareDF</code> package.  </p>
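<p>The idea of comparing two independently produced outputs can be sketched in base R. Note that this is a plain base R stand-in, not the <code>compareDF</code> package itself, and that <code>report_a</code> and <code>report_b</code> are made-up tables standing in for the outputs of the two scripts.</p>
<pre><code># Outputs of two hypothetical, independently written report scripts
report_a &lt;- data.frame(id = c(1, 2, 3), total = c(10, 20, 30))
report_b &lt;- data.frame(id = c(1, 2, 3), total = c(10, 25, 30))

# Quick yes/no answer: TRUE only when both tables match exactly
same &lt;- identical(report_a, report_b)

# Row-level view: join on the key and keep rows where the values disagree
both &lt;- merge(report_a, report_b, by = "id", suffixes = c("_a", "_b"))
mismatches &lt;- both[both$total_a != both$total_b, ]
print(mismatches)
</code></pre><p>Any row that shows up in <code>mismatches</code> points to a step where the two scripts merged or manipulated the data differently, which is where to start looking for the error.</p>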
<h3 id="heading-references">References</h3>
<p>Validate package - https://cran.r-project.org/web/packages/validate/vignettes/introduction.html</p>
]]></content:encoded></item><item><title><![CDATA[Data manipulation in an Excel File with Hyperlinks using R]]></title><description><![CDATA[If data manipulation is carried out in R, why not creating the hyperlinks in R as well? Excel files use hyperlinks to navigate to external content, such as, urls or file paths to some other files. Excel uses HYPERLINK() function for this purpose. Bel...]]></description><link>https://fewmorethoughts.com/data-manipulation-in-an-excel-file-with-hyperlinks-using-r</link><guid isPermaLink="true">https://fewmorethoughts.com/data-manipulation-in-an-excel-file-with-hyperlinks-using-r</guid><category><![CDATA[Data Science]]></category><category><![CDATA[R Language]]></category><category><![CDATA[automation]]></category><category><![CDATA[excel]]></category><category><![CDATA[data analysis]]></category><dc:creator><![CDATA[Geethika Wijewardene]]></dc:creator><pubDate>Sat, 12 Oct 2019 10:09:16 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1648360927252/zjF9U0JCt.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>If data manipulation is carried out in R, why not creating the hyperlinks in R as well? Excel files use hyperlinks to navigate to external content, such as, urls or file paths to some other files. Excel uses <code>HYPERLINK()</code> function for this purpose. Below I present </p>
<ol>
<li>how to create hyperlinks </li>
<li>how to update an excel file with hyperlinks in R</li>
</ol>
<h3 id="heading-part-1-create-excel-reports-with-hyperlinks">Part 1: Create excel reports with hyperlinks</h3>
<p><strong>Problem</strong></p>
<p>How to create hyperlinks to external files in an excel workbook using R?</p>
<p><strong>Solution</strong></p>
<p>Here I present a simple scenario, where the hyperlinks are created next to the filename column of a worksheet using <code>writeFormula()</code> from the <code>openxlsx</code> package. For details and other scenarios of creating hyperlinks, visit https://rdrr.io/cran/openxlsx/man/makeHyperlinkString.html.</p>
<p><strong>Example</strong></p>
<p>I have generated a set of PDF files containing data on each country using the <code>gapminder</code> dataset. In the following code snippet, I first create a master table of country name and its PDF file name.</p>
<pre><code># <span class="hljs-keyword">Create</span> Master <span class="hljs-keyword">table</span>
country_lst &lt;- <span class="hljs-keyword">unique</span>(gapminder$country)
filename_lst &lt;- paste0(country_lst, ".pdf")

df_master &lt;- data.frame(country_lst, stringsAsFactors = F)
df_master &lt;- cbind(df_master, filename_lst)
names(df_master) &lt;- c("Country", "File_Name")
head(df_master)
</code></pre><pre><code><span class="hljs-comment">##       Country       File_Name</span>
<span class="hljs-comment">## 1 Afghanistan Afghanistan.pdf</span>
<span class="hljs-comment">## 2     Albania     Albania.pdf</span>
<span class="hljs-comment">## 3     Algeria     Algeria.pdf</span>
<span class="hljs-comment">## 4      Angola      Angola.pdf</span>
<span class="hljs-comment">## 5   Argentina   Argentina.pdf</span>
<span class="hljs-comment">## 6   Australia   Australia.pdf</span>
</code></pre><p>Now I create a workbook, write the master table and add hyperlinks using <code>writeFormula()</code>. This function takes the <code>HYPERLINK([link location], [friendly name])</code> Excel formula as a string in the <code>x</code> argument. Thus, I generate this string dynamically for each row.</p>
<pre><code><span class="hljs-comment"># Create an excel workbook and write data</span>
<span class="hljs-attribute">wb</span> &lt;- createWorkbook()
addWorksheet(wb, <span class="hljs-string">"Countries"</span>)
writeData(wb,sheet = <span class="hljs-string">"Countries"</span>, x = df_master)

<span class="hljs-comment"># Add hyperlinks to filenames</span>
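<span class="hljs-comment"># Row 1 holds the header, so the data rows run from 2 to length + 1</span>
for(i in <span class="hljs-number">2</span>:(length(country_lst) + <span class="hljs-number">1</span>)) {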
  <span class="hljs-attribute">formula</span> &lt;- paste0(<span class="hljs-string">'HYPERLINK(B'</span>,i, <span class="hljs-string">', "Link to File")'</span>)
  writeFormula(wb, sheet =<span class="hljs-string">"Countries"</span>, startRow = i, startCol = <span class="hljs-number">3</span>
 , x = formula)
}

<span class="hljs-comment"># Save the workbook</span>
saveWorkbook(wb, <span class="hljs-string">"Gapminder_Countries.xlsx"</span>, overwrite = T)
</code></pre><h3 id="heading-part-2-update-excel-file-with-hyperlinks-without-touching-the-existing-data">Part 2: Update excel file with hyperlinks without touching the existing data</h3>
<p><strong>Problem</strong></p>
<p>Forget about the above section, where I created hyperlinks. Now I have an Excel file with hyperlinks to external files. I need to do some data manipulation and add a new column to this file. If I do the data manipulation in R and write the entire dataframe to a new file without configuring the hyperlinks as mentioned above, I will lose the hyperlinks. Hence, how can I write only the new columns to the existing file, such that the existing data are not touched?</p>
<p><strong>Solution</strong></p>
<p>I can do the data manipulation in R and write only the new columns to the existing file by specifying the range.</p>
<p><strong>Example</strong></p>
<p>Add the average change in life expectancy and  population over 50 years (1957 - 2007) to the masterfile <code>Gapminder_Countries.xlsx</code> that I created in <strong>Part 1</strong> above.</p>
<pre><code>library(dplyr)
dat &lt;- gapminder %&gt;%
  group_by(country) %&gt;%
  summarise(
    avg_change_LE = round((max(lifeExp) - min(lifeExp))/50, 1),
    avg_change_Pop = (max(pop) - min(pop))/50
  )

head(dat)
</code></pre><pre><code><span class="hljs-comment">## # A tibble: 6 x 3</span>
<span class="hljs-comment">##   country     avg_change_LE avg_change_Pop</span>
<span class="hljs-comment">##   &lt;fct&gt;               &lt;dbl&gt;          &lt;dbl&gt;</span>
<span class="hljs-comment">## 1 Afghanistan           0.3        469292.</span>
<span class="hljs-comment">## 2 Albania               0.4         46357.</span>
<span class="hljs-comment">## 3 Algeria               0.6        481074.</span>
<span class="hljs-comment">## 4 Angola                0.3        163768.</span>
<span class="hljs-comment">## 5 Argentina             0.3        448499.</span>
<span class="hljs-comment">## 6 Australia             0.2        234859.</span>
</code></pre><p>Now I write only the <code>avg_change_LE</code> and <code>avg_change_Pop</code> columns to the existing <code>Countries</code> worksheet of the workbook. First I create a new dataframe selecting only the new columns. Data is NOT joined/merged using a common field when writing to the worksheet. Therefore, the data in our dataframe and the worksheet need to follow the same order without gaps. Also make sure to specify the correct start column and row where the new columns need to be written.</p>
<p>All the material for this example is at https://github.com/geethika01/data-manipulation-with-R .</p>
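<p>A minimal, self-contained sketch of this step follows. It is an illustration under assumptions rather than the exact script from the repository: it first builds a two-row stand-in for the Part 1 workbook in a temporary file (<code>path</code>), and <code>df_new</code> holds the Albania and Algeria values from the output above.</p>
<pre><code>library(openxlsx)

# Stand-in for the Part 1 workbook: two rows plus a hyperlink formula in column C
path &lt;- tempfile(fileext = ".xlsx")
wb &lt;- createWorkbook()
addWorksheet(wb, "Countries")
writeData(wb, "Countries", data.frame(Country = c("Albania", "Algeria"),
                                      File_Name = c("Albania.pdf", "Algeria.pdf")))
writeFormula(wb, "Countries", x = 'HYPERLINK(B2, "Link to File")',
             startCol = 3, startRow = 2)
saveWorkbook(wb, path, overwrite = TRUE)

# New columns only, in the same row order as the worksheet
df_new &lt;- data.frame(avg_change_LE = c(0.4, 0.6),
                     avg_change_Pop = c(46357, 481074))

# Load the existing file and write from column D, with the header in row 1
wb &lt;- loadWorkbook(path)
writeData(wb, sheet = "Countries", x = df_new, startCol = 4, startRow = 1)
saveWorkbook(wb, path, overwrite = TRUE)
</code></pre><p>Because only columns D and E are written, the cells in columns A to C, including the hyperlink formula, are not rewritten by this step.</p>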
]]></content:encoded></item></channel></rss>