STRINGs & FILEs - continuation

================================================================================

To show that reading/writing files and string processing can be quite handy, 
we will do something more practical (it will also show how fast C can be).

================================================================================

Recently, there was a major update of the human genome by the T2T Consortium.

For details see:
Nurk et al. "The complete sequence of a human genome." bioRxiv (May 27, 2021).
https://www.biorxiv.org/content/10.1101/2021.05.26.445798v1 
or 
http://dx.doi.org/10.1126/science.abj6987 (31 Mar 2022)

Find the repository (open the bioRxiv paper and look for the GitHub repository).
Then locate the file to download (Ctrl+F): "chm13v2.0.fa.gz" (do not download it).

To speed up the process, let's download ONE of the files hosted on 'duch':
ls -lah
-rw-r--r-- 1 lukaskoz iinf 727M 11-16 13:57 chm13v2.0.fa.7z
-rw-r--r-- 1 lukaskoz iinf 936M 11-16 13:56 chm13v2.0.fa.gz
-rw-r--r-- 1 lukaskoz iinf 915M 11-16 13:56 chm13v2.0.fa.tar.gz

wget https://www.mimuw.edu.pl/~lukaskoz/teaching/wi/lab7/chm13v2.0.fa.tar.gz


The format of the file is FASTA
https://en.wikipedia.org/wiki/FASTA_format

WARM UP (terminal)

The downloaded archive is quite big.

Uncompress it.

Check what the file looks like. Make sample files to start with, etc.

ls -lah chm13v2.0.fa
wc chm13v2.0.fa
less chm13v2.0.fa
head chm13v2.0.fa
tail chm13v2.0.fa
wc -l chm13v2.0.fa
grep '>' chm13v2.0.fa
grep -n '>' chm13v2.0.fa
grep '>' chm13v2.0.fa | wc
head -n 10000 chm13v2.0.fa > chm13v2.0.fa_10k_lines
etc.

Make sample files containing the first chromosome, the first 2 chromosomes, 
the last chromosome, etc.

MAIN TASK:
Write a C program that will calculate:

(1) for full human genome:
- length
- nucleotide counts and frequencies
- GC content
- number of chromosomes

(2) for each chromosome:
- length
- nucleotide counts and frequencies
- GC content
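
The frequency and GC content items above come down to simple arithmetic on 
the nucleotide counts. A minimal sketch, assuming the function names 
frequency() and gcContent() (hypothetical, not part of the lab materials) 
and assuming GC content is taken relative to A+C+G+T; compare with the 
example output files to confirm which denominator is expected:

```c
/* Share of one nucleotide in percent; returns 0.0 for an empty sequence
   so that we never divide by zero. */
double frequency(long long count, long long total) {
    if (total == 0) return 0.0;
    return 100.0 * (double)count / (double)total;
}

/* GC content in percent, computed here relative to A+C+G+T.
   This denominator is an assumption; the expected output may use
   the full sequence length instead. */
double gcContent(long long a, long long c, long long g, long long t) {
    return frequency(g + c, a + c + g + t);
}
```

For example, counts A=1, C=2, G=3, T=4 give a GC content of 
100 * (3 + 2) / 10 = 50 percent.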

The program should read the file line by line [use the getline() function] and 
gather statistics. It should calculate the stats, show them on the screen, 
and additionally store them in a TXT file.
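
The line-by-line reading described above could look roughly like the sketch 
below. The helper name scanFasta() and its parameters are hypothetical; it 
only counts header lines (those starting with '>') and sequence characters, 
so the actual statistics still have to be added.

```c
#define _POSIX_C_SOURCE 200809L  /* exposes getline() on POSIX systems */
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>           /* ssize_t */

/* Hypothetical helper: reads a FASTA stream line by line and reports
   the number of header lines and the number of sequence characters. */
void scanFasta(FILE *in, long long *headers, long long *seqChars) {
    char *line = NULL;   /* getline() allocates and grows this buffer */
    size_t cap = 0;      /* current capacity of the buffer */
    ssize_t len;         /* length of the line just read, -1 at EOF */

    *headers = 0;
    *seqChars = 0;
    while ((len = getline(&line, &cap, in)) != -1) {
        /* ignore the trailing newline (and a carriage return, if any) */
        while (len > 0 && (line[len - 1] == '\n' || line[len - 1] == '\r'))
            len--;
        if (len == 0)
            continue;                /* skip empty lines */
        if (line[0] == '>')
            (*headers)++;            /* one header per sequence record */
        else
            *seqChars += len;        /* sequence characters on this line */
    }
    free(line);  /* one free after the loop; getline() reuses the buffer */
}
```

Note that the buffer is passed to getline() again on every iteration and 
freed only once, after the loop.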

VERSION I: only point (1) - for example output see genome_stats_short.txt
VERSION II: both points (1) & (2) - for example output see genome_stats_full.txt

https://www.mimuw.edu.pl/~lukaskoz/teaching/wi/lab7/genome_stats_short.txt
https://www.mimuw.edu.pl/~lukaskoz/teaching/wi/lab7/genome_stats_full.txt

Hints:
- first calculate the statistics and write them to the file (separate function: processFasta())
- then read the output file (previously known function printFileContent())
- the number of nucleotides is quite big (use 'long long' instead of 'int')
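
The hints above can be sketched as building blocks like the ones below. 
This is only a sketch under assumptions: countLine(), writeCounts() and the 
index names are hypothetical, and the signature of printFileContent() is 
guessed, so adapt it to the version from the previous labs. The counters are 
'long long' because a human genome has roughly 3 billion nucleotides, more 
than a 32-bit 'int' can hold. Lowercase letters are counted too, as a 
precaution.

```c
#include <stdio.h>

/* Index layout of the counts array (an illustrative choice). */
enum { NUC_A, NUC_C, NUC_G, NUC_T, NUC_OTHER, NUC_KINDS };

/* Adds the nucleotides of one sequence line to counts[]. */
void countLine(const char *line, long long counts[NUC_KINDS]) {
    for (const char *p = line; *p != '\0' && *p != '\n' && *p != '\r'; p++) {
        switch (*p) {
            case 'A': case 'a': counts[NUC_A]++; break;
            case 'C': case 'c': counts[NUC_C]++; break;
            case 'G': case 'g': counts[NUC_G]++; break;
            case 'T': case 't': counts[NUC_T]++; break;
            default:            counts[NUC_OTHER]++; break; /* e.g. N */
        }
    }
}

/* Writes the counts to an already opened stream; %lld matches long long. */
void writeCounts(FILE *out, const long long counts[NUC_KINDS]) {
    static const char *names[NUC_KINDS] = { "A", "C", "G", "T", "other" };
    for (int i = 0; i < NUC_KINDS; i++)
        fprintf(out, "%s: %lld\n", names[i], counts[i]);
}

/* Assumed signature: prints a text file to the screen.
   Returns 0 on success, -1 if the file cannot be opened. */
int printFileContent(const char *path) {
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;
    char buf[4096];
    while (fgets(buf, sizeof buf, f) != NULL)
        fputs(buf, stdout);
    fclose(f);
    return 0;
}
```

A processFasta() function could call countLine() for every sequence line, 
then writeCounts() on the output file, and the program would finish with 
printFileContent() on that same file.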

==========================================================================

Starting code, for instance:
https://pubs.opengroup.org/onlinepubs/9699919799/functions/getline.html
https://pubs.opengroup.org/onlinepubs/9699919799/functions/fgets.html
 
or ... the code from previous labs