STRINGs & FILEs - continuation
================================================================================
To show that reading/writing files and string processing can be quite handy,
we will do something more practical (it will also show how fast C can be).
================================================================================

Recently, there was a major update of the human genome by the T2T Consortium.
For details see:
Nurk et al. "The complete sequence of a human genome." bioRxiv (May 27, 2021).
https://www.biorxiv.org/content/10.1101/2021.05.26.445798v1
or
http://dx.doi.org/10.1126/science.abj6987 (31 Mar 2022)

Find the repository (open the bioRxiv paper and look for the GitHub repository).
Then locate the file to download (Ctrl+f): "chm13v2.0.fa.gz" (do not download it).

To speed up the process, download ONE of the files hosted on 'duch':

ls -lah
-rw-r--r-- 1 lukaskoz iinf 727M 11-16 13:57 chm13v2.0.fa.7z
-rw-r--r-- 1 lukaskoz iinf 936M 11-16 13:56 chm13v2.0.fa.gz
-rw-r--r-- 1 lukaskoz iinf 915M 11-16 13:56 chm13v2.0.fa.tar.gz

wget https://www.mimuw.edu.pl/~lukaskoz/teaching/wi/lab7/chm13v2.0.fa.tar.gz

The file format is FASTA:
https://en.wikipedia.org/wiki/FASTA_format

WARM UP (terminal)
The file chm13v2.0.fa.gz is quite big. Uncompress it. Check how it looks.
Make sample files to start with, etc.

ls -lah chm13v2.0.fa
wc chm13v2.0.fa
less chm13v2.0.fa
head chm13v2.0.fa
tail chm13v2.0.fa
wc -l chm13v2.0.fa
grep '>' chm13v2.0.fa
grep -n '>' chm13v2.0.fa
grep '>' chm13v2.0.fa | wc
head -n 10000 chm13v2.0.fa > chm13v2.0.fa_10k_lines
etc.

Make sample files containing 1 chromosome, 2 chromosomes, the last chromosome, etc.

MAIN TASK:
Write a C program that will calculate:
(1) for the full human genome:
    - length
    - nucleotide counts and frequencies
    - GC content
    - number of chromosomes
(2) for each chromosome:
    - length
    - nucleotide counts and frequencies
    - GC content

The program should read the file line by line [use the getline() function] and
gather statistics. It should calculate the stats, show them on the screen, and
additionally store them in a TXT file.

VERSION I: only point (1) - for exemplary output see genome_stats_short.txt
VERSION II: both points (1) & (2) - for exemplary output see genome_stats_full.txt

https://www.mimuw.edu.pl/~lukaskoz/teaching/wi/lab7/genome_stats_short.txt
https://www.mimuw.edu.pl/~lukaskoz/teaching/wi/lab7/genome_stats_full.txt

Hints:
- first calculate the statistics and write them to the file (separate function: processFasta())
- then read the output file (previously known function printFileContent())
- the number of nucleotides is quite big (use 'long long' instead of 'int')

==========================================================================
Starting code, for instance:
https://pubs.opengroup.org/onlinepubs/9699919799/functions/getline.html
https://pubs.opengroup.org/onlinepubs/9699919799/functions/fgets.html
or ... the code from previous labs
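
A possible starting point for VERSION I is sketched below. It is only a
minimal sketch, not the reference solution. The GenomeStats struct and the
helpers countFasta(), totalLength(), gcContent() and printStats() are
hypothetical names introduced here. Only processFasta() and getline() come
from the hints above. Assumptions: every line starting with '>' is counted as
one chromosome, lower-case letters are counted together with upper-case ones,
and GC content is taken as (G+C)/(A+C+G+T). The exemplary output files may use
a different layout or definition.

```c
#define _POSIX_C_SOURCE 200809L   /* exposes getline() on POSIX systems */
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>            /* ssize_t */

/* Hypothetical container for whole-genome counts (not from the handout). */
typedef struct {
    long long a, c, g, t, n, other;  /* per-symbol counts */
    long long sequences;             /* number of '>' header lines */
} GenomeStats;

/* Read a FASTA stream line by line and accumulate the counts. */
GenomeStats countFasta(FILE *in)
{
    GenomeStats s = {0};
    char *line = NULL;   /* getline() allocates and grows this buffer */
    size_t cap = 0;
    ssize_t len;

    assert(in != NULL);
    while ((len = getline(&line, &cap, in)) != -1) {
        if (len > 0 && line[0] == '>') {   /* header line: a new sequence */
            s.sequences++;
            continue;
        }
        for (ssize_t i = 0; i < len; i++) {
            unsigned char ch = (unsigned char)line[i];
            if (ch == '\n' || ch == '\r')
                continue;                  /* skip line endings */
            switch (toupper(ch)) {         /* assumption: ignore letter case */
            case 'A': s.a++; break;
            case 'C': s.c++; break;
            case 'G': s.g++; break;
            case 'T': s.t++; break;
            case 'N': s.n++; break;
            default:  s.other++; break;
            }
        }
    }
    free(line);
    return s;
}

/* Total number of sequence characters (headers and newlines excluded). */
long long totalLength(const GenomeStats *s)
{
    return s->a + s->c + s->g + s->t + s->n + s->other;
}

/* GC content in percent; assumption: denominator is A+C+G+T only. */
double gcContent(const GenomeStats *s)
{
    long long acgt = s->a + s->c + s->g + s->t;
    return acgt ? 100.0 * (double)(s->g + s->c) / (double)acgt : 0.0;
}

/* Write a simple report to any stream (stdout or a TXT file). */
void printStats(FILE *out, const GenomeStats *s)
{
    long long len = totalLength(s);
    const char *names[] = {"A", "C", "G", "T", "N", "other"};
    long long counts[] = {s->a, s->c, s->g, s->t, s->n, s->other};

    fprintf(out, "Sequences (header lines): %lld\n", s->sequences);
    fprintf(out, "Genome length: %lld\n", len);
    for (int i = 0; i < 6; i++)
        fprintf(out, "%-5s %12lld  %6.2f%%\n", names[i], counts[i],
                len ? 100.0 * (double)counts[i] / (double)len : 0.0);
    fprintf(out, "GC content: %.2f%%\n", gcContent(s));
}

/* Compute the stats for fastaPath and store them in statsPath.
   Returns 0 on success, -1 if a file cannot be opened. */
int processFasta(const char *fastaPath, const char *statsPath)
{
    FILE *in = fopen(fastaPath, "r");
    if (in == NULL) { perror(fastaPath); return -1; }
    GenomeStats s = countFasta(in);
    fclose(in);

    FILE *out = fopen(statsPath, "w");
    if (out == NULL) { perror(statsPath); return -1; }
    printStats(out, &s);
    fclose(out);
    return 0;
}
```

A main() function could call, for example,
processFasta("chm13v2.0.fa_10k_lines", "genome_stats.txt") and then show the
result with printFileContent("genome_stats.txt"). The output file name here is
only an example. For VERSION II, one option is to keep a second GenomeStats
that is reported and reset each time a new '>' line is met.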