class:inverse middle center # *Week 9 - Control flow and working with files* ---- # II: Working with files <br> <br> <br> <br> <br> ### Jelmer Poelstra ### 2021/03/11 --- ## Getting set up 1. In the shell, look for your `CSB` dir (the book's GitHub repo) or download it again: ```sh git clone https://github.com/CSB-book/CSB.git # SHELL ``` 2. Trigger an interactive Python window from a Python script (<kbd>Shift</kbd>+<kbd>Enter</kbd>). 3. Move to the `python/sandbox` dir in CSB in Python. Assuming your CSB dir is at `/fs/ess/PAS1855/users/$USER/CSB`, and your Python script in `/fs/ess/PAS1855/users/$USER/week09`: ```py import os # PYTHON os.chdir('../CSB/python/sandbox') ## Or w/ an absolute dir, replace <username> w/ real username: # os.chdir('/fs/ess/PAS1855/users/<username>/CSB/python/sandbox') ## Check your current working dir: # os.getcwd() ``` --- ## CSB 3.7: Working with files In Python, we need to *explicitly open a file* before we can interact with it. Opening a file won't load it into memory, but will create a connection to the file. For instance, to open a file for *writing* (`w`): ```python my_filehandle = open('my_new_file.txt', 'w') ``` Now, `my_filehandle` is is a file object, also called a *file handle*. <br> <figure> <p align="center"> <img src=img/file_handle_output-3.png width="65%"> <figcaption>From O’Neil 2017 - A Primer for Computational Biology</figcaption> </p> </figure> --- ## Most commonly used file modes .center[ | Shorthand | File mode |-----------|---------- | `r` | Read | `w` | Write | `a` | Append | `r+` | Read + write ] <br> ```python open('my_new_file.txt', 'w') open('some_data.txt', 'r') ``` --- ## Writing to a file To actually write to a file, we can use the `write()` method to write a string to the file, or `writelines()` to write a list to the file. - However, in either case, we need to insert the newlines ourselves with `\n`: ```python s = 'My file contents' my_filehandle.write(s + "\n") my_filehandle.writelines(["A\n", "B\n", "C\n"]) ``` -- - When we are done with the file, **we need to close it**: ```python my_filehandle.close() ``` <br> .content-box-info[ To write a list to file one line at a time, use `join()`: ```py f.write("\n".join(mylist)) ``` ] --- ## A more convenient way: `with open()` - If we use the `with open()` construct, we don't have to manually close the file – instead, the file connection is open as long as our code is *indented*: ```python with open("my_new_file.txt", "r") as my_filehandle: for my_line in my_filehandle: # Loop over lines print(my_line) # Print line back #> My file contents #> #> A #> #> B #> #> C ``` - The above example also shows that **if we loop over the file handle, we automatically process one line at a time.** - Here, we simply printed the contents of the file to screen, but notice how we ended up with *double-spacing*. --- ## More newlines - To see what's going on with the double-spacing, let's read a single line using `readline()` and store the result in a variable: ```python with open('my_new_file.txt', 'r') as my_filehandle: line1 = my_filehandle.readline() line1 #> 'My file contents\n' print(line1) #> My file contents #> ``` -- - As it turns out, reading methods also return the newline characters `\n`. - `print()` parses the newline character and because it by default always inserts a newline anyway, we end up with double-spacing. - For single-space `print()`ing, or when processing a line, we can use `strip()` to remove newlines characters (and other whitespace): ```py line1 = my_filehandle.readline().strip() ``` --- ## Progressively reading a file - When reading a single line at a time using `readline()`, **reading is done progressively**, so if we call `readline()` twice, we will read the first and *then* the second line: ```python with open('../data/Singh2015_data.csv', 'r') as input_file: line1 = input_file.readline().strip() line2 = input_file.readline().strip() print(line1) print(line2) #> Line,InfectionStatus,RecombinantFraction #> 21,I,0.1826923077 ``` -- - Or to skip a line, for instance a header, we can use `next()`: ```python input_file.next() ``` --- ## Summary of basic options to read from a file - Note that with all approaches, line breaks are present in the resulting strings as `\n`. | Approach | Result | |----------------------------|------------------| | **`for <line> in <fhandle>:`** | Loop over lines one at a time. | **`readline()`** | Return current line as a string. | **`readlines()`** | Return a list of strings, one item per line. | **`read()`** | Return entire file content as a single string. --- ## Character-delimited (tabular) files We start by reading the first few lines of a CSV (comma-separated values) file: ```python with open("../data/Dalziel2016_data.csv") as dalziel_file: for i, line in enumerate(dalziel_file): print(line.strip()) if i > 2: break #> biweek,year,loc,cases,pop #> 1,1906,BALTIMORE,NA,526822.1365 #> 2,1906,BALTIMORE,NA,526995.246 #> 3,1906,BALTIMORE,NA,527170.1981 ``` - We didn't specify an option to `open()`, so it defaults to *read* mode. - With `enumerate()`, we can track line numbers along with line contents. - We break out of the loop, i.e. we stop it from running, after we've printed the line with a line index `i` of 3 – the effect is that we only print the first four lines. --- ## Using the `csv` module While we could manually split rows by commas, a "module" that can do this for us is `csv`, which we load using an `import` statement (more next week): ```python import csv with open("../data/Dalziel2016_data.csv") as dalziel_file: # We use the function "reader" from the csv module; # Note the csv.reader = <module>.<function> syntax reader = csv.reader(dalziel_file) for i, row in enumerate(reader): print(row) if i > 2: break #> ['biweek', 'year', 'loc', 'cases', 'pop'] #> ['1', '1906', 'BALTIMORE', 'NA', '526822.1365'] #> ['2', '1906', 'BALTIMORE', 'NA', '526995.246'] #> ['3', '1906', 'BALTIMORE', 'NA', '527170.1981'] ``` The output is a list for every row, with one item per field. --- ## Using the `csv` module (cont.) The `csv.DictReader()` function will return a *dictionary* for every row, containing the *column names as the keys* and the cell values as the values: ```python with open("../data/Dalziel2016_data.csv") as dalziel_file: reader = csv.DictReader(dalziel_file) for i, row in enumerate(reader): print(dict(row)) if i > 2: break #> {'biweek': '1', 'year': '1906', 'loc': 'BALTIMORE', 'cases': 'NA', 'pop': '526822.1365'} #> {'biweek': '2', 'year': '1906', 'loc': 'BALTIMORE', 'cases': 'NA', 'pop': '526995.246'} #> {'biweek': '3', 'year': '1906', 'loc': 'BALTIMORE', 'cases': 'NA', 'pop': '527170.1981'} #> {'biweek': '4', 'year': '1906', 'loc': 'BALTIMORE', 'cases': 'NA', 'pop': '527347.0136'} ``` --- ## Using the `csv` module (cont.) Now, we can easily select rows based on their values in a specific column. For instance, to take only rows from Washington – i.e. where the column `loc` has the value "Washington": ```sh with open("../data/Dalziel2016_data.csv") as infile: reader = csv.DictReader(infile) for row in reader: if row['loc'] == "WASHINGTON": print(row) #> {'biweek': '14', 'year': '1940', 'loc': 'WASHINGTON', 'cases': '3', 'pop': '672643.7152'} #> {'biweek': '15', 'year': '1940', 'loc': 'WASHINGTON', 'cases': '3', 'pop': '673370.2202'} #> {'biweek': '16', 'year': '1940', 'loc': 'WASHINGTON', 'cases': '3', 'pop': '674095.5105'} #> ... ``` --- ## Using the `csv` module (cont.) Next, let's write the rows for Washington to a new, tab-delimited file: ```python with open("../data/Dalziel2016_data.csv") as infile: reader = csv.DictReader(infile) with open('Dalziel_Washington.csv', "w") as outfile: writer = csv.DictWriter(outfile, fieldnames = reader.fieldnames, delimiter = '\t') writer.writeheader() # Write the header row for row in reader: if row['loc'] == "WASHINGTON": writer.writerow(row) ``` - We can specify the delimiter to be a tab `\t` (default is a comma). - `writeheader()` and `writerow()` will write data from the file *as is*. --- ## <i class="fa fa-user-edit"></i> Intermezzo 3.4 Write code that prints (using the `print()` function) the values in the columns `loc` and `pop` for all rows in the file `Dalziel2016_data.csv`. --- ## <i class="fa fa-user-edit"></i> Intermezzo 3.4: solutions ```py import csv with open ("../data/Dalziel2016_data.csv") as fhandle: reader = csv.DictReader(fhandle) for row in reader: print(row["loc"], row["pop"]) ``` -- <br> .content-box-info[ Or to write this to a file: ```py with open("../data/Dalziel2016_data.csv") as infile: reader = csv.DictReader(infile) with open('Dalziel_loc-pop.csv', "w") as outfile: for row in reader: outfile.write(str(row['loc'] + ',' + row['pop'] + '\n')) ``` ] --- class: center middle inverse # Questions? ----- <br> <br> <br> <br> --- background-color: #f2f5eb ## Reading and writing - As another toy example, we can write the lines that we read from one file to another file: ```python with open('../data/Singh2015_data.csv', 'r') as input_file: with open('Singh_copy.csv', 'w') as output_file: for line in input_file: output_file.write(line) ``` - Now, we don't need to think about the `\n` characters: recall that reading methods *include* them, while writing methods "*need*" them. <br> ---- - When reading a file line-by-line, to go back to a specific line in this case the 1st line, we can use: ```python input_file.seek(0) ```