Excel files default to the *.xls
or *.xlsx
file formats (depending on the Excel version you’re using.) While this works great if you’re opening and re-opening the files in Excel, it might cause problems if you’re trying to open the files in another software program. When Excel saves a file to these formats, it includes special formatting that may not be able to be interpreted by another program. Therefore, when saving your data, especially if you’re going to use it in another program, it’s best to EXPORT the data into another file format.
This is even helpful if you’re using multiple versions of Excel. The formatting Excel has used over time has changed. And some old Excel files, opened in a new version of Excel, (or vice versa), aren’t interpreted properly and the data can be compromised.
One great format to use is the universal, open, and static format of tab- or common-delimited. These files are plain text files that separate the data from each column using a specific delimiter (in this case, a tab character or a comma). A *.csv
file can be imported or opened in a large variety of software programs, including Excel, R, SPSS, and even just a text editor.
To save a file you have opened in Excel into the *.csv
format:
*.csv
).By default, most coding and statistical environments expect UNIX-style line endings (\n
) as representing line breaks. However, Windows uses an alternate line ending signifier (\r\n
) by default for legacy compatibility with Teletype-based systems. As such, when exporting to CSV using Excel, your data will look like this:
data1,data2\r\n1,2\r\n4,5\r\n…
which, upon passing into most environments (which split on \n
), will parse as:
data1
data2\r
1
2\r
…
thus causing terrible things to happen to your data.
For example, 2\r
is not a valid integer, and thus will throw an error (if you’re lucky) when you attempt to operate on it in R or Python. Note that this happens on Excel for OSX as well as Windows, due to legacy Windows compatibility.
There are a handful of solutions for enforcing uniform UNIX-style line endings on your exported CSVs:
.git/config
file in your repository to automatically translate \r\n
line endings into \n
.Add the follwing to the file (see the detailed tutorial):
[filter "cr"]
clean = LC_CTYPE=C awk '{printf(\"%s\\n\", $0)}' | LC_CTYPE=C tr '\\r' '\\n'
smudge = tr '\\n' '\\r'`
and then create a file .gitattributes
that contains the line:
Previous: Basic quality control and data manipulation in spreadsheets. Next: Caveats of popular data and file formats.