Reading & Writing Data#
Storing data and the results of your calculations is important and common practice in scientific programming, as it disentangles data creation from data analysis. There are various ways to do so. Here, we specifically discuss the ASCII (TXT) and HDF5 formats.
ASCII Data#
Loading ASCII Data#
We will use NumPy’s genfromtxt() function for this:
import numpy as np
myData = np.genfromtxt(fname='./data/01_xy.dat',
                       delimiter=' ')
print( myData )
Or without defining the delimiter:
myData = np.genfromtxt('./data/01_xy.dat')
print( myData )
[[0.298495 0.535602]
[0.26463 0.450345]
[0.328381 0.364419]
...
[0.851956 0.623053]
[0.296291 0.341577]
[0.271582 0.307172]]
Unpack the columns directly into two variables:
myCol1, myCol2 = np.genfromtxt('./data/01_xy.dat',
                               unpack=True)
print( myCol1 )
print( myCol2 )
[0.298495 0.26463 0.328381 ... 0.851956 0.296291 0.271582]
[0.535602 0.450345 0.364419 ... 0.623053 0.341577 0.307172]
Or via a simple transpose():
myCol1, myCol2 = np.genfromtxt('./data/01_xy.dat').transpose()
print( myCol1 )
print( myCol2 )
[0.298495 0.26463 0.328381 ... 0.851956 0.296291 0.271582]
[0.535602 0.450345 0.364419 ... 0.623053 0.341577 0.307172]
Sometimes we want to skip a few lines at the beginning, as they might be header lines, etc.:
myCol1, myCol2 = np.genfromtxt('./data/01_xy.dat',
                               unpack=True,
                               skip_header=2)
print( myCol1 )
print( myCol2 )
[0.328381 0.189954 0.422187 ... 0.851956 0.296291 0.271582]
[0.364419 0.438772 0.442186 ... 0.623053 0.341577 0.307172]
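If the header lines start with #, genfromtxt skips them automatically, since its comments parameter defaults to '#', so skip_header is not needed. A minimal sketch using an in-memory file (the inline data here is made up for illustration):

```python
import io

import numpy as np

# hypothetical inline data standing in for a file with two '#' header lines
text = io.StringIO("# x y\n# some more header text\n1.0 2.0\n3.0 4.0\n")

# no skip_header needed: lines starting with '#' are treated as comments
col1, col2 = np.genfromtxt(text, unpack=True)
print(col1)  # [1. 3.]
print(col2)  # [2. 4.]
```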
Saving ASCII Data#
Generate some 1D test data:
testData1D = np.arange(0,20)
print( testData1D )
[ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19]
Save the 1D test data using NumPy’s savetxt() function:
np.savetxt('./data/test/01_test_data_1D.dat',
           testData1D)
… or gzip-compressed (savetxt compresses automatically when the file name ends in .gz):
np.savetxt('./data/test/01_test_data_1D.gz',
           testData1D)
… which can also be read in again:
testData1DZipped = np.genfromtxt('./data/test/01_test_data_1D.gz')
print( testData1DZipped )
[ 0. 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14. 15. 16. 17.
18. 19.]
Simple data sanity (or “data integrity”) check:
print("Difference between original and saved data: ",
      np.sum( testData1D - testData1DZipped ) )
Difference between original and saved data: 0.0
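Note that a summed difference can report 0.0 even when positive and negative deviations cancel each other out; np.allclose compares element-wise with a tolerance and is the safer check. A sketch (the temporary path here is only for illustration):

```python
import os
import tempfile

import numpy as np

a = np.array([0.1, 0.2, 0.3])

# hypothetical scratch location; any writable path works
path = os.path.join(tempfile.mkdtemp(), 'check.dat')
np.savetxt(path, a)
b = np.genfromtxt(path)

# element-wise comparison with a tolerance, instead of a summed difference
print(np.allclose(a, b))  # True
```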
This also works with 2D data:
testData2D = np.arange(0,20).reshape(10,2)
print( testData2D )
[[ 0 1]
[ 2 3]
[ 4 5]
[ 6 7]
[ 8 9]
[10 11]
[12 13]
[14 15]
[16 17]
[18 19]]
np.savetxt('./data/test/01_test_data_2D.dat',
           testData2D)
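savetxt also accepts fmt and header parameters to control the number format and to prepend a commented description line. A sketch (the file name is just an example):

```python
import numpy as np

testData2D = np.arange(0, 20).reshape(10, 2)

# fmt sets the per-value format; header is written as a first line,
# prefixed with '# ' (the default of the comments parameter)
np.savetxt('./01_test_fmt.dat', testData2D,
           fmt='%6.3f', header='col1 col2')

with open('./01_test_fmt.dat') as fid:
    print(fid.readline().rstrip())  # prints "# col1 col2"
```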
Self-Defined ASCII:#
For more information on string formatting, see the Python documentation on formatted string literals (f-strings).
# open a file for writing (replace existing)
fid = open('./data/test/01_test_data_2D.mydat', mode='w')
for col1, col2 in testData2D:
    fid.write( f" {col1:d} | {col2:d} \n" )
fid.close()
Or as floats:
# open a file for writing (replace existing)
fid = open('./data/test/01_test_data_2D.mydat', mode='w')
for col1, col2 in testData2D:
    fid.write(f" {col1:6.3f} | {col2:6.3f} \n")
fid.close()
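Such a self-defined format can be read back with genfromtxt by passing the separator as the delimiter; the surrounding spaces are handled by the float conversion. A round-trip sketch (the file name is just an example):

```python
import numpy as np

testData2D = np.arange(0, 20).reshape(10, 2)

# write the custom " value | value " format ...
with open('./01_roundtrip.mydat', 'w') as fid:
    for col1, col2 in testData2D:
        fid.write(f" {col1:6.3f} | {col2:6.3f} \n")

# ... and read it back, splitting on the '|' separator
back = np.genfromtxt('./01_roundtrip.mydat', delimiter='|')
print(back.shape)  # (10, 2)
```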
Hierarchical Data Format (HDF)#
The following is taken from the h5py documentation. Start there for further details.
HDF5 lets you store huge amounts of numerical data, and easily manipulate that data from NumPy. For example, you can slice into multi-terabyte datasets stored on disk, as if they were real NumPy arrays.
Thousands of datasets can be stored in a single file, categorized and tagged however you want.
An HDF5 file is a container for two kinds of objects: datasets, which are array-like collections of data, and groups, which are folder-like containers that hold datasets and other groups.
Note
Groups work like dictionaries, and datasets work like NumPy arrays.
We start with loading the h5py module and with creating some 2D test data:
import h5py
testData2D = np.arange(0,20).reshape(10,2)
print(testData2D)
[[ 0 1]
[ 2 3]
[ 4 5]
[ 6 7]
[ 8 9]
[10 11]
[12 13]
[14 15]
[16 17]
[18 19]]
We can write hdf5 archives with the with ... as statement and create_dataset(...):
with h5py.File('./data/test/01_test.hdf5', 'w') as myFile:
    myFile.create_dataset("myData", data=testData2D)
And we can load the data in a very similar way:
with h5py.File('./data/test/01_test.hdf5', 'r') as myFile:
    testData2DHDF5 = myFile["myData"]
    print("Difference between original and saved data: ",
          np.sum( testData2D - testData2DHDF5 ) )
Difference between original and saved data: 0
Note
We can store different kinds of data in our hdf5 archives!
testData2DA = np.arange(0,20).reshape(10,2)
testData2DB = np.arange(0,60).reshape(10,2,3)
testData2DC = np.arange(0,10)
with h5py.File('./data/test/02_test.hdf5', 'w') as myFile:
    myFile.create_dataset("myDataA", data=testData2DA)
    myFile.create_dataset("myDataB", data=testData2DB)
    myFile.create_dataset("myDataC", data=testData2DC)
For loading we can also use the handle returned by h5py.File(...).
Warning
In this case we should never forget to close the hdf5 archive again.
myFile = h5py.File('./data/test/02_test.hdf5', 'r')
print( myFile )
myFile.close()
<HDF5 file "02_test.hdf5" (mode r)>
How do we know what’s saved in our hdf5 archive?
Method A: using IPython's ! shell escape to execute commands in the terminal:
!h5dump -H ./data/test/02_test.hdf5
HDF5 "./data/test/02_test.hdf5" {
GROUP "/" {
DATASET "myDataA" {
DATATYPE H5T_STD_I64LE
DATASPACE SIMPLE { ( 10, 2 ) / ( 10, 2 ) }
}
DATASET "myDataB" {
DATATYPE H5T_STD_I64LE
DATASPACE SIMPLE { ( 10, 2, 3 ) / ( 10, 2, 3 ) }
}
DATASET "myDataC" {
DATATYPE H5T_STD_I64LE
DATASPACE SIMPLE { ( 10 ) / ( 10 ) }
}
}
}
Method B: directly within Python:
myFile = h5py.File('./data/test/02_test.hdf5', 'r')
print( myFile.keys() )
myFile.close()
<KeysViewHDF5 ['myDataA', 'myDataB', 'myDataC']>
As soon as we know the keys, we can access the data directly:
myFile = h5py.File('./data/test/02_test.hdf5', 'r')
print( myFile['myDataA'] )
print( myFile.get('myDataA') )
myFile.close()
<HDF5 dataset "myDataA": shape (10, 2), type "<i8">
<HDF5 dataset "myDataA": shape (10, 2), type "<i8">
myFile = h5py.File('./data/test/02_test.hdf5', 'r')
print( myFile['myDataA'] )
print( myFile['myDataB'] )
print( myFile['myDataC'] )
myFile.close()
<HDF5 dataset "myDataA": shape (10, 2), type "<i8">
<HDF5 dataset "myDataB": shape (10, 2, 3), type "<i8">
<HDF5 dataset "myDataC": shape (10,), type "<i8">
Ok, but how do we really get the data?
myFile = h5py.File('./data/test/02_test.hdf5', 'r')
print( np.array(myFile['myDataA']) )
print( np.array(myFile['myDataB']) )
print( np.array(myFile['myDataC']) )
myFile.close()
[[ 0 1]
[ 2 3]
[ 4 5]
[ 6 7]
[ 8 9]
[10 11]
[12 13]
[14 15]
[16 17]
[18 19]]
[[[ 0 1 2]
[ 3 4 5]]
[[ 6 7 8]
[ 9 10 11]]
[[12 13 14]
[15 16 17]]
[[18 19 20]
[21 22 23]]
[[24 25 26]
[27 28 29]]
[[30 31 32]
[33 34 35]]
[[36 37 38]
[39 40 41]]
[[42 43 44]
[45 46 47]]
[[48 49 50]
[51 52 53]]
[[54 55 56]
[57 58 59]]]
[0 1 2 3 4 5 6 7 8 9]
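Instead of converting a whole dataset with np.array, you can also slice an h5py dataset directly; only the requested part is read from disk, which is what makes working with very large files practical. A sketch (the file name is just an example):

```python
import h5py
import numpy as np

# create a small archive to slice into
with h5py.File('./02_slice_demo.hdf5', 'w') as myFile:
    myFile.create_dataset('myData', data=np.arange(0, 60).reshape(10, 2, 3))

with h5py.File('./02_slice_demo.hdf5', 'r') as myFile:
    # slicing reads only these rows from disk and returns a NumPy array
    firstRows = myFile['myData'][0:2]
    # '[...]' reads everything, equivalent to np.array(...)
    everything = myFile['myData'][...]

print(firstRows.shape)   # (2, 2, 3)
print(everything.shape)  # (10, 2, 3)
```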
Let’s have a look at another example using create_group(), i.e. the “sub-directories”:
# generate some test data
testData2DA = np.arange(0,20).reshape(10,2)
testData2DB = np.arange(0,60).reshape(10,2,3)
testData2DC = np.arange(0,10)
with h5py.File('./data/test/03_test.hdf5', 'w') as myFile:
    g1 = myFile.create_group('group1')
    g1.create_dataset("myDataA", data=testData2DA)
    g1.create_dataset("myDataB", data=testData2DB)
    g2 = myFile.create_group('group2')
    g2.create_dataset("myDataC", data=testData2DC)
!h5dump -H ./data/test/03_test.hdf5
HDF5 "./data/test/03_test.hdf5" {
GROUP "/" {
GROUP "group1" {
DATASET "myDataA" {
DATATYPE H5T_STD_I64LE
DATASPACE SIMPLE { ( 10, 2 ) / ( 10, 2 ) }
}
DATASET "myDataB" {
DATATYPE H5T_STD_I64LE
DATASPACE SIMPLE { ( 10, 2, 3 ) / ( 10, 2, 3 ) }
}
}
GROUP "group2" {
DATASET "myDataC" {
DATATYPE H5T_STD_I64LE
DATASPACE SIMPLE { ( 10 ) / ( 10 ) }
}
}
}
}
myFile = h5py.File('./data/test/03_test.hdf5', 'r')
print( myFile.keys() )
print( myFile['group1/myDataA'] )
print( myFile['group1/myDataB'] )
print( myFile['group2/myDataC'] )
myFile.close() # don't forget to close!
<KeysViewHDF5 ['group1', 'group2']>
<HDF5 dataset "myDataA": shape (10, 2), type "<i8">
<HDF5 dataset "myDataB": shape (10, 2, 3), type "<i8">
<HDF5 dataset "myDataC": shape (10,), type "<i8">
with h5py.File('./data/test/03_test.hdf5', 'r') as myFile:
    # get keys
    print( 'keys:', myFile.keys() )
    print( 'group 1 keys:', myFile['group1'].keys() )
    print( 'group 2 keys:', myFile['group2'].keys() )
    # access data
    print( myFile['group1/myDataA'] )
    print( myFile['group1/myDataB'] )
    print( myFile['group2']['myDataC'] ) # note the difference
    myDataA = np.array(myFile['group1']['myDataA'])
    myDataB = np.array(myFile['group1']['myDataB'])
    myDataC = np.array(myFile['group2']['myDataC'])
keys: <KeysViewHDF5 ['group1', 'group2']>
group 1 keys: <KeysViewHDF5 ['myDataA', 'myDataB']>
group 2 keys: <KeysViewHDF5 ['myDataC']>
<HDF5 dataset "myDataA": shape (10, 2), type "<i8">
<HDF5 dataset "myDataB": shape (10, 2, 3), type "<i8">
<HDF5 dataset "myDataC": shape (10,), type "<i8">
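For nested groups, h5py also offers visititems(), which calls a function once for every group and dataset in the file, so you can list an archive's full contents without knowing its keys in advance. A sketch (the file name is just an example):

```python
import h5py
import numpy as np

# create a small archive with one group and one dataset
with h5py.File('./03_visit_demo.hdf5', 'w') as myFile:
    g1 = myFile.create_group('group1')
    g1.create_dataset('myDataA', data=np.arange(4))

def show(name, obj):
    # called for every group and dataset; 'name' is the path inside the file
    print(name, '->', type(obj).__name__)

with h5py.File('./03_visit_demo.hdf5', 'r') as myFile:
    myFile.visititems(show)
# prints: group1 -> Group, then group1/myDataA -> Dataset
```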
Warning
Depending on the chosen hdf5 “driver”, your data might not be available anymore after closing the file!
with h5py.File('./data/test/03_test.hdf5', 'r') as myFile:
    myDataA = myFile['group1']['myDataA']
print(myDataA)
<Closed HDF5 dataset>
One way to solve this is to convert your hdf5 “pointer” to an actual NumPy array:
with h5py.File('./data/test/03_test.hdf5', 'r') as myFile:
    myDataA = np.array( myFile['group1']['myDataA'] )
print(myDataA)
[[ 0 1]
[ 2 3]
[ 4 5]
[ 6 7]
[ 8 9]
[10 11]
[12 13]
[14 15]
[16 17]
[18 19]]
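hdf5 files, groups, and datasets can also carry metadata via their attrs, which works like a dictionary and is handy for storing units or provenance right next to the data. A sketch (the file and attribute names are made up for illustration):

```python
import h5py
import numpy as np

with h5py.File('./03_attrs_demo.hdf5', 'w') as myFile:
    ds = myFile.create_dataset('myData', data=np.arange(10))
    ds.attrs['units'] = 'seconds'              # hypothetical metadata
    myFile.attrs['description'] = 'attrs demo' # file-level metadata

with h5py.File('./03_attrs_demo.hdf5', 'r') as myFile:
    print(myFile['myData'].attrs['units'])  # seconds
    print(myFile.attrs['description'])      # attrs demo
```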
Other Formats#
Numpy Arrays:#
see also:
https://docs.scipy.org/doc/numpy/reference/generated/numpy.save.html
https://docs.scipy.org/doc/numpy/reference/generated/numpy.load.html
# generate some 2D test data
testData2D = np.arange(0,20).reshape(10,2)
# save it
np.save( './data/test/01_test_data_2D.npy', testData2D )
# and load it again
testData2DNumpy = np.load( './data/test/01_test_data_2D.npy' )
# compare it
print( "Difference between original and saved data: ", np.sum( testData2D - testData2DNumpy ) )
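Several arrays can go into a single file with np.savez (or np.savez_compressed); the keyword names become the keys of the loaded archive. A sketch (the file name is just an example):

```python
import numpy as np

a = np.arange(0, 20).reshape(10, 2)
b = np.arange(0, 10)

# keyword names become the keys inside the .npz archive
np.savez('./01_test_multi.npz', myDataA=a, myDataB=b)

archive = np.load('./01_test_multi.npz')
print(sorted(archive.files))     # ['myDataA', 'myDataB']
print(archive['myDataA'].shape)  # (10, 2)
```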
Pickle:#
# load pickle module
import pickle
# generate some 2D test data
testData2D = np.arange(0,20).reshape(10,2)
# save it (the with statement ensures the file is closed again)
with open('./data/test/01_test_data_2D.pickle', 'wb') as fid:
    pickle.dump( testData2D, fid )
# and load it again
with open('./data/test/01_test_data_2D.pickle', 'rb') as fid:
    testData2DPickle = pickle.load( fid )
# compare it
print( "Difference between original and saved data: ", np.sum( testData2D - testData2DPickle ) )