Before you can process your data with Pandas, you need to load it (from disk or remote storage). There are plenty of data formats supported by Pandas, from CSV, to JSON, to Parquet, and many others as well.

You don’t want loading the data to be slow, or use lots of memory: that’s pure overhead. Ideally you’d want a file format that’s fast, efficient, small, and broadly supported. You also want to make sure the loaded data has all the right types: numeric types, datetimes, and so on. Some data formats do a better job at this than others.

While there is no one true answer that works for everyone, this article will try to help you narrow down the field and make an informed decision.

“Best” is situation-specific

Different use cases imply different requirements.

If you need to share data with other organizations, or even other teams within your organization, you need to limit yourself to data formats you know they will be able to process. That is very situation-specific, so it’s difficult to give a universal answer.

If someone is handing you a file, they control the format. And if you will only ever process it once, changing the file format may not be worth the trouble.

If you are streaming data over the network and want to process it row-by-row as it arrives, this implies a very different data format: you want something that makes for easy row-based parsing. CSV is actually pretty good at this, even though as we’ll see it’s otherwise an annoying format to work with. If you’re using Pandas, you’re less likely to be doing this sort of processing.

The situation we’re considering: internal datasets

To help limit the scope of discussion, we’ll assume you’re using large datasets you write to disk in one internal process, then read the data later in one or more additional internal processes.

We can think about multiple criteria we’d want for a data format:
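
The point about loaded data having the right types can be made concrete with a small sketch. The column names and values below are made up for illustration and do not come from the article. The sketch writes a tiny DataFrame to CSV in memory and reads it back, once without hints and once telling Pandas which column holds dates.

```python
import io

import pandas as pd

# Hypothetical example data, invented for this sketch.
df = pd.DataFrame(
    {
        "when": pd.to_datetime(["2024-01-01", "2024-01-02"]),
        "count": [1, 2],
    }
)

# CSV stores everything as text, so type information is not saved.
buffer = io.StringIO()
df.to_csv(buffer, index=False)

# Read it back with no hints: the dates come back as plain strings.
buffer.seek(0)
naive = pd.read_csv(buffer)

# Read it back again, this time telling Pandas which column to parse as dates.
buffer.seek(0)
hinted = pd.read_csv(buffer, parse_dates=["when"])

print(df.dtypes)
print(naive.dtypes)
print(hinted.dtypes)
```

With CSV, the reader has to restate the types on every load. A format that stores type information alongside the data avoids that extra step, which is one reason the choice of format matters.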