Okay, so finally we're going to look at how we can create a data frame by referencing external data—data sitting outside of this workbook. To do that, I need to start with Power Query.
You may or may not be familiar with Power Query. Power Query is a tool that exists within Excel and Power BI, and it allows us to connect to external data, whether that be Excel workbooks, CSV files, databases, web pages, and lots of different sources, and actually bring that data into Excel or Power BI to use. There are other things we can also do with Power Query, such as making recorded transformations to data, but in this case, we just want to make a connection to some data.
To use Power Query, we can go to the Data tab within the ribbon and click "Get Data" here. If you've done my Power BI or Power Query Power Pivot course, then you will have seen me always go into the Power Query editor before connecting to data, but in this case, I'm just going to create a connection. I don't need to worry about using the editor. I'm just going to go to File and then from Excel Workbook, so I'm going to be connecting to an Excel workbook.
If I click that, then I'm going to be connecting to the course data. If I go to Import, a Navigator pops up, and it's basically showing us the contents of that workbook. This little symbol signifies a worksheet, so we've got lots of different worksheets in here, and then within the worksheets, we've got tables, signified by a different symbol. We want to connect to actual tables of data. At the start of the course, we introduced a table called Orders. If I click that, we get a bit of a preview of the data. We've got options to transform the data, so we could go into the Power Query editor, but like I said, we're not going to do that. Instead, we're going to load into Excel.
We're not going to click Load here; we're going to go to the drop-down and click Load To. This enables us to define how we want this loaded into Excel. By default, it's selected as a table, so if I were to go with that, it would actually create a table of data within the worksheet. That's not what I want. If we were connecting to big data, like 2 million rows, then this just wouldn't be an option because we can't fit a 2 million-row table into an Excel worksheet.
We're always going to use "Only Create Connection." If I click that and press OK, then this Queries and Connections pane opens. We can toggle this on and off in Data here. By clicking that, it disappears. I can bring it up again, and it says we have a connection only to this Orders table. This is a query, and so Power Query has connected to that data and is kind of just holding it there. It's not actually bringing it into a worksheet, so this isn't going to bloat the file size if we were to reference lots and lots of data. It's just going to create a connection to that.
Now, what we can do from there is I can firstly close that and then go to my cell here and enter some Python code. I'm going to call this variable df_orders, and then I'm going to use the Excel function again and a quotation mark. You can see this Orders exists here, and that is the connection that we have in the Excel file via Power Query. This Excel function can take ranges, cell ranges, or references to ranges of cells. It can take tables, queries, and names. So, in this case, this is a query, and so I can press Tab there and then Ctrl + Enter. There we have our data frame.
I can click the card to inspect it, and you can see it's telling us we have 21,209 rows of data and six columns. Because there's so much data, we can't see it all here. It by default shows us the first five rows and the last five rows. If I go here and press Shift + Ctrl + Alt + M, we can actually see all the data here in the array. I can navigate to the bottom with Ctrl + Down, and you can see we get down to row 21,209. If I press Shift + Ctrl + Up, then there we've got 21,209 rows, including the header. So, you can see we very easily brought in a lot of data.
Now, if you imagine this is 10 million rows of data that we've just connected to, sitting in some database, and we've used Power Query to connect to it, we can't or don't want to actually present that data in this array because, again, there's not enough space to fit it, and it would just bloat the file. We don't want to be bringing all that data in. The power of Power Query is that it lets us just create a connection to that data, and then we can reference that connection in the Python code and show it as a Python object. This data frame could consist of 10 million rows of data—really big data. Then, we could just leverage this variable in our code to actually work with that data.
This is one of the really powerful things about Microsoft integrating Python into Excel. It allows us to bring in data into Excel and then work with it using really powerful Python calculations without having to rely on the power of your laptop or machine, and instead leverage the Microsoft cloud.
That's an introduction to data frames. Like I say, they're super important, and we're going to be using them throughout the course. We're going to be looking at how we merge them, clean them, transform them, analyze them, and then actually visualize data from them.